Hey Weston,

Thanks for the suggestion! I will give it a try.
Best,
Haocheng

On Thu, Mar 2, 2023 at 2:29 PM Weston Pace <[email protected]> wrote:

> If you are working in C++ there are a few interfaces you might be
> interested in.
>
> The simplest high-level API for this would be to use Acero and create a
> write node. This is what pyarrow uses (though a little indirectly at the
> moment). There is a brief example here [1]. I'd be happy to answer
> specific questions too. The input to Acero needs to be a stream of record
> batches. You could wrap your bespoke reader in a RecordBatchReader and
> then use "record_batch_reader_source". Putting it all together, you would
> get something that looks like this (in pseudocode):
>
> ```
> BespokeReader reader = OpenBespokeReaderForGiantFile(...);
> RecordBatchReader rb_reader = reader.ToRecordBatchReader();
> RecordBatchReaderSourceNodeOptions source_options{rb_reader};
> // This is where you specify which columns to partition on and specify
> // that you want to use hive-style partitioning
> FileSystemDatasetWriteOptions write_options = CreateWriteOptions(...);
> WriteNodeOptions write_node_options(write_options);
> Declaration plan = Declaration::Sequence({
>     {"record_batch_reader_source", source_options},
>     {"write", write_node_options}
> });
> Status final_result = DeclarationToStatus(plan);
> ```
>
> Note, the above assumes you are using Arrow 11.0.0. The dataset
> writer [2] is probably the core component for writing data to a dataset,
> so if you want to bypass Acero you could use it directly. However, the
> partitioning logic happens in the write node (and not the dataset writer)
> today, so you would need to duplicate that logic.
>
> [1] https://github.com/apache/arrow/blob/main/cpp/examples/arrow/execution_plan_documentation_examples.cc#L647
> [2] https://github.com/apache/arrow/blob/main/cpp/src/arrow/dataset/dataset_writer.h
>
> On Wed, Mar 1, 2023 at 2:57 PM Haocheng Liu <[email protected]> wrote:
>
>> Hi Arrow community,
>>
>> Hope this email finds you well. I'm working on a project to convert a
>> bespoke format into Parquet, where each file contains time-series data
>> and can be tens of gigabytes on a daily basis.
>>
>> I've successfully created a binary with parquet::StreamWriter to convert
>> the file into one big Parquet file. Next I would like to 1) break it into
>> smaller files - let's say one hour per sub-file - and 2) store them in a
>> hive-style layout in *C++*. From the official docs
>> <https://arrow.apache.org/docs/cpp/tutorials/datasets_tutorial.html>
>> I failed to find related information. Can folks please point me to the
>> relevant docs, or confirm whether this is doable right now in C++?
>>
>> Best regards,
>> Haocheng Liu
>>
>> --
>> Best regards
>>
>
--
Best regards
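
As a supplementary illustration (not part of the original exchange), here is a minimal sketch of what the CreateWriteOptions(...) step in the pseudocode above might look like. It assumes Arrow 11.x built with Parquet and dataset support, output to the local filesystem, and two hypothetical partition columns named "date" and "hour"; the column names and output path are placeholders, and the partition columns must actually be present in the record batches produced by the source.

```
// Minimal sketch of a CreateWriteOptions() helper (not from the original
// thread). "date" and "hour" are hypothetical partition columns and
// "/tmp/converted" is a placeholder output directory.
#include <memory>

#include <arrow/api.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/dataset/partition.h>
#include <arrow/filesystem/localfs.h>

namespace ds = arrow::dataset;

ds::FileSystemDatasetWriteOptions CreateWriteOptions() {
  ds::FileSystemDatasetWriteOptions write_options;

  // Emit Parquet files using the format's default writer properties.
  auto format = std::make_shared<ds::ParquetFileFormat>();
  write_options.file_write_options = format->DefaultWriteOptions();

  // Destination: a directory on the local filesystem.
  write_options.filesystem = std::make_shared<arrow::fs::LocalFileSystem>();
  write_options.base_dir = "/tmp/converted";

  // Hive-style partitioning: rows are routed to
  // <base_dir>/date=.../hour=.../part-N.parquet based on these columns,
  // which must exist in the batches flowing into the write node.
  auto partition_schema = arrow::schema({arrow::field("date", arrow::utf8()),
                                         arrow::field("hour", arrow::int32())});
  write_options.partitioning =
      std::make_shared<ds::HivePartitioning>(partition_schema);

  write_options.basename_template = "part-{i}.parquet";
  return write_options;
}
```

If the bespoke data only carries a raw timestamp column, one option could be to derive the date/hour columns either inside the RecordBatchReader wrapper or via a "project" node placed between "record_batch_reader_source" and "write", so that the write node has the partition columns to route on.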
