Hey Weston,

Thanks for the suggestion! I will give it a try.

Best,
Haocheng

On Thu, Mar 2, 2023 at 2:29 PM Weston Pace <[email protected]> wrote:

> If you are working in C++ there are a few interfaces you might be
> interested in.
>
> The simplest high-level API for this would be to use Acero and create a
> write node.  This is what pyarrow uses (though a little indirectly at the
> moment).  There is a brief example here[1].  I'd be happy to answer
> specific questions too.  The input to Acero needs to be a stream of record
> batches.  You could wrap your bespoke reader in a RecordBatchReader and
> then use "record_batch_reader_source".  Putting it all together, you would
> get something that looks like this (in pseudocode):
>
> ```
> // Pseudocode: BespokeReader, OpenBespokeReaderForGiantFile, and
> // CreateWriteOptions are placeholders for your own code.
> std::shared_ptr<BespokeReader> reader = OpenBespokeReaderForGiantFile(...);
> std::shared_ptr<arrow::RecordBatchReader> rb_reader =
>     reader->ToRecordBatchReader();
> arrow::compute::RecordBatchReaderSourceNodeOptions source_options{rb_reader};
> // This is where you specify which columns to partition on and that you
> // want to use hive-style partitioning.
> arrow::dataset::FileSystemDatasetWriteOptions write_options =
>     CreateWriteOptions(...);
> arrow::dataset::WriteNodeOptions write_node_options(write_options);
> arrow::compute::Declaration plan = arrow::compute::Declaration::Sequence({
>     {"record_batch_reader_source", std::move(source_options)},
>     {"write", std::move(write_node_options)}
> });
> arrow::Status final_result =
>     arrow::compute::DeclarationToStatus(std::move(plan));
> ```
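>
> If your bespoke reader doesn't already expose a RecordBatchReader, the
> wrapping step could look roughly like the following. This is a minimal
> sketch, assuming your reader can hand back one record batch per chunk;
> BespokeReader, AtEnd, and NextChunkAsRecordBatch are hypothetical
> placeholders:
>
> ```
> #include "arrow/record_batch.h"
> #include "arrow/result.h"
>
> // Adapts the bespoke reader to Arrow's streaming interface so it can
> // feed a "record_batch_reader_source" node.
> class BespokeRecordBatchReader : public arrow::RecordBatchReader {
>  public:
>   BespokeRecordBatchReader(std::shared_ptr<BespokeReader> reader,
>                            std::shared_ptr<arrow::Schema> schema)
>       : reader_(std::move(reader)), schema_(std::move(schema)) {}
>
>   std::shared_ptr<arrow::Schema> schema() const override { return schema_; }
>
>   arrow::Status ReadNext(std::shared_ptr<arrow::RecordBatch>* batch) override {
>     if (reader_->AtEnd()) {
>       *batch = nullptr;  // Signals end-of-stream to the consumer.
>       return arrow::Status::OK();
>     }
>     // Convert the next chunk of the bespoke file into a record batch.
>     ARROW_ASSIGN_OR_RAISE(*batch, reader_->NextChunkAsRecordBatch());
>     return arrow::Status::OK();
>   }
>
>  private:
>   std::shared_ptr<BespokeReader> reader_;
>   std::shared_ptr<arrow::Schema> schema_;
> };
> ```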
>
> Note that the above assumes you are using Arrow 11.0.0.  The dataset
> writer[2] is probably the core component for writing data to a dataset, so
> if you want to bypass Acero you could use it directly.  However, the
> partitioning logic happens in the write node (not the dataset writer)
> today, so you would need to duplicate that logic.
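>
> For completeness, here is a rough sketch of what the CreateWriteOptions(...)
> placeholder above might fill in, assuming you materialize year/month/day/hour
> partition columns in each batch. The column names, base directory, and use of
> the local filesystem are all assumptions for illustration:
>
> ```
> #include "arrow/api.h"
> #include "arrow/dataset/file_base.h"
> #include "arrow/dataset/file_parquet.h"
> #include "arrow/dataset/partition.h"
> #include "arrow/filesystem/localfs.h"
>
> arrow::dataset::FileSystemDatasetWriteOptions CreateWriteOptions() {
>   arrow::dataset::FileSystemDatasetWriteOptions write_options;
>   // Write Parquet files with the format's default settings.
>   auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
>   write_options.file_write_options = format->DefaultWriteOptions();
>   write_options.filesystem = std::make_shared<arrow::fs::LocalFileSystem>();
>   write_options.base_dir = "/data/converted";
>   // Hive-style directories, e.g. .../year=2023/month=3/day=1/hour=14/
>   write_options.partitioning =
>       std::make_shared<arrow::dataset::HivePartitioning>(
>           arrow::schema({arrow::field("year", arrow::int32()),
>                          arrow::field("month", arrow::int32()),
>                          arrow::field("day", arrow::int32()),
>                          arrow::field("hour", arrow::int32())}));
>   write_options.basename_template = "part-{i}.parquet";
>   return write_options;
> }
> ```
>
> With hour in the partition schema, each hour of data lands in its own
> directory, which matches your one-file-per-hour goal.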
>
> [1]
> https://github.com/apache/arrow/blob/main/cpp/examples/arrow/execution_plan_documentation_examples.cc#L647
> [2]
> https://github.com/apache/arrow/blob/main/cpp/src/arrow/dataset/dataset_writer.h
>
> On Wed, Mar 1, 2023 at 2:57 PM Haocheng Liu <[email protected]> wrote:
>
>> Hi Arrow community,
>>
>> Hope this email finds you well. I'm working on a project to convert
>> a bespoke format into the Parquet format, where each file contains time
>> series data and can be tens of gigabytes in size on a daily basis.
>>
>> I've successfully created a binary with parquet::StreamWriter to
>> convert the file into one big Parquet file.
>> Next I would like to 1) break it into smaller files, say one hour of data
>> per sub-file, and 2) store them in a hive-style layout, in *C++*. I failed
>> to find related information in the official docs
>> <https://arrow.apache.org/docs/cpp/tutorials/datasets_tutorial.html>.
>> Can folks please point me to the relevant docs, or let me know whether
>> this is currently doable in C++?
>>
>> Best regards
>> Haocheng Liu
>>
>>
>
