If you are working in C++, there are a few interfaces you might be
interested in.

The simplest high-level API for this would be to use Acero and create a
write node.  This is what pyarrow uses (though a little indirectly at the
moment).  There is a brief example here [1].  I'd be happy to answer
specific questions too.  The input to Acero needs to be a stream of record
batches.  You could wrap your bespoke reader in a RecordBatchReader and
then use "record_batch_reader_source".  Putting it all together, you would
get something that looks like this (in pseudocode):

```
// Pseudocode: BespokeReader, OpenBespokeReaderForGiantFile, and
// CreateWriteOptions are placeholders for your own code.
BespokeReader reader = OpenBespokeReaderForGiantFile(...);
std::shared_ptr<arrow::RecordBatchReader> rb_reader = reader.ToRecordBatchReader();
arrow::compute::RecordBatchReaderSourceNodeOptions source_options{rb_reader};
// This is where you specify which columns to partition on and specify
// that you want to use hive-style partitioning
arrow::dataset::FileSystemDatasetWriteOptions write_options = CreateWriteOptions(...);
arrow::dataset::WriteNodeOptions write_node_options(write_options);
arrow::compute::Declaration plan = arrow::compute::Declaration::Sequence({
  {"record_batch_reader_source", source_options},
  {"write", write_node_options}
});
arrow::Status final_result = arrow::compute::DeclarationToStatus(std::move(plan));
```
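
The ToRecordBatchReader() call above is hand-waving.  One way to bridge
that gap (just a sketch; BespokeReader, AtEnd(), and
NextChunkAsRecordBatch() are stand-ins for whatever your reader actually
exposes) is to subclass arrow::RecordBatchReader and implement schema()
and ReadNext():

```
#include <memory>

#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <arrow/type.h>

// Adapter sketch: BespokeReader, AtEnd(), and NextChunkAsRecordBatch()
// are hypothetical stand-ins for your own format's reader.
class BespokeRecordBatchReader : public arrow::RecordBatchReader {
 public:
  BespokeRecordBatchReader(BespokeReader reader,
                           std::shared_ptr<arrow::Schema> schema)
      : reader_(std::move(reader)), schema_(std::move(schema)) {}

  std::shared_ptr<arrow::Schema> schema() const override { return schema_; }

  arrow::Status ReadNext(std::shared_ptr<arrow::RecordBatch>* batch) override {
    if (reader_.AtEnd()) {
      // Signal end-of-stream with a null batch.
      *batch = nullptr;
      return arrow::Status::OK();
    }
    // Convert the next chunk of the bespoke file into an Arrow RecordBatch.
    ARROW_ASSIGN_OR_RAISE(*batch, reader_.NextChunkAsRecordBatch(schema_));
    return arrow::Status::OK();
  }

 private:
  BespokeReader reader_;
  std::shared_ptr<arrow::Schema> schema_;
};
```

The source node pulls batches through ReadNext() one at a time, so the
multi-gigabyte file never has to be fully materialized in memory.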

Note that the above assumes you are using Arrow 11.0.0.  The dataset
writer [2] is probably the core component for writing data to a dataset, so
if you want to bypass Acero you could use it directly.  However, the
partitioning logic happens in the write node (and not in the dataset
writer) today, so you would need to duplicate that logic.
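
As for the CreateWriteOptions(...) placeholder in the sketch above, it
might look roughly like the following (just an illustration: the local
filesystem, output path, and date/hour partition columns are example
choices, not requirements):

```
#include <memory>

#include <arrow/api.h>
#include <arrow/dataset/file_base.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/dataset/partition.h>
#include <arrow/filesystem/localfs.h>

arrow::dataset::FileSystemDatasetWriteOptions CreateWriteOptions() {
  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();

  arrow::dataset::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = std::make_shared<arrow::fs::LocalFileSystem>();
  write_options.base_dir = "/path/to/output";  // example output directory
  write_options.basename_template = "part-{i}.parquet";
  // Hive-style partitioning on the listed columns, producing paths like
  // /path/to/output/date=2023-03-01/hour=13/part-0.parquet
  auto partition_schema = arrow::schema({arrow::field("date", arrow::utf8()),
                                         arrow::field("hour", arrow::int32())});
  write_options.partitioning =
      std::make_shared<arrow::dataset::HivePartitioning>(partition_schema);
  return write_options;
}
```

One caveat: the partition columns (date/hour here) need to actually be
present in the record batches coming out of your reader, so you would
either emit them from the bespoke reader itself or add them with a
project node ahead of the write node.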

[1] https://github.com/apache/arrow/blob/main/cpp/examples/arrow/execution_plan_documentation_examples.cc#L647
[2] https://github.com/apache/arrow/blob/main/cpp/src/arrow/dataset/dataset_writer.h

On Wed, Mar 1, 2023 at 2:57 PM Haocheng Liu <[email protected]> wrote:

> Hi Arrow community,
>
> Hope this email finds you well. I'm working on a project to convert
> a bespoke format into parquet format, where each file contains time series
> data and can be tens of gigabytes large on a daily basis.
>
> I've successfully created a binary with parquet::StreamingWriter to
> convert the file to one big parquet file.
> Next I would like to 1) break it into small files - let's say 1 hour per
> sub file - and 2) store them in a hive-style manner in *C++*. From the
> official docs
> <https://arrow.apache.org/docs/cpp/tutorials/datasets_tutorial.html>
> I failed to find related information. Can folks please guide where the
> docs are or if it's doable right now in C++?
>
> Best regards
> Haocheng Liu
>
>
> --
> Best regards
>
