kylebarron opened a new issue, #5356: URL: https://github.com/apache/arrow-rs/issues/5356
**Which part is this question about**

Parquet record batch reader row group size

**Describe your question**

I have a use case where the size of each RecordBatch is chosen very intentionally by the writer (pyarrow). In this case, pyarrow writes each Arrow RecordBatch to a single Parquet row group. But it appears impossible with the `parquet` crate to recreate a sequence of RecordBatches with the same original row groups. In particular, [`with_batch_size`](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_batch_size) defaults to `1024` if unset, and there appears to be no way to match the row group size. Coming from other Parquet implementations like pyarrow and arrow2, it is surprising that this option is missing.

**Additional context**

I asked this [here on discord](https://discord.com/channels/885562378132000778/885562378132000781/1193645000957886494) but didn't really get a satisfactory answer. I understand that for e.g. DataFusion use cases it's very valuable to push down row filtering to the page level, but I feel there should be _some way_ to recreate the original Arrow batches.

Related to this issue on the writing side: https://github.com/apache/arrow-rs/issues/5004
