kylebarron opened a new issue, #5356: URL: https://github.com/apache/arrow-rs/issues/5356
**Which part is this question about**

Parquet record batch reader row group size

**Describe your question**

I have a use case where the size of each RecordBatch is chosen very intentionally by the writer (pyarrow). In this case, pyarrow writes each Arrow RecordBatch to a single Parquet row group. But it appears impossible with the `parquet` crate to recreate a sequence of RecordBatches with the same original row groups. In particular, [`with_batch_size`](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_batch_size) defaults to `1024` if unset, and there appears to be no way to match the row group size. Coming from other Parquet implementations like pyarrow and arrow2, it is surprising that this option is missing.

**Additional context**

I asked this [here on discord](https://discord.com/channels/885562378132000778/885562378132000781/1193645000957886494) but didn't really get a satisfactory answer. I understand that for e.g. DataFusion use cases it's very valuable to push down row filtering to the page level, but I feel there should be _some way_ to recreate the original Arrow batches.

Related to this issue on the writing side: https://github.com/apache/arrow-rs/issues/5004
