Hello, I am using PyArrow and encountering an OOM issue when reading a Parquet file. My end goal is to sample just a few rows (~5 rows) from any Parquet file and estimate the in-memory size of the whole file from the sampled rows.
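For context, here is a minimal sketch of the kind of code involved (the file path is a placeholder, and the extrapolation step at the end is just illustrative):

```python
import pyarrow.dataset as ds

# Placeholder path; in practice this is a ~2GB Parquet file.
dataset = ds.dataset("data/large_file.parquet", format="parquet")

# Sample a handful of rows. We tried both of these:
sample = dataset.head(num_rows=5, batch_size=5)
# sample = next(iter(dataset.to_batches(batch_size=5)))

# Illustrative extrapolation: per-row size from the sample,
# scaled by the total row count of the dataset.
bytes_per_row = sample.nbytes / sample.num_rows
estimated_total_bytes = bytes_per_row * dataset.count_rows()
```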
We tried the following approaches:

* `to_batches(batch_size=5)` - https://arrow.apache.org/docs/python/generated/pyarrow.dataset.FileSystemDataset.html#pyarrow.dataset.FileSystemDataset.to_batches
* `head(num_rows=5, batch_size=5)` - https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.head

With both approaches, we hit OOM while reading just 5 rows several times from a ~2GB Parquet file. We then tried `to_batches(batch_size=100000)`, and that works fine with no OOM issue.

I am confused and want to understand the underlying behavior of the C++ Arrow Parquet reader when batch_size is set to a small value. My guess is that there is some per-batch overhead that grows sharply as batch_size shrinks.

Thanks,
Cheng Su
