Hello,

I am using PyArrow, and encountering an OOM issue when reading the Parquet
file. My end goal is to sample just a few rows (~5 rows) from any Parquet
file, to estimate in-memory data size of the whole file, based on sampled
rows.

We tried the following approaches (a rough sketch of the code follows the list):
* `to_batches(batch_size=5)` -
https://arrow.apache.org/docs/python/generated/pyarrow.dataset.FileSystemDataset.html#pyarrow.dataset.FileSystemDataset.to_batches
* `head(num_rows=5, batch_size=5)` -
https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.head
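
Roughly, the sampling code for both approaches looked like the sketch below (the file path is a placeholder, not our actual dataset):

import pyarrow.dataset as ds

# Placeholder path; the real file is a ~2GB Parquet file.
dataset = ds.dataset("/path/to/data.parquet", format="parquet")

# Approach 1: iterate record batches with a tiny batch_size and stop after one batch.
batches = dataset.to_batches(batch_size=5)
first_batch = next(iter(batches))
print(first_batch.num_rows)  # expected: 5

# Approach 2: take the first 5 rows directly.
table = dataset.head(5, batch_size=5)
print(table.num_rows)  # expected: 5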

But with both approaches, we encountered OOM issues when reading just 5
rows several times from a ~2GB Parquet file. Then we tried
`to_batches(batch_size=100000)`, and that works fine without any OOM issue.
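
For comparison, this is the variant that did not OOM in our runs; we still only materialize the first batch and slice 5 rows from it (same placeholder dataset as above):

# Larger batch_size, then slice the first 5 rows from the first batch.
batches = dataset.to_batches(batch_size=100000)
first_batch = next(iter(batches))
sample = first_batch.slice(0, 5)
print(sample.num_rows)  # expected: 5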

I am confused and would like to understand the underlying behavior of the
C++ Arrow Parquet reader when batch_size is set to a small value. I guess
there might be some disproportionate overhead associated with batch_size
when its value is small.

Thanks,
Cheng Su
