Hello,

I am using PyArrow, and encountering an OOM issue when reading the Parquet
file. My end goal is to sample just a few rows (~5 rows) from any Parquet
file, to estimate in-memory data size of the whole file, based on sampled
rows.

We tried the following approaches (a rough sketch of the code follows the list):
* `to_batches(batch_size=5)` -
https://arrow.apache.org/docs/python/generated/pyarrow.dataset.FileSystemDataset.html#pyarrow.dataset.FileSystemDataset.to_batches
* `head(num_rows=5, batch_size=5)` -
https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.head
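
Roughly, the sampling code for both approaches looked like the sketch below (the file path is a placeholder, not our actual dataset):

import pyarrow.dataset as ds

# Placeholder path; the real file is a ~2GB Parquet file.
dataset = ds.dataset("/path/to/data.parquet", format="parquet")

# Approach 1: iterate record batches with a tiny batch_size and stop after one batch.
batches = dataset.to_batches(batch_size=5)
first_batch = next(iter(batches))
print(first_batch.num_rows)  # expected: 5

# Approach 2: take the first 5 rows directly.
table = dataset.head(5, batch_size=5)
print(table.num_rows)  # expected: 5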

But with both approaches, we encountered OOM issues when reading just 5
rows several times from a ~2GB Parquet file. Then we tried
`to_batches(batch_size=100000)`, and that works fine without any OOM issue.
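
For comparison, this is the variant that did not OOM in our runs; we still only materialize the first batch and slice 5 rows from it (same placeholder dataset as above):

# Larger batch_size, then slice the first 5 rows from the first batch.
batches = dataset.to_batches(batch_size=100000)
first_batch = next(iter(batches))
sample = first_batch.slice(0, 5)
print(sample.num_rows)  # expected: 5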

I am confused and would like to understand the underlying behavior of the
C++ Arrow Parquet reader when batch_size is set to a small value. I guess
there might be some disproportionate overhead associated with batch_size
when its value is small.

Thanks,
Cheng Su
