jorisvandenbossche commented on issue #38389: URL: https://github.com/apache/arrow/issues/38389#issuecomment-1825290118
> Something weird is that most columns out of this file have a single chunk, even though the file has 21 row groups. This doesn't look right

That's because of the use of `pq.ParquetFile.read(..)` (assuming you were using that here; @mrocklin's gist is using that, but your gist from earlier was using `pq.read_table`, which should result in many more chunks).

This `ParquetFile.read()` function binds to `parquet::arrow::FileReader::ReadTable`, and I have noticed before that for some reason this concatenates the chunks somewhere in the read path. On the other hand, `ParquetFile.iter_batches()` binds to `FileReader::GetRecordBatchReader`, and the Dataset API / `pq.read_table` to `FileReader::GetRecordBatchGenerator`; those two APIs return smaller batches (per row group at first, but they also have a batch size and will typically return multiple chunks per row group).

```python
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq

# this file has 21 row groups
>>> file_path = "lineitem_0002072d-7283-43ae-b645-b26640318053.parquet"
>>> f = pq.ParquetFile(file_path)

# reading with ParquetFile.read gives a single chunk of data
>>> f.read()["l_orderkey"].num_chunks
1

# even when using the read_row_groups API
>>> f.read_row_groups([0, 1])["l_orderkey"].num_chunks
1

# only when using iter_batches do we of course get multiple chunks. The default batch_size here is 2**16,
# which even results in more batches than the number of row groups
>>> pa.Table.from_batches(f.iter_batches())["l_orderkey"].num_chunks
40

# we can make the batch_size larger
>>> pa.Table.from_batches(f.iter_batches(batch_size=128000))["l_orderkey"].num_chunks
21

# strangely it still seems to concatenate *across* row groups when further increasing the batch size
>>> pa.Table.from_batches(f.iter_batches(batch_size=2**17))["l_orderkey"].num_chunks
20

# pq.read_table uses the datasets API, but doesn't allow passing a batch size
>>> pq.read_table(file_path)["l_orderkey"].num_chunks
21

# in the datasets API, the default batch size is 2**17 instead of 2**16 ...
>>> import pyarrow.dataset as ds
>>> ds.dataset(file_path, format="parquet").to_table()["l_orderkey"].num_chunks
21

# we can lower it (now each individual row group gets split, with no combination of data from
# multiple row groups, I think because GetRecordBatchGenerator uses a sub-generator per row group
# instead of a single iterator for the whole file, as GetRecordBatchReader does)
>>> ds.dataset(file_path, format="parquet").to_table(batch_size=2**16)["l_orderkey"].num_chunks
42
```

So in summary, this is also a bit of a mess on our side (there are many different ways to read a parquet file ..). I had been planning to bring up that you might want to _not_ use `ParquetFile().read()` in dask, because it's a bit slower due to returning a single chunk; see the sketch below for what the alternative could look like. Although if the use case is to convert to pandas later on, it might not matter that much (although when using pyarrow strings, it can matter).

On the Arrow side, we should maybe consider making the default batch size a bit more uniform, and see if we want to use an actual batch size for the ReadTable code path as well.
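As a rough sketch of that alternative (not dask's actual code; the file path is just the example file from above, and the exact chunk counts depend on the file and the batch size), the idea would be to build the table from `iter_batches()` instead of `ParquetFile.read()`, so the per-batch chunking is preserved rather than concatenated by `ReadTable`:

```python
import pyarrow as pa
import pyarrow.parquet as pq

file_path = "lineitem_0002072d-7283-43ae-b645-b26640318053.parquet"
f = pq.ParquetFile(file_path)

# ReadTable code path: every column comes back as a single chunk
table_read = f.read()
print(table_read["l_orderkey"].num_chunks)  # 1

# GetRecordBatchReader code path: keep the batches as chunks instead of
# concatenating them; batch_size is tunable (2**16 is the iter_batches default)
table_batches = pa.Table.from_batches(
    f.iter_batches(batch_size=2**16), schema=f.schema_arrow
)
print(table_batches["l_orderkey"].num_chunks)  # multiple chunks, e.g. 40 for this file
```

Whether that is actually faster end to end depends on what happens with the table afterwards (e.g. a later `to_pandas()` may effectively re-concatenate anyway), so it would need benchmarking on the dask side.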
