jorisvandenbossche commented on issue #38389:
URL: https://github.com/apache/arrow/issues/38389#issuecomment-1825290118

   > Something weird is that most columns out of this file have a single chunk, even though the file has 21 row groups. This doesn't look right:
   
   That's because of the use of `pq.ParquetFile.read(..)` (assuming you were using that here; @mrocklin's gist is using that, but your gist from earlier was using `pq.read_table`, which should result in many more chunks).
   
   This `ParquetFile.read()` function binds to `parquet::arrow::FileReader::ReadTable`, and I have noticed before that for some reason this concatenates the chunks somewhere in the read path.
   On the other hand, `ParquetFile.iter_batches()` binds to `FileReader::GetRecordBatchReader`, and the Dataset API / `pq.read_table` to `FileReader::GetRecordBatchGenerator`, and those two APIs return smaller batches (per row group at first, but they also have a batch_size and will typically return multiple chunks per row group).
   
   ```python
   >>> import pyarrow as pa
   >>> import pyarrow.parquet as pq

   # this file has 21 row groups
   >>> file_path = "lineitem_0002072d-7283-43ae-b645-b26640318053.parquet"
   >>> f = pq.ParquetFile(file_path)
   
   # reading with ParquetFile.read gives a single chunk of data
   >>> f.read()["l_orderkey"].num_chunks
   1
   # even when using the read_row_groups API 
   >>> f.read_row_groups([0, 1])["l_orderkey"].num_chunks
   1
   # only when using iter_batches, it's of course multiple chunks. The default batch_size here is 2**16,
   # which even results in more batches than the number of row groups
   >>> pa.Table.from_batches(f.iter_batches())["l_orderkey"].num_chunks
   40
   # we can make the batch_size larger
   >>> pa.Table.from_batches(f.iter_batches(batch_size=128000))["l_orderkey"].num_chunks
   21
   # strangely it still seems to concatenate *across* row groups when further increasing the batch size
   >>> pa.Table.from_batches(f.iter_batches(batch_size=2**17))["l_orderkey"].num_chunks
   20
   
   # pq.read_table uses the datasets API, but doesn't allow passing a batch size
   >>> pq.read_table(file_path)["l_orderkey"].num_chunks
   21
   # in the datasets API, now the default batch size is 2**17 instead of 2**16 ...
   >>> import pyarrow.dataset as ds
   >>> ds.dataset(file_path, format="parquet").to_table()["l_orderkey"].num_chunks
   21
   # we can lower it (now each individual row group gets split, no combination of data of multiple row groups,
   # I think because the GetRecordBatchGenerator uses a sub-generator per row group instead of a single iterator
   # for the whole file as GetRecordBatchReader does)
   >>> ds.dataset(file_path, format="parquet").to_table(batch_size=2**16)["l_orderkey"].num_chunks
   42
   ```
   
   So in summary, this is also a bit of a mess on our side (there are many different ways to read a parquet file ..). I had been planning to bring up that you might want to _not_ use `ParquetFile().read()` in dask, because it is a bit slower due to returning a single concatenated chunk. Although if the use case is converting to pandas afterwards, it might not matter that much (though when using pyarrow strings, it can matter).
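
   For reference, a minimal sketch of what that alternative could look like on the dask side (just a sketch, reusing the example file from above); both `pq.read_table` and reassembling the `iter_batches` output keep multiple chunks per column:

   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq

   path = "lineitem_0002072d-7283-43ae-b645-b26640318053.parquet"  # example file from above

   # Option 1: read through the Dataset-backed pq.read_table, which keeps
   # the per-row-group / per-batch chunking in the resulting table
   table = pq.read_table(path)

   # Option 2: stream RecordBatches with iter_batches and reassemble them;
   # the chunk boundaries are preserved as well
   f = pq.ParquetFile(path)
   table_from_batches = pa.Table.from_batches(f.iter_batches(batch_size=2**17))

   print(table["l_orderkey"].num_chunks, table_from_batches["l_orderkey"].num_chunks)
   ```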
   
   On the Arrow side, we should maybe consider making the default batch size a bit more uniform, and see whether we want to use an actual batch size for the ReadTable code path as well.
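
   And if the ReadTable path were to stop concatenating, callers that rely on getting a single chunk could still opt into that explicitly; a minimal sketch of that user-side pattern (assuming the same example file, chunk counts shown only as an illustration):

   ```python
   import pyarrow.parquet as pq

   path = "lineitem_0002072d-7283-43ae-b645-b26640318053.parquet"  # example file from above

   # Dataset-backed read path, which preserves the chunk boundaries ...
   table = pq.read_table(path)
   print(table["l_orderkey"].num_chunks)  # 21 for this file

   # ... and an explicit opt-in to a single contiguous chunk per column,
   # instead of ReadTable doing the concatenation implicitly
   combined = table.combine_chunks()
   print(combined["l_orderkey"].num_chunks)  # 1
   ```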

