Thanks for reporting this. It seems a regression crept into 7.0.0
that accidentally disabled parallel column decoding when
pyarrow.parquet.read_table is called with a single file. I have filed
[1] and should have a fix for it before the next release. As a
workaround you can use the datasets API
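A minimal sketch of that workaround, assuming the suggestion is to call
pyarrow.dataset directly (the file name here is only a placeholder):

    import pyarrow.dataset as ds

    # Going through the datasets API directly bypasses the single-file
    # read_table code path that lost parallel column decoding.
    dataset = ds.dataset("Generico_1.parquet", format="parquet")
    table = dataset.to_table()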
I am using a public benchmark. The origin file is
https://homepages.cwi.nl/~boncz/PublicBIbenchmark/Generico/Generico_1.csv.bz2 .
I used pyarrow version 7.0.0 and the pq.write_table API to write the CSV file
as a Parquet file, with compression=snappy and use_dictionary=True. The data
has ~20M rows and
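For reference, the conversion was roughly the following (a sketch; the
decompressed CSV file name and the delimiter are assumptions, not taken
from the benchmark's documentation):

    import pyarrow.csv as pacsv
    import pyarrow.parquet as pq

    # Read the decompressed benchmark CSV and write it back out as
    # Parquet with the settings mentioned above.
    table = pacsv.read_csv(
        "Generico_1.csv",
        parse_options=pacsv.ParseOptions(delimiter="|"),  # delimiter assumed
    )
    pq.write_table(table, "Generico_1.parquet",
                   compression="snappy", use_dictionary=True)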
That doesn't really solve it, but it confirms that the problem is in the
newer datasets logic. I need more information to really know what is
going on, as this still seems like a problem.
How many row groups and how many columns does your file have? Or do you
have a sample Parquet file that shows
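If it helps, one quick way to pull those numbers, assuming the file is on
a local path:

    import pyarrow.parquet as pq

    # The file-level metadata answers both questions without
    # reading any of the data.
    meta = pq.ParquetFile("Generico_1.parquet").metadata
    print(meta.num_row_groups, meta.num_columns, meta.num_rows)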
use_legacy_dataset=True fixes the problem. Could you explain a little
about why? Thanks!
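For completeness, this is roughly what I compared (path assumed):

    import pyarrow.parquet as pq

    # Slow with the default (datasets-based) reader in 7.0.0:
    table = pq.read_table("Generico_1.parquet")

    # Back to the expected speed with the legacy reader:
    table = pq.read_table("Generico_1.parquet", use_legacy_dataset=True)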
On Thu, Feb 24, 2022 at 13:44, Weston Pace wrote:
> What version of pyarrow are you using? What's your OS? Is the file on a
> local disk or S3? How many row groups are in your file?
>
> A difference of that much