Re: [Python][Parquet] pq.ParquetFile.read faster than pq.read_table?

2022-02-24 Thread Weston Pace
Thanks for reporting this. It seems a regression crept into 7.0.0 that accidentally disabled parallel column decoding when pyarrow.parquet.read_table is called with a single file. I have filed [1] and should have a fix for it before the next release. As a workaround you can use the datasets API…
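[The archived message is cut off before the workaround details, so the exact call being recommended is unknown. A minimal sketch of reading a single Parquet file through the datasets API, assuming a local file named Generico_1.parquet (the benchmark file discussed below), might look like this:]

    import pyarrow.dataset as ds

    # Hypothetical file name, borrowed from the benchmark in this thread.
    # Open the file through the datasets API instead of pq.read_table;
    # to_table() scans the file and materializes it as a pyarrow Table.
    dataset = ds.dataset("Generico_1.parquet", format="parquet")
    table = dataset.to_table()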

Re: [Python][Parquet] pq.ParquetFile.read faster than pq.read_table?

2022-02-24 Thread Shawn Zeng
I am using a public benchmark. The original file is https://homepages.cwi.nl/~boncz/PublicBIbenchmark/Generico/Generico_1.csv.bz2 . I used pyarrow 7.0.0 and the pq.write_table API to write the CSV file as a Parquet file, with compression=snappy and use_dictionary=True. The data has ~20M rows and…
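[A sketch of the conversion described above, with local file names assumed from the benchmark URL. pa.input_stream decompresses the .bz2 transparently; the benchmark's delimiter and header handling may need explicit pyarrow.csv parse options not shown here:]

    import pyarrow as pa
    import pyarrow.csv as pv
    import pyarrow.parquet as pq

    # Decompress the .bz2 archive transparently (compression is
    # detected from the file extension by default).
    with pa.input_stream("Generico_1.csv.bz2") as f:
        table = pv.read_csv(f)

    # Write with the options mentioned in the message: snappy
    # compression and dictionary encoding for all columns.
    pq.write_table(table, "Generico_1.parquet",
                   compression="snappy", use_dictionary=True)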

Re: [Python][Parquet] pq.ParquetFile.read faster than pq.read_table?

2022-02-24 Thread Weston Pace
That doesn't really solve it; it just confirms that the problem is in the newer datasets logic. I need more information to really know what is going on, as this still seems like a problem. How many row groups and how many columns does your file have? Or do you have a sample parquet file that shows…
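[Both counts can be read from the Parquet footer without decoding any data pages; a minimal sketch, again assuming the file name from the benchmark above:]

    import pyarrow.parquet as pq

    # The footer metadata records both counts up front.
    pf = pq.ParquetFile("Generico_1.parquet")
    print("row groups:", pf.metadata.num_row_groups)
    print("columns:", pf.metadata.num_columns)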

Re: [Python][Parquet] pq.ParquetFile.read faster than pq.read_table?

2022-02-24 Thread Shawn Zeng
use_legacy_dataset=True fixes the problem. Could you explain a little about the reason? Thanks!

Weston Pace wrote on Thu, Feb 24, 2022 at 13:44:
> What version of pyarrow are you using? What's your OS? Is the file on a
> local disk or S3? How many row groups are in your file?
>
> A difference of that much…
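[A sketch of the comparison behind this exchange: timing the default, datasets-based path against the older path that use_legacy_dataset=True selects in pyarrow 7.0.0 (the flag was later deprecated). The file name is assumed as before:]

    import time
    import pyarrow.parquet as pq

    path = "Generico_1.parquet"

    t0 = time.perf_counter()
    pq.read_table(path)  # default path in 7.0.0, hit by the regression
    t1 = time.perf_counter()
    pq.read_table(path, use_legacy_dataset=True)  # older code path
    t2 = time.perf_counter()

    print(f"default: {t1 - t0:.2f}s  legacy: {t2 - t1:.2f}s")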