Hi Wes,

Thanks for your quick response.
Yes, we’re using Python 3.7.4, from miniconda and conda-forge, with:

numpy: 1.16.5
pandas: 0.25.1
pyarrow: 0.14.1

It looks like 0.15 is close, so I can wait for that.

Theoretically I see three components driving the performance:

1) The cost of locating the column (directory overhead)
2) The overhead of reading a single column (reading and processing metadata, setting up for reading)
3) Bulk reading and unmarshalling/decoding the data

Only 1) should be impacted by the number of columns, and if you’re reading everything it ideally would not be a problem at all. Based on an initial, cursory look at the Parquet format, I guess the index and the column metadata might need to be read in full, so I can see how that could slow down reading only a few columns out of a large set. But that was not really the case here?

What would you suggest for looking into the date-index slow-down?

Cheers,
Maarten.


> On Sep 23, 2019, at 7:07 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>
> hi Maarten,
>
> Are you using the master branch or 0.14.1? There are a number of
> performance regressions in 0.14.0/0.14.1 that are addressed in the
> master branch, to appear as 0.15.0 relatively soon.
>
> As a file format, Parquet (and columnar formats in general) is not
> known to perform well with more than 1000 columns.
>
> On the other items, we'd be happy to work with you to dig through the
> performance issues you're seeing.
>
> Thanks
> Wes
>
> On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <maart...@xs4all.nl> wrote:
>>
>> Greetings,
>>
>> We have pandas DataFrames with typically about 6,000 rows, using a
>> DateTimeIndex. They have about 20,000 columns with integer column
>> labels, and data with a dtype of float32.
>>
>> We’d like to store these DataFrames with Parquet, using the ability to
>> read a subset of columns and to store metadata with the file.
>>
>> We’ve found the reading performance to be less than expected compared
>> to the published benchmarks (e.g. Wes’ blog post).
>>
>> Using a modified version of his script we did reproduce his results
>> (~1 GB/s for high entropy, no dictionary encoding, on a MacBook Pro).
>>
>> But there seem to be three factors that contribute to the slowdown for
>> our datasets:
>>
>> - A DateTimeIndex is much slower than an integer index (we see about a
>>   factor of 5).
>> - The number of columns impacts reading speed significantly (a factor
>>   of ~2 going from 16 to 16,000 columns).
>> - 'use_pandas_metadata=True' slows down reading significantly (about
>>   40%) and appears unnecessary?
>>
>> Are there ways we could speed up the reading? Should we use a different
>> layout?
>>
>> Thanks for your help and insights!
>>
>> Cheers,
>> Maarten
>>
>>
>> ps. the routines we used:
>>
>> import pandas as pd
>> import pyarrow as pa
>> import pyarrow.parquet as pq
>>
>> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
>>     table = pa.Table.from_pandas(df)
>>     pq.write_table(table, fname, use_dictionary=False, compression=None)
>>     return
>>
>> def read_arrow_parquet(fname: str) -> pd.DataFrame:
>>     table = pq.read_table(fname, use_pandas_metadata=False, use_threads=True)
>>     df = table.to_pandas()
>>     return df
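For what it’s worth, a minimal sketch of how the index effect could be isolated, assuming the versions listed above (the sizes, scratch file names, and date range are arbitrary): write the same float32 data once with a DateTimeIndex and once with a plain integer index, then time a full read of each.

import time

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

rows, cols = 6_000, 20_000          # shrink these for a quicker local run

# Same float32 payload for both frames; only the index differs.
data = np.random.rand(rows, cols).astype(np.float32)
frames = {
    "datetime_index": pd.DataFrame(
        data, index=pd.date_range("2019-01-01", periods=rows, freq="T")),
    "int_index": pd.DataFrame(data),  # default integer RangeIndex
}

for name, df in frames.items():
    fname = f"{name}.parquet"         # hypothetical scratch files
    pq.write_table(pa.Table.from_pandas(df), fname,
                   use_dictionary=False, compression=None)
    start = time.perf_counter()
    pq.read_table(fname, use_threads=True).to_pandas()
    print(name, round(time.perf_counter() - start, 3), "s")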
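And a similar sketch for the column-count effect, assuming a file like the one written above already exists at that path: pull the column names from the file’s own schema, then time a full read against a 16-column read via the columns= argument of pq.read_table.

import time

import pyarrow.parquet as pq

fname = "datetime_index.parquet"      # hypothetical file from the sketch above

# First 16 data column names, taken from the file's schema.
all_names = pq.ParquetFile(fname).schema.to_arrow_schema().names
subset = all_names[:16]

start = time.perf_counter()
pq.read_table(fname, use_threads=True)
print("all columns:", round(time.perf_counter() - start, 3), "s")

start = time.perf_counter()
pq.read_table(fname, columns=subset, use_threads=True)
print("16 columns :", round(time.perf_counter() - start, 3), "s")

Comparing the two timings should give a rough split between per-column overhead and bulk decoding cost for a given file.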