Hi Maarten,

Are you using the master branch or 0.14.1? There are a number of
performance regressions in 0.14.0/0.14.1 that are addressed in the
master branch, to appear as 0.15.0 relatively soon.

As a file format, Parquet (and columnar formats in general) is not
known to perform well with more than 1000 columns.

On the other items, we'd be happy to work with you to dig through the
performance issues you're seeing.

Thanks
Wes

On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <maart...@xs4all.nl> wrote:
>
> Greetings,
>
> We have Pandas DataFrames with typically about 6,000 rows using DateTimeIndex.
> They have about 20,000 columns with integer column labels, and data with a 
> dtype of float32.
>
> We’d like to store these dataframes with parquet, using the ability to read a 
> subset of columns and to store meta-data with the file.
>
> We’ve found the reading performance less than expected compared to the 
> published benchmarks (e.g. Wes’ blog post).
>
> Using a modified version of his script we reproduced his results (~1 GB/s
> for high entropy, no dictionary, on a MacBook Pro).
>
> But there seem to be three factors that contribute to the slowdown for our 
> datasets:
>
> - A DateTimeIndex is much slower than an integer index (we see about a factor of 5).
> - The number of columns impacts reading speed significantly (factor of ~2 going 
> from 16 to 16,000 columns).
> - Setting ‘use_pandas_metadata=True’ slows down reading significantly (about 
> 40%) and appears unnecessary?
>
> Are there ways we could speedup the reading? Should we use a different layout?
>
> Thanks for your help and insights!
>
> Cheers,
> Maarten
>
>
> ps. the routines we used:
>
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
>     table = pa.Table.from_pandas(df)
>     pq.write_table(table, fname, use_dictionary=False, compression=None)
>
> def read_arrow_parquet(fname: str) -> pd.DataFrame:
>     table = pq.read_table(fname, use_pandas_metadata=False, use_threads=True)
>     return table.to_pandas()
>
>
