Hi Wes,

Thanks for your quick response.

Yes, we’re using Python 3.7.4, from miniconda and conda-forge, and:

numpy:           1.16.5
pandas:          0.25.1
pyarrow:         0.14.1

It looks like 0.15 is close, so I can wait for that.

Theoretically, I see three components driving the performance:
1) The cost of locating the column (directory overhead)
2) The per-column overhead of reading a single column (reading and processing
metadata, setting up the read)
3) Bulk reading and unmarshalling/decoding the data

Only 1) would be impacted by the number of columns, but if you’re reading
everything, ideally that would not be a problem.
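
To separate 2) and 3) I was going to time a full read against a small column
subset, roughly like this (the file name and the column names are just
placeholders for our data; I'm assuming the integer column labels come back as
strings after the pandas round-trip):

import time

import pyarrow.parquet as pq

def time_read(fname, columns=None):
    t0 = time.perf_counter()
    table = pq.read_table(fname, columns=columns, use_threads=True)
    print(f"{table.num_columns:>6} columns read in {time.perf_counter() - t0:.3f} s")
    return table

time_read("data.parquet")                      # full read, dominated by 3)
time_read("data.parquet", columns=["0", "1"])  # small subset, mostly 1) and 2)

If the subset read still takes a large fraction of the full-read time, that
would point at 1) and 2) rather than at bulk decoding.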

Based on a cursory look at the Parquet format, I would guess the index and the
column metadata might need to be read in full, so I can see how that might slow
down reading only a few columns out of a large set. But that was not really the
case here?
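
For what it's worth, I was going to check how big the footer actually is for
one of our files with something like this (the path is a placeholder; if I read
the docs right, serialized_size should report the size of the Thrift footer):

import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")
md = pf.metadata
print("rows:       ", md.num_rows)
print("columns:    ", md.num_columns)
print("row groups: ", md.num_row_groups)
print("footer size:", md.serialized_size, "bytes")

That would at least show how much footer data has to be parsed before any
column can be read.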

What would you suggest for looking into the DateTimeIndex slow-down?
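
In case it's useful, this is roughly the comparison I had in mind for the index
issue: the same float32 data written once with the DateTimeIndex and once with
a plain integer index, timing read_table() and to_pandas() separately (the
sizes match our data, everything else is just a sketch):

import time

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

rows, cols = 6000, 20000
data = np.random.randn(rows, cols).astype(np.float32)
dt_index = pd.date_range("2019-01-01", periods=rows, freq="T")

for name, index in [("datetime", dt_index), ("integer", pd.RangeIndex(rows))]:
    df = pd.DataFrame(data, index=index, columns=range(cols))
    fname = f"test_{name}.parquet"
    pq.write_table(pa.Table.from_pandas(df), fname,
                   use_dictionary=False, compression=None)

    t0 = time.perf_counter()
    table = pq.read_table(fname, use_threads=True)
    t1 = time.perf_counter()
    table.to_pandas()
    t2 = time.perf_counter()
    print(f"{name:>8}: read_table {t1 - t0:.3f} s, to_pandas {t2 - t1:.3f} s")

That should at least show whether the extra time goes into decoding the file or
into rebuilding the pandas index.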

Cheers,
Maarten.



> On Sep 23, 2019, at 7:07 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> 
> hi Maarten,
> 
> Are you using the master branch or 0.14.1? There are a number of
> performance regressions in 0.14.0/0.14.1 that are addressed in the
> master branch, to appear as 0.15.0 relatively soon.
> 
> As a file format, Parquet (and columnar formats in general) is not
> known to perform well with more than 1000 columns.
> 
> On the other items, we'd be happy to work with you to dig through the
> performance issues you're seeing.
> 
> Thanks
> Wes
> 
> On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <maart...@xs4all.nl> wrote:
>> 
>> Greetings,
>> 
>> We have Pandas DataFrames with typically about 6,000 rows using 
>> DateTimeIndex.
>> They have about 20,000 columns with integer column labels, and data with a 
>> dtype of float32.
>> 
>> We’d like to store these dataframes with Parquet, using the ability to read
>> a subset of columns and to store metadata with the file.
>> 
>> We’ve found the reading performance less than expected compared to the 
>> published benchmarks (e.g. Wes’ blog post).
>> 
>> Using a modified version of his script we did reproduce his results (~1 GB/s
>> for the high-entropy, no-dictionary case on a MacBook Pro).
>> 
>> But there seem to be three factors that contribute to the slowdown for our 
>> datasets:
>> 
>> - DateTimeIndex is much slower than an Int index (we see about a factor of 5).
>> - The number of columns impacts reading speed significantly (a factor of ~2
>> going from 16 to 16,000 columns).
>> - Setting ‘use_pandas_metadata=True’ slows down reading significantly (about
>> 40%) and appears unnecessary?
>> 
>> Are there ways we could speedup the reading? Should we use a different 
>> layout?
>> 
>> Thanks for your help and insights!
>> 
>> Cheers,
>> Maarten
>> 
>> 
>> ps. the routines we used:
>> 
>> import pandas as pd
>> import pyarrow as pa
>> import pyarrow.parquet as pq
>> 
>> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
>>    # Write the full DataFrame, no dictionary encoding, no compression.
>>    table = pa.Table.from_pandas(df)
>>    pq.write_table(table, fname, use_dictionary=False, compression=None)
>> 
>> def read_arrow_parquet(fname: str) -> pd.DataFrame:
>>    # Read the whole file back and convert to a DataFrame.
>>    table = pq.read_table(fname, use_pandas_metadata=False, use_threads=True)
>>    df = table.to_pandas()
>>    return df
>> 
>> 
