hi Maarten,

On Tue, Sep 24, 2019 at 9:26 AM Maarten Ballintijn <maart...@xs4all.nl> wrote:
>
> Hi Wes,
>
> Thanks for your quick response.
>
> Yes, we’re using Python 3.7.4, from miniconda and conda-forge, and:
>
> numpy:           1.16.5
> pandas:          0.25.1
> pyarrow:         0.14.1
>
> It looks like 0.15 is close, so I can wait for that.
>
> Theoretically I see three components driving the performance:
> 1) The cost of locating the column (directory overhead)
> 2) The overhead of reading a single column (reading and processing metadata,
> setting up for reading)
> 3) Bulk reading and unmarshalling/decoding the data.
>
> Only 1) would be impacted by the number of columns, but if you’re reading 
> everything ideally this would not be a problem.

The problem is more nuanced than that. Parquet's metadata is somewhat
"heavy" at the column level, so when you're writing thousands of
columns, the fixed overhead associated with reading a single column
becomes problematic. Several of the data structures involved in
decoding a column have a fixed setup and teardown cost. Even if the
fixed overhead of reading one column is only 1 millisecond (I don't
know what the number is exactly), reading 10,000 columns incurs 10
seconds of unavoidable overhead. It might be useful for us to quantify
and communicate the expected overhead once metadata and decoding are
taken into account. Simply put, having more than 1000 columns is not
advisable.
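
For example, a quick back-of-the-envelope measurement along these lines
(a rough sketch, assuming the same pandas/pyarrow setup as in your snippet
below; the file path is only illustrative) would show how the per-column
fixed cost scales with the number of columns:

import time
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def time_read(ncols, nrows=6000, fname="/tmp/overhead_test.parquet"):
    # Write a float32 table with `ncols` columns, then time a full read back.
    df = pd.DataFrame(np.random.randn(nrows, ncols).astype("float32"),
                      columns=[str(i) for i in range(ncols)])
    pq.write_table(pa.Table.from_pandas(df), fname,
                   use_dictionary=False, compression=None)
    start = time.perf_counter()
    pq.read_table(fname, use_threads=True).to_pandas()
    return time.perf_counter() - start

for ncols in (16, 160, 1600, 16000):
    elapsed = time_read(ncols)
    # Total time and approximate per-column cost
    print(ncols, elapsed, elapsed / ncols)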

> Based on an initial cursory look at the Parquet format I guess the index and 
> the column meta-data might need to be read in full so I can see how that  
> might slow down reading only a few columns out of a large set. But that was 
> not really the case here?
>
> What would you suggest for looking into the date index slow-down?

Can you share a code example so it's easier for us to see what
you're seeing?
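
Something like the sketch below (shapes based on what you described: 6,000
rows by 20,000 float32 columns; names and paths are only illustrative),
comparing a DateTimeIndex against a plain integer index on the same data,
would be a useful starting point:

import time
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

nrows, ncols = 6000, 20000
data = np.random.randn(nrows, ncols).astype("float32")
cases = [("int_index", pd.RangeIndex(nrows)),
         ("datetime_index", pd.date_range("2019-01-01", periods=nrows, freq="T"))]

for name, index in cases:
    df = pd.DataFrame(data, index=index, columns=[str(i) for i in range(ncols)])
    fname = "/tmp/repro_%s.parquet" % name
    pq.write_table(pa.Table.from_pandas(df), fname,
                   use_dictionary=False, compression=None)
    start = time.perf_counter()
    pq.read_table(fname, use_threads=True).to_pandas()
    # Time to read the full file back into pandas for each index type
    print(name, time.perf_counter() - start)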

>
> Cheers,
> Maarten.
>
>
>
> > On Sep 23, 2019, at 7:07 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > hi Maarten,
> >
> > Are you using the master branch or 0.14.1? There are a number of
> > performance regressions in 0.14.0/0.14.1 that are addressed in the
> > master branch, to appear as 0.15.0 relatively soon.
> >
> > As a file format, Parquet (and columnar formats in general) is not
> > known to perform well with more than 1000 columns.
> >
> > On the other items, we'd be happy to work with you to dig through the
> > performance issues you're seeing.
> >
> > Thanks
> > Wes
> >
> > On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <maart...@xs4all.nl> 
> > wrote:
> >>
> >> Greetings,
> >>
> >> We have Pandas DataFrames with typically about 6,000 rows using 
> >> DateTimeIndex.
> >> They have about 20,000 columns with integer column labels, and data with a 
> >> dtype of float32.
> >>
> >> We’d like to store these dataframes with parquet, using the ability to 
> >> read a subset of columns and to store meta-data with the file.
> >>
> >> We’ve found the reading performance less than expected compared to the 
> >> published benchmarks (e.g. Wes’ blog post).
> >>
> >> Using a modified version of his script we did reproduce his results (~1 GB/s
> >> for high-entropy data, no dictionary encoding, on a MacBook Pro).
> >>
> >> But there seem to be three factors that contribute to the slowdown for our 
> >> datasets:
> >>
> >> - DateTimeIndex is much slower than an Int index (we see about a factor of 5).
> >> - The number of columns impacts reading speed significantly (about a factor
> >> of 2 going from 16 to 16,000 columns).
> >> - Setting ‘use_pandas_metadata=True’ slows down reading significantly (about
> >> 40%) and appears unnecessary?
> >>
> >> Are there ways we could speedup the reading? Should we use a different 
> >> layout?
> >>
> >> Thanks for your help and insights!
> >>
> >> Cheers,
> >> Maarten
> >>
> >>
> >> ps. the routines we used:
> >>
> >> import pandas as pd
> >> import pyarrow as pa
> >> import pyarrow.parquet as pq
> >>
> >> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
> >>     table = pa.Table.from_pandas(df)
> >>     pq.write_table(table, fname, use_dictionary=False, compression=None)
> >>     return
> >>
> >> def read_arrow_parquet(fname: str) -> pd.DataFrame:
> >>     table = pq.read_table(fname, use_pandas_metadata=False, use_threads=True)
> >>     df = table.to_pandas()
> >>     return df
> >>
> >>
>
