On Sat, Sep 28, 2019 at 3:16 PM Maarten Ballintijn <maart...@xs4all.nl> wrote:
>
> Hi Joris,
>
> Thanks for your detailed analysis!
>
> We can leave the impact of the large DateTimeIndex on the performance for 
> another time.
> (Notes: my laptop has sufficient memory to support it, no error is thrown, the
> resulting DateTimeIndex from the expression is identical to your version or 
> the other version
> in the test. The large DateTimeIndex is released long before the tests 
> happen, yet it has a
> massive impact?? It feels like something is broken)
>
>
> Thanks for clearly demonstrating that the main issue is with to_pandas().
> That’s very unexpected; in the ’ns’ case I would expect no overhead.
> And even with the ‘us’ case it's only two vector compares and a factor 
> multiply, no?
> Also, Timestamps are quite ubiquitous :-)
>
>
> This leaves me with the following questions:
>
> - Who should I talk to to get this resolved in Pandas?
>
> - Where do I find out more about Parquet v2? And more generally is there an 
> RFC (or similar)
> document that defines the Parquet file format and API?

The one practical barrier to using Parquet V2 natively in Python
is resolving PARQUET-458, i.e. implementing the V2 data page encoding
correctly.

If you write V2 files, you may or may not be able to read them
everywhere. So if you are striving for compatibility across many
processing frameworks I would recommend sticking with V1.
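
In case it helps, here is a minimal sketch of pinning the writer format
version explicitly (the file names are placeholders, and how much of the
newer behaviour version="2.0" actually enables depends on the pyarrow
release):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"x": [1.0, 2.0]},
                  index=pd.date_range("2019-01-01", periods=2, freq="S"))
table = pa.Table.from_pandas(df)

# version="1.0" (the default) maximizes compatibility across readers;
# version="2.0" opts into the newer format features discussed above.
pq.write_table(table, "compat_v1.parquet", version="1.0")
pq.write_table(table, "newer_v2.parquet", version="2.0")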

For other questions I direct you to d...@parquet.apache.org

> - Do you think it would be possible to take the DateTime column out of Arrow
> into numpy and transform it there to make it more amenable to Pandas? And
> possibly even do the same for the value columns?
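>
> Something along these lines is what I have in mind, as a rough sketch only
> (assuming a microsecond-resolution column with no nulls, so that to_numpy()
> can take the zero-copy path; the array below is just illustrative):
>
> import numpy as np
> import pandas as pd
> import pyarrow as pa
>
> # Illustrative microsecond-resolution timestamp array, as read from Parquet.
> arr = pa.array(pd.date_range("2012", periods=100000, freq="S"), pa.timestamp("us"))
>
> # Pull the values into numpy, rescale to nanoseconds there, and hand the
> # result to pandas. Note this skips the out-of-bounds check that pandas
> # would normally perform during the us -> ns conversion.
> np_us = arr.to_numpy()
> index = pd.DatetimeIndex(np_us.astype("datetime64[ns]"))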
>
> Thanks again and have a great weekend!
> Maarten.
>
>
>
>
> > On Sep 25, 2019, at 11:57 AM, Joris Van den Bossche 
> > <jorisvandenboss...@gmail.com> wrote:
> >
> > From looking a little bit further into this, it seems that it is mainly
> > pandas that is slower in creating a Series from an array of datetime64
> > compared to an array of ints.
> > And especially if it is not nanosecond resolution:
> >
> > In [29]: a_int = pa.array(np.arange(100000))
> >
> > In [30]: %timeit a_int.to_pandas()
> > 56.7 µs ± 299 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> >
> > In [31]: a_datetime = pa.array(pd.date_range("2012", periods=100000,
> > freq='S'))
> >
> > In [32]: %timeit a_datetime.to_pandas()
> > 1.94 ms ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> >
> > In [33]: a_datetime_us = pa.array(pd.date_range("2012", periods=100000,
> > freq='S'), pa.timestamp('us'))
> >
> > In [34]: %timeit a_datetime_us.to_pandas()
> > 7.78 ms ± 46.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >
> > Creating the datetime64 array inside pyarrow is also a bit slower compared
> > to the int array (causing the slower conversion of a_datetime), but the above
> > difference between nanosecond and microsecond resolution is largely
> > due to pandas, not pyarrow (because pandas needs to convert the
> > microseconds to nanoseconds, and during that conversion will also check
> > that no datetimes were out of bounds for this resolution).
> >
> > And in parquet, the datetime data of the index column will be stored in
> > microsecond resolution (even if the original pandas data was nanosecond
> > resolution). And the slower reading of the parquet file with datetime index
> > is thus almost entirely due to the above difference in timing of converting
> > the int or datetime index column to pandas.
> > Parquet nowadays actually supports storing nanosecond resolution, and this
> > can be triggered in pyarrow by passing version="2.0" to pq.write_table (but
> > the last I heard, this version is not yet considered production ready).
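> >
> > To make that concrete, a small sketch of the round trip (file names are
> > placeholders, and the exact schema you see depends on the pyarrow version):
> >
> > import pandas as pd
> > import pyarrow as pa
> > import pyarrow.parquet as pq
> >
> > df = pd.DataFrame({"x": [1.0, 2.0]},
> >                   index=pd.date_range("2019-01-01", periods=2, freq="S"))
> > table = pa.Table.from_pandas(df)
> >
> > # Default format version: the index is stored as microsecond timestamps,
> > # so reading it back triggers the us -> ns conversion in pandas.
> > pq.write_table(table, "v1.parquet")
> > print(pq.read_table("v1.parquet").schema)
> >
> > # version="2.0" can keep nanosecond resolution, avoiding that conversion
> > # (with the production-readiness caveat mentioned above).
> > pq.write_table(table, "v2.parquet", version="2.0")
> > print(pq.read_table("v2.parquet").schema)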
> >
> > Joris
> >
> > On Wed, 25 Sep 2019 at 16:03, Joris Van den Bossche <
> > jorisvandenboss...@gmail.com> wrote:
> >
> >> Hi Maarten,
> >>
> >> Thanks for the reproducible script. I ran it on my laptop on pyarrow
> >> master, and not seeing the difference between both datetime indexes:
> >>
> >> Versions:
> >> Python:   3.7.3 | packaged by conda-forge | (default, Mar 27 2019,
> >> 23:01:00)
> >> [GCC 7.3.0] on linux
> >> numpy:    1.16.4
> >> pandas:   0.26.0.dev0+447.gc168ecf26
> >> pyarrow:  0.14.1.dev642+g7f2d637db
> >>
> >> 1073741824 float64 8388608 16
> >> 0: make_dataframe                 :   1443.483 msec,  709 MB/s
> >> 0: write_arrow_parquet            :   7685.426 msec,  133 MB/s
> >> 0: read_arrow_parquet             :   1262.741 msec,  811 MB/s <<<
> >> 1: make_dataframe                 :   1412.575 msec,  725 MB/s
> >> 1: write_arrow_parquet            :   7869.145 msec,  130 MB/s
> >> 1: read_arrow_parquet             :   1947.896 msec,  526 MB/s <<<
> >> 2: make_dataframe                 :   1490.165 msec,  687 MB/s
> >> 2: write_arrow_parquet            :   7040.507 msec,  145 MB/s
> >> 2: read_arrow_parquet             :   1888.316 msec,  542 MB/s <<<
> >>
> >> The only change I needed to make in the script to get it running (within
> >> my memory limits) was the creation of the second DatetimeIndex:
> >> pd.date_range('1970-01-01', '2019-09-01', freq='S') creates an index of
> >> 1.5 billion elements, while only the last part of it is used. So I changed
> >> that to index = pd.date_range('2018-01-01', '2019-09-01',
> >> freq='S').array[-rows:]
> >>
> >> The datetime index reading in general is still slower than the int index.
> >> But doing some more detailed timings, it seems this is not due to the
> >> reading of the parquet file itself, but to the conversion from arrow to
> >> pandas (using the files from the benchmark):
> >>
> >> In [1]: import pyarrow.parquet as pq
> >>
> >> In [4]: %timeit pq.read_table('testdata.int.parquet')
> >> 41.5 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>
> >> In [5]: %timeit pq.read_table('testdata.dt.parquet')
> >> 43 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>
> >> In [6]: table_int = pq.read_table('testdata.int.parquet')
> >>
> >> In [7]: table_datetime = pq.read_table('testdata.dt.parquet')
> >>
> >> In [8]: %timeit table_int.to_pandas()
> >> 14.3 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >>
> >> In [9]: %timeit table_datetime.to_pandas()
> >> 47.2 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>
> >> So you can see that the parquet reading part is basically identical, but
> >> the conversion to pandas is much slower for the datetime-index case.
> >> I will try to look into that code path to see what makes this so much
> >> slower.
> >>
> >> Joris
> >>
> >>
> >> On Tue, 24 Sep 2019 at 22:28, Maarten Ballintijn <maart...@xs4all.nl>
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> The code to show the performance issue with DateTimeIndex is at:
> >>>
> >>>        https://gist.github.com/maartenb/256556bcd6d7c7636d400f3b464db18c
> >>>
> >>> It shows three cases: 0) int index, 1) datetime index, 2) datetime index
> >>> created in a slightly roundabout way.
> >>>
> >>> I’m a little confused by the two datetime cases. Case 2) is much slower,
> >>> but the df compares identical to case 1).
> >>> (I originally used something like 2) to match our specific data. I don’t
> >>> see why it behaves differently??)
> >>>
> >>> The timings I find are:
> >>>
> >>> 1073741824 float64 8388608 16
> >>> 0: make_dataframe                 :   2390.830 msec,  428 MB/s
> >>> 0: write_arrow_parquet            :   2486.463 msec,  412 MB/s
> >>> 0: read_arrow_parquet             :    813.946 msec,  1258 MB/s <<<
> >>> 1: make_dataframe                 :   2579.815 msec,  397 MB/s
> >>> 1: write_arrow_parquet            :   2708.151 msec,  378 MB/s
> >>> 1: read_arrow_parquet             :   1413.999 msec,  724 MB/s <<<
> >>> 2: make_dataframe                 :  15126.520 msec,  68 MB/s
> >>> 2: write_arrow_parquet            :   9205.815 msec,  111 MB/s
> >>> 2: read_arrow_parquet             :   5929.346 msec,  173 MB/s <<<
> >>>
> >>> Case 0, int index.  This is all great.
> >>> Case 1, datetime index. We lose almost half the speed. Given that a
> >>> datetime only needs to be scaled coming from Pandas, IIRC, that seems
> >>> like a lot?
> >>> Case 2, the other datetime index. No idea what is going on.
> >>>
> >>> Any insights are much appreciated.
> >>>
> >>> Cheers,
> >>> Maarten.
> >>>
> >>>> On Sep 24, 2019, at 11:25 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> >>>>
> >>>> hi
> >>>>
> >>>> On Tue, Sep 24, 2019 at 9:26 AM Maarten Ballintijn <maart...@xs4all.nl> wrote:
> >>>>>
> >>>>> Hi Wes,
> >>>>>
> >>>>> Thanks for your quick response.
> >>>>>
> >>>>> Yes, we’re using Python 3.7.4, from miniconda and conda-forge, and:
> >>>>>
> >>>>> numpy:           1.16.5
> >>>>> pandas:          0.25.1
> >>>>> pyarrow:         0.14.1
> >>>>>
> >>>>> It looks like 0.15 is close, so I can wait for that.
> >>>>>
> >>>>> Theoretically I see three components driving the performance:
> >>>>> 1) The cost of locating the column (directory overhead)
> >>>>> 2) The overhead of reading a single column. (reading and processing
> >>> meta data, setting up for reading)
> >>>>> 3) Bulk reading and unmarshalling/decoding the data.
> >>>>>
> >>>>> Only 1) would be impacted by the number of columns, but if you’re
> >>> reading everything ideally this would not be a problem.
> >>>>
> >>>> The problem is more nuanced than that. Parquet's metadata is somewhat
> >>>> "heavy" at the column level. So when you're writing thousands of
> >>>> columns, the fixed overhead associated with reading a single column
> >>>> becomes problematic. There are several data structures associated with
> >>>> decoding a column that have a fixed setup and teardown cost. Even if there
> >>>> is only 1 millisecond of fixed overhead per column (I don't know what the
> >>>> number is exactly), reading 10,000 columns carries 10 seconds of
> >>>> unavoidable overhead. It might be useful for us to quantify and
> >>>> communicate the expected overhead once metadata and decoding are taken
> >>>> into account. Simply put, having more than 1,000 columns is not advisable.
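> >>>>
> >>>> One rough way to see that fixed cost is to time reads of increasingly
> >>>> many columns from the same file (a sketch only; the path is a
> >>>> placeholder and the column counts are illustrative):
> >>>>
> >>>> import time
> >>>> import pyarrow.parquet as pq
> >>>>
> >>>> path = "wide.parquet"  # some existing wide file
> >>>> names = pq.read_schema(path).names
> >>>>
> >>>> for n in (16, 256, 4096):
> >>>>     start = time.perf_counter()
> >>>>     pq.read_table(path, columns=names[:n])
> >>>>     elapsed = time.perf_counter() - start
> >>>>     print(n, "columns:", round(elapsed, 3), "s,",
> >>>>           round(1000 * elapsed / n, 3), "ms per column")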
> >>>>
> >>>>> Based on an initial cursory look at the Parquet format, I guess the
> >>>>> index and the column meta-data might need to be read in full, so I can
> >>>>> see how that might slow down reading only a few columns out of a large
> >>>>> set. But that was not really the case here?
> >>>>>
> >>>>> What would you suggest for looking into the date index slow-down?
> >>>>
> >>>> Can you show a code example to make things easier for us to see what
> >>>> you're seeing?
> >>>>
> >>>>>
> >>>>> Cheers,
> >>>>> Maarten.
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On Sep 23, 2019, at 7:07 PM, Wes McKinney <wesmck...@gmail.com>
> >>> wrote:
> >>>>>>
> >>>>>> hi Maarten,
> >>>>>>
> >>>>>> Are you using the master branch or 0.14.1? There are a number of
> >>>>>> performance regressions in 0.14.0/0.14.1 that are addressed in the
> >>>>>> master branch, to appear as 0.15.0 relatively soon.
> >>>>>>
> >>>>>> As a file format, Parquet (and columnar formats in general) is not
> >>>>>> known to perform well with more than 1000 columns.
> >>>>>>
> >>>>>> On the other items, we'd be happy to work with you to dig through the
> >>>>>> performance issues you're seeing.
> >>>>>>
> >>>>>> Thanks
> >>>>>> Wes
> >>>>>>
> >>>>>> On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <
> >>> maart...@xs4all.nl> wrote:
> >>>>>>>
> >>>>>>> Greetings,
> >>>>>>>
> >>>>>>> We have Pandas DataFrames with typically about 6,000 rows using
> >>> DateTimeIndex.
> >>>>>>> They have about 20,000 columns with integer column labels, and data
> >>> with a dtype of float32.
> >>>>>>>
> >>>>>>> We’d like to store these dataframes with parquet, using the ability
> >>> to read a subset of columns and to store meta-data with the file.
> >>>>>>>
> >>>>>>> We’ve found the reading performance less than expected compared to
> >>> the published benchmarks (e.g. Wes’ blog post).
> >>>>>>>
> >>>>>>> Using a modified version of his script we did reproduce his results
> >>> (~ 1GB/s for high entropy, no dict on MacBook pro)
> >>>>>>>
> >>>>>>> But there seem to be three factors that contribute to the slowdown
> >>> for our datasets:
> >>>>>>>
> >>>>>>> - DateTimeIndex is much slower than an int index (we see about a
> >>>>>>>   factor of 5).
> >>>>>>> - The number of columns impacts reading speed significantly (a factor
> >>>>>>>   of ~2 going from 16 to 16,000 columns).
> >>>>>>> - The ‘use_pandas_metadata=True’ option slows down reading significantly
> >>>>>>>   (about 40%) and appears unnecessary?
> >>>>>>>
> >>>>>>> Are there ways we could speed up the reading? Should we use a
> >>>>>>> different layout?
> >>>>>>>
> >>>>>>> Thanks for your help and insights!
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Maarten
> >>>>>>>
> >>>>>>>
> >>>>>>> ps. the routines we used:
> >>>>>>>
> >>>>>>> import pandas as pd
> >>>>>>> import pyarrow as pa
> >>>>>>> import pyarrow.parquet as pq
> >>>>>>>
> >>>>>>> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
> >>>>>>>     table = pa.Table.from_pandas(df)
> >>>>>>>     pq.write_table(table, fname, use_dictionary=False, compression=None)
> >>>>>>>
> >>>>>>> def read_arrow_parquet(fname: str) -> pd.DataFrame:
> >>>>>>>     table = pq.read_table(fname, use_pandas_metadata=False, use_threads=True)
> >>>>>>>     df = table.to_pandas()
> >>>>>>>     return df
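> >>>>>>>
> >>>>>>> For example, driven with a small frame shaped roughly like ours (the
> >>>>>>> sizes and file name here are illustrative, not the real benchmark):
> >>>>>>>
> >>>>>>> import numpy as np
> >>>>>>>
> >>>>>>> idx = pd.date_range("2019-01-01", periods=6000, freq="S")
> >>>>>>> df = pd.DataFrame(np.random.rand(6000, 200).astype("float32"),
> >>>>>>>                   index=idx, columns=range(200))
> >>>>>>> write_arrow_parquet(df, "test.parquet")
> >>>>>>> df2 = read_arrow_parquet("test.parquet")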
> >>>
> >>>
>
