Hi Maarten,

Thanks for the reproducible script. I ran it on my laptop on pyarrow
master, and I am not seeing the difference between the two datetime indexes:

Versions:
Python:   3.7.3 | packaged by conda-forge | (default, Mar 27 2019,
23:01:00)
[GCC 7.3.0] on linux
numpy:    1.16.4
pandas:   0.26.0.dev0+447.gc168ecf26
pyarrow:  0.14.1.dev642+g7f2d637db

1073741824 float64 8388608 16
0: make_dataframe                 :   1443.483 msec,  709 MB/s
0: write_arrow_parquet            :   7685.426 msec,  133 MB/s
0: read_arrow_parquet             :   1262.741 msec,  811 MB/s <<<
1: make_dataframe                 :   1412.575 msec,  725 MB/s
1: write_arrow_parquet            :   7869.145 msec,  130 MB/s
1: read_arrow_parquet             :   1947.896 msec,  526 MB/s <<<
2: make_dataframe                 :   1490.165 msec,  687 MB/s
2: write_arrow_parquet            :   7040.507 msec,  145 MB/s
2: read_arrow_parquet             :   1888.316 msec,  542 MB/s <<<

The only change I needed to make in the script to get it running (within my
memory limits) was the creation of the second DatetimeIndex:
pd.date_range('1970-01-01', '2019-09-01', freq='S') creates an index of about
1.5 billion elements, while only the last part of it is used. So I changed
that to index = pd.date_range('2018-01-01', '2019-09-01',
freq='S').array[-rows:], as sketched below.
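For reference, a minimal sketch of that change (the "original" line is my
reconstruction from the description above; rows is the row count taken from
the benchmark output):

import pandas as pd

rows = 8_388_608  # row count from the benchmark output above

# original (roughly): builds ~1.5 billion timestamps only to slice off the tail
# index = pd.date_range('1970-01-01', '2019-09-01', freq='S').array[-rows:]

# changed: a much shorter range that still covers the needed tail
index = pd.date_range('2018-01-01', '2019-09-01', freq='S').array[-rows:]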

The datetime index reading is in general still slower than the int index. But
doing a bit more detailed timing, it seems this is not due to the reading of
the parquet file itself, but to the conversion from arrow to pandas (using the
files from the benchmark):

In [1]: import pyarrow.parquet as pq

In [4]: %timeit pq.read_table('testdata.int.parquet')
41.5 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit pq.read_table('testdata.dt.parquet')
43 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: table_int = pq.read_table('testdata.int.parquet')

In [7]: table_datetime = pq.read_table('testdata.dt.parquet')

In [8]: %timeit table_int.to_pandas()
14.3 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit table_datetime.to_pandas()
47.2 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So you can see that the parquet reading part is basically identical, but
the conversion to pandas is much slower for the datetime-index case.
I will try to look into that code path to see what makes this so much
slower.
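
In case it is useful, here is a small standalone sketch (my own, not part of
Maarten's gist, and smaller than the benchmark) that isolates the
arrow-to-pandas conversion for an int64 index versus a datetime index, with
no parquet involved:

import time

import numpy as np
import pandas as pd
import pyarrow as pa

rows, cols = 2**21, 16  # smaller than the benchmark, enough to see the effect
data = {f"c{i}": np.random.rand(rows) for i in range(cols)}

df_int = pd.DataFrame(data, index=np.arange(rows))
df_dt = pd.DataFrame(data,
                     index=pd.date_range("2018-01-01", periods=rows, freq="S"))

table_int = pa.Table.from_pandas(df_int)
table_dt = pa.Table.from_pandas(df_dt)

for name, table in [("int index", table_int), ("datetime index", table_dt)]:
    start = time.perf_counter()
    table.to_pandas()  # the step that differs between the two cases
    print(f"{name:>14}: {(time.perf_counter() - start) * 1000:.1f} ms")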

Joris


On Tue, 24 Sep 2019 at 22:28, Maarten Ballintijn <maart...@xs4all.nl> wrote:

> Hi,
>
> The code to show the performance issue with DateTimeIndex is at:
>
>         https://gist.github.com/maartenb/256556bcd6d7c7636d400f3b464db18c
>
> It shows three cases: 0) an int index, 1) a datetime index, 2) a datetime
> index created in a slightly roundabout way.
>
> I’m a little confused by the two datetime cases. Case 2) is much slower,
> but the df compares identical to case 1).
> (I originally used something like 2) to match our specific data. I don’t
> see why it behaves differently??)
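
Side note, not from the gist: two DatetimeIndexes can compare equal
element-wise and still carry different metadata such as the inferred freq;
whether that explains the difference here I don't know, but it is cheap to
check. A tiny illustration:

import pandas as pd

idx_direct = pd.date_range('2019-01-01', periods=10, freq='S')
idx_roundabout = pd.DatetimeIndex(idx_direct.to_numpy())  # rebuilt from the raw values

print(idx_direct.equals(idx_roundabout))       # True: same values
print(idx_direct.freq, idx_roundabout.freq)    # <Second> vs None: metadata differs
print(idx_direct.dtype, idx_roundabout.dtype)  # both datetime64[ns]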
>
> The timings I find are:
>
> 1073741824 float64 8388608 16
> 0: make_dataframe                 :   2390.830 msec,  428 MB/s
> 0: write_arrow_parquet            :   2486.463 msec,  412 MB/s
> 0: read_arrow_parquet             :    813.946 msec,  1258 MB/s <<<
> 1: make_dataframe                 :   2579.815 msec,  397 MB/s
> 1: write_arrow_parquet            :   2708.151 msec,  378 MB/s
> 1: read_arrow_parquet             :   1413.999 msec,  724 MB/s <<<
> 2: make_dataframe                 :  15126.520 msec,  68 MB/s
> 2: write_arrow_parquet            :   9205.815 msec,  111 MB/s
> 2: read_arrow_parquet             :   5929.346 msec,  173 MB/s <<<
>
> Case 0, int index. This is all great.
> Case 1, datetime index. We lose almost half the speed. Given that a
> datetime is only scaled from pandas IIRC, that seems like a lot?
> Case 2, the other datetime index. No idea what is going on.
>
> Any insights are much appreciated.
>
> Cheers,
> Maarten.
>
> > On Sep 24, 2019, at 11:25 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > hi
> >
> > On Tue, Sep 24, 2019 at 9:26 AM Maarten Ballintijn <maart...@xs4all.nl> wrote:
> >>
> >> Hi Wes,
> >>
> >> Thanks for your quick response.
> >>
> >> Yes, we’re using Python 3.7.4, from miniconda and conda-forge, and:
> >>
> >> numpy:           1.16.5
> >> pandas:          0.25.1
> >> pyarrow:         0.14.1
> >>
> >> It looks like 0.15 is close, so I can wait for that.
> >>
> >> Theoretically I see three components driving the performance:
> >> 1) The cost of locating the column (directory overhead)
> >> 2) The overhead of reading a single column. (reading and processing
> meta data, setting up for reading)
> >> 3) Bulk reading and unmarshalling/decoding the data.
> >>
> >> Only 1) would be impacted by the number of columns, but if you’re
> reading everything ideally this would not be a problem.
> >
> > The problem is more nuanced than that. Parquet's metadata is somewhat
> > "heavy" at the column level. So when you're writing thousands of
> > columns, the fixed overhead associated with reading a single column
> > becomes problematic. There are several data structures associated with
> > decoding a column that have a fixed setup and teardown cost. Even if there
> > is 1 millisecond of fixed overhead related to reading a column (I
> > don't know what the number is exactly) then reading 10,000 columns has
> > 10 seconds of unavoidable overhead. It might be useful for us to
> > quantify and communicate the expected overhead when metadata and
> > decoding are taken into account. Simply put, having more than 1000
> > columns is not advisable.
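
A rough way to quantify that fixed per-column cost empirically (my own
sketch, nothing official; file names and sizes are arbitrary) is to write
files with the same total amount of data but different column counts and
compare read times:

import time

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

total_values = 2**24  # keep the total amount of data constant across files

for ncols in (16, 256, 4096):
    nrows = total_values // ncols
    df = pd.DataFrame(np.random.rand(nrows, ncols).astype("float32"),
                      columns=[str(i) for i in range(ncols)])
    fname = f"overhead_{ncols}cols.parquet"
    pq.write_table(pa.Table.from_pandas(df), fname, use_dictionary=False)

    start = time.perf_counter()
    pq.read_table(fname)
    elapsed = time.perf_counter() - start
    print(f"{ncols:5d} columns: {elapsed * 1000:8.1f} ms")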
> >
> >> Based on an initial cursory look at the Parquet format, I guess the
> index and the column meta-data might need to be read in full, so I can see
> how that might slow down reading only a few columns out of a large set. But
> that was not really the case here?
> >>
> >> What would you suggest for looking into the date index slow-down?
> >
> > Can you show a code example to make things easier for us to see what
> > you're seeing?
> >
> >>
> >> Cheers,
> >> Maarten.
> >>
> >>
> >>
> >>> On Sep 23, 2019, at 7:07 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> >>>
> >>> hi Maarten,
> >>>
> >>> Are you using the master branch or 0.14.1? There are a number of
> >>> performance regressions in 0.14.0/0.14.1 that are addressed in the
> >>> master branch, to appear as 0.15.0 relatively soon.
> >>>
> >>> As a file format, Parquet (and columnar formats in general) is not
> >>> known to perform well with more than 1000 columns.
> >>>
> >>> On the other items, we'd be happy to work with you to dig through the
> >>> performance issues you're seeing.
> >>>
> >>> Thanks
> >>> Wes
> >>>
> >>> On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <maart...@xs4all.nl>
> wrote:
> >>>>
> >>>> Greetings,
> >>>>
> >>>> We have Pandas DataFrames with typically about 6,000 rows using
> DateTimeIndex.
> >>>> They have about 20,000 columns with integer column labels, and data
> with a dtype of float32.
> >>>>
> >>>> We’d like to store these dataframes with parquet, using the ability
> to read a subset of columns and to store meta-data with the file.
> >>>>
> >>>> We’ve found the reading performance to be less than expected compared
> to the published benchmarks (e.g. Wes’ blog post).
> >>>>
> >>>> Using a modified version of his script we did reproduce his results
> (~1 GB/s for high entropy, no dict, on a MacBook Pro).
> >>>>
> >>>> But there seem to be three factors that contribute to the slowdown
> for our datasets:
> >>>>
> >>>> - DateTimeIndex is much slower than an int index (we see about a
> factor of 5).
> >>>> - The number of columns impacts reading speed significantly (a factor
> of ~2 going from 16 to 16,000 columns).
> >>>> - ‘use_pandas_metadata=True’ slows down reading significantly (about
> 40%) and appears unnecessary?
> >>>>
> >>>> Are there ways we could speedup the reading? Should we use a
> different layout?
> >>>>
> >>>> Thanks for your help and insights!
> >>>>
> >>>> Cheers,
> >>>> Maarten
> >>>>
> >>>>
> >>>> ps. the routines we used:
> >>>>
> >>>> import pandas as pd
> >>>> import pyarrow as pa
> >>>> import pyarrow.parquet as pq
> >>>>
> >>>> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
> >>>>     table = pa.Table.from_pandas(df)
> >>>>     pq.write_table(table, fname, use_dictionary=False, compression=None)
> >>>>     return
> >>>>
> >>>> def read_arrow_parquet(fname: str) -> pd.DataFrame:
> >>>>     table = pq.read_table(fname, use_pandas_metadata=False,
> >>>>                           use_threads=True)
> >>>>     df = table.to_pandas()
> >>>>     return df
>
>
