I ran cProfile to better understand what is going on in Pandas. Using your code below I find that, when the datetime64 data is not in 'ns' resolution, Pandas runs a loop over the generic datetime64 conversion: each value is unpacked into a date-time struct and that struct is then converted back into a Timestamp. This is clearly slow and unnecessary.
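For anyone who wants to reproduce this, a minimal sketch of the profiling run (it reuses the same microsecond array as in Joris's timings quoted below; this is only an approximation of what I ran):

import cProfile
import numpy as np
import pandas as pd

# Same arrays as in Joris's timings below: a ns-resolution array,
# then downcast to microseconds
arr_ns = np.asarray(pd.date_range("2012", periods=100000, freq="S"))
arr_us = arr_ns.astype("datetime64[us]")

# Profile the Series construction for the non-ns array; per the analysis
# above, the time is dominated by the generic per-element
# datetime64 -> Timestamp conversion loop
cProfile.run("pd.Series(arr_us)", sort="cumulative")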
Adding a dedicated us->ns conversion (either in Cython or C) should give much better performance. I can open a request for this on the pandas issue tracker, unless someone thinks this is not a good idea?

Somewhat surprisingly, specifying the dtype in the pd.Series constructor triggers another code path which is a further factor of ~5 slower.

> On Oct 1, 2019, at 7:07 AM, Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
>
> Some answers to the other questions:
>
> On Sat, 28 Sep 2019 at 22:16, Maarten Ballintijn <maart...@xs4all.nl> wrote:
>
>> ...
>> This leaves me with the following questions:
>>
>> - Who should I talk to to get this resolved in Pandas?
>>
> You can open an issue on their tracker: https://github.com/pandas-dev/pandas/issues/
>
>> - Do you think it would be possible to take the DateTime column out of Arrow into numpy and transform it to make it more amenable to Pandas? And possibly even for the value columns?
>>
> The reason that this non-ns column takes more time is due to the conversion to nanoseconds.
> So one way could be to already cast your arrow table to ns before converting to pandas (this is implemented in arrow, but currently does not check for out-of-bounds values; I opened https://issues.apache.org/jira/browse/ARROW-6704 for this).
>
> But I am not sure it can be improved by a big factor on the pandas side. Let's look at some timings using:
>
> arr_int = np.arange(100000)
> arr_ns = np.asarray(pd.date_range("2012", periods=100000, freq='S'))
> arr_us = arr_ns.astype("datetime64[us]")
>
> Conversion from those arrays to a pandas Series only takes more time for the microsecond datetimes:
>
> In [6]: %timeit pd.Series(arr_int)
> 47.2 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>
> In [7]: %timeit pd.Series(arr_ns)
> 27.7 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>
> In [8]: %timeit pd.Series(arr_us)
> 6.22 ms ± 289 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>
> But just converting microseconds to nanoseconds also already takes almost a millisecond on this data (with numpy, but arrow is not faster):
>
> In [15]: %timeit arr_us.astype("datetime64[ns]")
> 742 µs ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>
> So it should certainly be possible to get those 6 ms down to something around 1 ms (which would be a nice improvement! You're welcome to open an issue on the pandas issue tracker for it), but it will still be a factor of 20 to 30 slower than the nanosecond case.
>
> Joris
>
>>> On Sep 25, 2019, at 11:57 AM, Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
>>>
>>> From looking a little bit further into this, it seems that it is mainly pandas that is slower in creating a Series from an array of datetime64 compared to an array of ints.
>>> And especially if it is not nanosecond resolution:
>>>
>>> In [29]: a_int = pa.array(np.arange(100000))
>>>
>>> In [30]: %timeit a_int.to_pandas()
>>> 56.7 µs ± 299 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>>
>>> In [31]: a_datetime = pa.array(pd.date_range("2012", periods=100000, freq='S'))
>>>
>>> In [32]: %timeit a_datetime.to_pandas()
>>> 1.94 ms ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>>
>>> In [33]: a_datetime_us = pa.array(pd.date_range("2012", periods=100000, freq='S'), pa.timestamp('us'))
>>>
>>> In [34]: %timeit a_datetime_us.to_pandas()
>>> 7.78 ms ± 46.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>>
>>> Creating the datetime64 array inside pyarrow is also a bit slower compared to int (causing the slower conversion of a_datetime), but the above difference between nanosecond and microsecond resolution is largely due to pandas, not pyarrow (because pandas needs to convert the microseconds to nanoseconds, and during that conversion will also check that no datetimes were out of bounds for this resolution).
>>>
>>> And in parquet, the datetime data of the index column will be stored in microsecond resolution (even if the original pandas data was nanosecond resolution). The slower reading of the parquet file with datetime index is thus almost entirely due to the above difference in timing when converting the int or datetime index column to pandas.
>>> Parquet nowadays actually supports storing nanosecond resolution, and this can be triggered in pyarrow by passing version="2.0" to pq.write_table (but last I heard this version is not yet considered production ready).
>>>
>>> Joris
>>>
>>> On Wed, 25 Sep 2019 at 16:03, Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
>>>
>>>> Hi Maarten,
>>>>
>>>> Thanks for the reproducible script. I ran it on my laptop on pyarrow master, and I am not seeing the difference between both datetime indexes:
>>>>
>>>> Versions:
>>>> Python: 3.7.3 | packaged by conda-forge | (default, Mar 27 2019, 23:01:00) [GCC 7.3.0] on linux
>>>> numpy: 1.16.4
>>>> pandas: 0.26.0.dev0+447.gc168ecf26
>>>> pyarrow: 0.14.1.dev642+g7f2d637db
>>>>
>>>> 1073741824 float64 8388608 16
>>>> 0: make_dataframe : 1443.483 msec, 709 MB/s
>>>> 0: write_arrow_parquet : 7685.426 msec, 133 MB/s
>>>> 0: read_arrow_parquet : 1262.741 msec, 811 MB/s <<<
>>>> 1: make_dataframe : 1412.575 msec, 725 MB/s
>>>> 1: write_arrow_parquet : 7869.145 msec, 130 MB/s
>>>> 1: read_arrow_parquet : 1947.896 msec, 526 MB/s <<<
>>>> 2: make_dataframe : 1490.165 msec, 687 MB/s
>>>> 2: write_arrow_parquet : 7040.507 msec, 145 MB/s
>>>> 2: read_arrow_parquet : 1888.316 msec, 542 MB/s <<<
>>>>
>>>> The only change I needed to make in the script to get it running (within my memory limits) was the creation of the second DatetimeIndex: pd.date_range('1970-01-01', '2019-09-01', freq='S') creates an index of 1.5 billion elements, while only the last part of it is used. So I changed that to index = pd.date_range('2018-01-01', '2019-09-01', freq='S').array[-rows:].
>>>>
>>>> The datetime index reading in general is still slower than the int index. But doing a bit more detailed timings, it seems this is not due to the reading of parquet, but to the conversion of arrow to pandas (using the files from the benchmark):
>>>>
>>>> In [1]: import pyarrow.parquet as pq
>>>>
>>>> In [4]: %timeit pq.read_table('testdata.int.parquet')
>>>> 41.5 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>>>
>>>> In [5]: %timeit pq.read_table('testdata.dt.parquet')
>>>> 43 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>>>
>>>> In [6]: table_int = pq.read_table('testdata.int.parquet')
>>>>
>>>> In [7]: table_datetime = pq.read_table('testdata.dt.parquet')
>>>>
>>>> In [8]: %timeit table_int.to_pandas()
>>>> 14.3 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>>>
>>>> In [9]: %timeit table_datetime.to_pandas()
>>>> 47.2 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>>>
>>>> So you can see that the parquet reading part is basically identical, but the conversion to pandas is much slower for the datetime-index case.
>>>> I will try to look into that code path to see what makes this so much slower.
>>>>
>>>> Joris
>>>>
>>>> On Tue, 24 Sep 2019 at 22:28, Maarten Ballintijn <maart...@xs4all.nl> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> The code to show the performance issue with DateTimeIndex is at:
>>>>>
>>>>> https://gist.github.com/maartenb/256556bcd6d7c7636d400f3b464db18c
>>>>>
>>>>> It shows three cases: 0) int index, 1) datetime index, 2) datetime index created in a slightly roundabout way.
>>>>>
>>>>> I'm a little confused by the two datetime cases. Case 2) is much slower, but the df compares identical to case 1).
>>>>> (I originally used something like 2) to match our specific data. I don't see why it behaves differently??)
>>>>>
>>>>> The timings I find are:
>>>>>
>>>>> 1073741824 float64 8388608 16
>>>>> 0: make_dataframe : 2390.830 msec, 428 MB/s
>>>>> 0: write_arrow_parquet : 2486.463 msec, 412 MB/s
>>>>> 0: read_arrow_parquet : 813.946 msec, 1258 MB/s <<<
>>>>> 1: make_dataframe : 2579.815 msec, 397 MB/s
>>>>> 1: write_arrow_parquet : 2708.151 msec, 378 MB/s
>>>>> 1: read_arrow_parquet : 1413.999 msec, 724 MB/s <<<
>>>>> 2: make_dataframe : 15126.520 msec, 68 MB/s
>>>>> 2: write_arrow_parquet : 9205.815 msec, 111 MB/s
>>>>> 2: read_arrow_parquet : 5929.346 msec, 173 MB/s <<<
>>>>>
>>>>> Case 0, int index: this is all great.
>>>>> Case 1, datetime index: we lose almost half the speed. Given that a datetime only needs to be scaled coming from Pandas, IIRC, that seems like a lot?
>>>>> Case 2, the other datetime index: no idea what is going on.
>>>>>
>>>>> Any insights are much appreciated.
>>>>>
>>>>> Cheers,
>>>>> Maarten.
>>>>>
>>>>>> On Sep 24, 2019, at 11:25 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>>
>>>>>> hi
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 9:26 AM Maarten Ballintijn <maart...@xs4all.nl> wrote:
>>>>>>>
>>>>>>> Hi Wes,
>>>>>>>
>>>>>>> Thanks for your quick response.
>>>>>>>
>>>>>>> Yes, we're using Python 3.7.4, from miniconda and conda-forge, and:
>>>>>>>
>>>>>>> numpy: 1.16.5
>>>>>>> pandas: 0.25.1
>>>>>>> pyarrow: 0.14.1
>>>>>>>
>>>>>>> It looks like 0.15 is close, so I can wait for that.
>>>>>>>
>>>>>>> Theoretically I see three components driving the performance:
>>>>>>> 1) The cost of locating the column (directory overhead)
>>>>>>> 2) The overhead of reading a single column (reading and processing meta data, setting up for reading)
>>>>>>> 3) Bulk reading and unmarshalling/decoding the data.
>>>>>>>
>>>>>>> Only 1) would be impacted by the number of columns, but if you're reading everything, ideally this would not be a problem.
>>>>>>
>>>>>> The problem is more nuanced than that. Parquet's metadata is somewhat "heavy" at the column level. So when you're writing thousands of columns, the fixed overhead associated with reading a single column becomes problematic. There are several data structures associated with decoding a column that have a fixed setup and teardown cost. Even if there is 1 millisecond of fixed overhead related to reading a column (I don't know what the number is exactly) then reading 10,000 columns has 10 seconds of unavoidable overhead. It might be useful for us to quantify and communicate the expected overhead when metadata and decoding is taken into account. Simply put, having more than 1000 columns is not advisable.
>>>>>>
>>>>>>> Based on an initial cursory look at the Parquet format I guess the index and the column meta-data might need to be read in full, so I can see how that might slow down reading only a few columns out of a large set. But that was not really the case here?
>>>>>>>
>>>>>>> What would you suggest for looking into the date index slow-down?
>>>>>>
>>>>>> Can you show a code example to make things easier for us to see what you're seeing?
>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Maarten.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Sep 23, 2019, at 7:07 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> hi Maarten,
>>>>>>>>
>>>>>>>> Are you using the master branch or 0.14.1? There are a number of performance regressions in 0.14.0/0.14.1 that are addressed in the master branch, to appear as 0.15.0 relatively soon.
>>>>>>>>
>>>>>>>> As a file format, Parquet (and columnar formats in general) is not known to perform well with more than 1000 columns.
>>>>>>>>
>>>>>>>> On the other items, we'd be happy to work with you to dig through the performance issues you're seeing.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Wes
>>>>>>>>
>>>>>>>> On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <maart...@xs4all.nl> wrote:
>>>>>>>>>
>>>>>>>>> Greetings,
>>>>>>>>>
>>>>>>>>> We have Pandas DataFrames with typically about 6,000 rows using DateTimeIndex.
>>>>>>>>> They have about 20,000 columns with integer column labels, and data with a dtype of float32.
>>>>>>>>>
>>>>>>>>> We'd like to store these dataframes with parquet, using the ability to read a subset of columns and to store meta-data with the file.
>>>>>>>>>
>>>>>>>>> We've found the reading performance less than expected compared to the published benchmarks (e.g. Wes' blog post).
>>>>>>>>>
>>>>>>>>> Using a modified version of his script we did reproduce his results (~1 GB/s for high entropy, no dict, on a MacBook Pro).
>>>>>>>>>
>>>>>>>>> But there seem to be three factors that contribute to the slowdown for our datasets:
>>>>>>>>>
>>>>>>>>> - DateTimeIndex is much slower than an Int index (we see about a factor 5).
>>>>>>>>> - The number of columns impacts reading speed significantly (factor ~2 going from 16 to 16,000 columns).
>>>>>>>>> - 'use_pandas_metadata=True' slows down reading significantly and appears unnecessary? (about 40%)
>>>>>>>>>
>>>>>>>>> Are there ways we could speed up the reading? Should we use a different layout?
>>>>>>>>>
>>>>>>>>> Thanks for your help and insights!
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Maarten
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ps. the routines we used:
>>>>>>>>>
>>>>>>>>> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
>>>>>>>>>     table = pa.Table.from_pandas(df)
>>>>>>>>>     pq.write_table(table, fname, use_dictionary=False, compression=None)
>>>>>>>>>     return
>>>>>>>>>
>>>>>>>>> def read_arrow_parquet(fname: str) -> pd.DataFrame:
>>>>>>>>>     table = pq.read_table(fname, use_pandas_metadata=False, use_threads=True)
>>>>>>>>>     df = table.to_pandas()
>>>>>>>>>     return df
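
ps. For anyone wanting to try the Arrow-side workaround Joris mentions above (casting to ns before converting to pandas), here is a minimal sketch; it reuses Joris's microsecond example and assumes Array.cast is available in your pyarrow version:

import pandas as pd
import pyarrow as pa

# Microsecond-resolution Arrow array, as in Joris's timing example
a_us = pa.array(pd.date_range("2012", periods=100000, freq='S'), pa.timestamp('us'))

# Cast to nanoseconds on the Arrow side first, then convert to pandas.
# Note: per ARROW-6704 the cast does not yet check for out-of-bounds values.
a_ns = a_us.cast(pa.timestamp('ns'))
s = a_ns.to_pandas()

The same idea should apply to a whole Table by casting its schema (Table.cast, where available) before calling to_pandas(), but I have not timed that.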