On Tue, 1 Oct 2019 at 21:03, Maarten Ballintijn <maart...@xs4all.nl> wrote:
> I ran cProfile to understand better what is going on in Pandas. Using your code below I find that Pandas runs a loop over the generic datetime64 conversion in case the datetime64 is not in 'ns'. The conversion unpacks the time into a date-time struct and converts the date-time struct back into a Timestamp. This is clearly slow and unnecessary.
>
> Adding a dedicated us->ns conversion should be possible with much better performance (either in Cython or C).
>
> I can add a request for this on Pandas-issues unless someone thinks this is not a good idea?
>

Hi Maarten,

Also speaking as a pandas developer, I think it is certainly worth raising this on the pandas issue tracker.

> Somewhat surprisingly, specifying the dtype on the pd.Series constructor triggers another code path, which is a further factor ~5 slower.
>

Hmm, it seems in that case it is falling back to `pd.to_datetime`, which apparently has no special support at all for non-ns datetime64 arrays, and first converts to an object array to then parse it again.

Joris

> On Oct 1, 2019, at 7:07 AM, Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
>
> > Some answers to the other questions:
> >
> > On Sat, 28 Sep 2019 at 22:16, Maarten Ballintijn <maart...@xs4all.nl> wrote:
> >
> >> ...
> >> This leaves me with the following questions:
> >>
> >> - Who should I talk to to get this resolved in Pandas?
> >
> > You can open an issue on their tracker: https://github.com/pandas-dev/pandas/issues/
> >
> >> - Do you think it would be possible to take the DateTime column out of Arrow into numpy and transform it there to make it more amenable to Pandas? And possibly even for the value columns?
> >
> > The reason that this non-ns column takes more time is the conversion to nanoseconds. So one way around this could be to already cast your arrow table to ns before converting to pandas (this is implemented in arrow, but currently does not check for out-of-bounds values; I opened https://issues.apache.org/jira/browse/ARROW-6704 for this).
> >
> > But I am not sure it can be improved by a big factor on the pandas side. Let's look at some timings using:
> >
> > arr_int = np.arange(100000)
> > arr_ns = np.asarray(pd.date_range("2012", periods=100000, freq='S'))
> > arr_us = arr_ns.astype("datetime64[us]")
> >
> > Conversion from those arrays to a pandas Series only takes more time for the microsecond datetimes:
> >
> > In [6]: %timeit pd.Series(arr_int)
> > 47.2 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> >
> > In [7]: %timeit pd.Series(arr_ns)
> > 27.7 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> >
> > In [8]: %timeit pd.Series(arr_us)
> > 6.22 ms ± 289 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >
> > But just converting microseconds to nanoseconds already takes almost a millisecond on this data (with numpy, but arrow is not faster):
> >
> > In [15]: %timeit arr_us.astype("datetime64[ns]")
> > 742 µs ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> >
> > So it should certainly be possible to get those 6 ms down to something around 1 ms (which can be a nice improvement! you're welcome to open an issue on the pandas issue tracker for it), but it will still be a factor of 20 to 30 slower than the nanosecond case.
> >
> > Joris
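A minimal sketch of that "cast to ns on the Arrow side first" idea, assuming pyarrow's Table.cast API and keeping in mind the out-of-bounds caveat tracked in ARROW-6704:

import numpy as np
import pandas as pd
import pyarrow as pa

# Build a microsecond-resolution Arrow table from a numpy datetime64[us] array.
arr_us = np.asarray(pd.date_range("2012", periods=100000, freq="S")).astype("datetime64[us]")
table = pa.Table.from_arrays([pa.array(arr_us, type=pa.timestamp("us"))], names=["ts"])

# Cast to nanoseconds while still in Arrow, so pandas receives datetime64[ns]
# directly and never hits its slow us->ns conversion path.
table_ns = table.cast(pa.schema([("ts", pa.timestamp("ns"))]))
df = table_ns.to_pandas()
print(df["ts"].dtype)  # expected: datetime64[ns]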
> >
> >> On Sep 25, 2019, at 11:57 AM, Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
> >>
> >>> From looking a little bit further into this, it seems that it is mainly pandas that is slower in creating a Series from an array of datetime64 compared to an array of ints, and especially if it is not in nanosecond resolution:
> >>>
> >>> In [29]: a_int = pa.array(np.arange(100000))
> >>>
> >>> In [30]: %timeit a_int.to_pandas()
> >>> 56.7 µs ± 299 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> >>>
> >>> In [31]: a_datetime = pa.array(pd.date_range("2012", periods=100000, freq='S'))
> >>>
> >>> In [32]: %timeit a_datetime.to_pandas()
> >>> 1.94 ms ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> >>>
> >>> In [33]: a_datetime_us = pa.array(pd.date_range("2012", periods=100000, freq='S'), pa.timestamp('us'))
> >>>
> >>> In [34]: %timeit a_datetime_us.to_pandas()
> >>> 7.78 ms ± 46.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >>>
> >>> Creating the datetime64 array inside pyarrow is also a bit slower compared to int (causing the slower conversion of a_datetime), but the above difference between nanosecond and microsecond resolution is largely due to pandas, not pyarrow (because pandas needs to convert the microseconds to nanoseconds, and during that conversion will also check that no datetimes were out of bounds for this resolution).
> >>>
> >>> And in parquet, the datetime data of the index column will be stored in microsecond resolution (even if the original pandas data was in nanosecond resolution). The slower reading of the parquet file with a datetime index is thus almost entirely due to the above difference in the timing of converting the int or datetime index column to pandas.
> >>> Parquet nowadays actually supports storing nanosecond resolution, and this can be triggered in pyarrow by passing version="2.0" to pq.write_table (but the last I heard, this version is not yet considered production ready).
> >>>
> >>> Joris
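For reference, a quick sketch of the version="2.0" option mentioned above; whether nanoseconds actually survive the round trip depends on the pyarrow and parquet versions in use, so treat the expected schemas as the behaviour described in this thread rather than a guarantee:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=pd.date_range("2012", periods=3, freq="S"))
table = pa.Table.from_pandas(df)

# Default parquet format (1.0): nanosecond timestamps are coerced to microseconds
# on write and converted back to ns during the read + to_pandas step.
pq.write_table(table, "v1.parquet")

# Format version 2.0 can store nanosecond timestamps directly, so no us->ns
# conversion is needed on read (but it was not considered production ready
# at the time of this thread).
pq.write_table(table, "v2.parquet", version="2.0")

print(pq.read_table("v1.parquet").schema)  # index column expected as timestamp[us]
print(pq.read_table("v2.parquet").schema)  # index column expected as timestamp[ns]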
> >>>
> >>> On Wed, 25 Sep 2019 at 16:03, Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
> >>>
> >>>> Hi Maarten,
> >>>>
> >>>> Thanks for the reproducible script. I ran it on my laptop on pyarrow master, and I am not seeing the difference between the two datetime indexes:
> >>>>
> >>>> Versions:
> >>>> Python: 3.7.3 | packaged by conda-forge | (default, Mar 27 2019, 23:01:00)
> >>>> [GCC 7.3.0] on linux
> >>>> numpy: 1.16.4
> >>>> pandas: 0.26.0.dev0+447.gc168ecf26
> >>>> pyarrow: 0.14.1.dev642+g7f2d637db
> >>>>
> >>>> 1073741824 float64 8388608 16
> >>>> 0: make_dataframe : 1443.483 msec, 709 MB/s
> >>>> 0: write_arrow_parquet : 7685.426 msec, 133 MB/s
> >>>> 0: read_arrow_parquet : 1262.741 msec, 811 MB/s <<<
> >>>> 1: make_dataframe : 1412.575 msec, 725 MB/s
> >>>> 1: write_arrow_parquet : 7869.145 msec, 130 MB/s
> >>>> 1: read_arrow_parquet : 1947.896 msec, 526 MB/s <<<
> >>>> 2: make_dataframe : 1490.165 msec, 687 MB/s
> >>>> 2: write_arrow_parquet : 7040.507 msec, 145 MB/s
> >>>> 2: read_arrow_parquet : 1888.316 msec, 542 MB/s <<<
> >>>>
> >>>> The only change I needed to make in the script to get it running (within my memory limits) was the creation of the second DatetimeIndex: pd.date_range('1970-01-01', '2019-09-01', freq='S') creates an index of 1.5 billion elements, while only the last part of it is used. So I changed that to index = pd.date_range('2018-01-01', '2019-09-01', freq='S').array[-rows:].
> >>>>
> >>>> The datetime index reading is in general still slower than the int index. But doing some more detailed timings, it seems this is not due to the reading of parquet, but to the conversion of arrow to pandas (using the files from the benchmark):
> >>>>
> >>>> In [1]: import pyarrow.parquet as pq
> >>>>
> >>>> In [4]: %timeit pq.read_table('testdata.int.parquet')
> >>>> 41.5 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>>>
> >>>> In [5]: %timeit pq.read_table('testdata.dt.parquet')
> >>>> 43 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>>>
> >>>> In [6]: table_int = pq.read_table('testdata.int.parquet')
> >>>>
> >>>> In [7]: table_datetime = pq.read_table('testdata.dt.parquet')
> >>>>
> >>>> In [8]: %timeit table_int.to_pandas()
> >>>> 14.3 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >>>>
> >>>> In [9]: %timeit table_datetime.to_pandas()
> >>>> 47.2 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>>>
> >>>> So you can see that the parquet reading part is basically identical, but the conversion to pandas is much slower for the datetime-index case. I will try to look into that code path to see what makes this so much slower.
> >>>>
> >>>> Joris
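The same read_table vs. to_pandas split can be reproduced outside IPython with a small script; a sketch, using the file names from the benchmark above (absolute timings will of course differ per machine):

import time
import pyarrow.parquet as pq

def timed(label, func, *args):
    # Run func once and report wall-clock time in milliseconds.
    t0 = time.perf_counter()
    result = func(*args)
    print(f"{label:>28s}: {(time.perf_counter() - t0) * 1000:8.1f} ms")
    return result

# Parquet decoding itself is roughly the same for both files; the gap shows
# up in the Arrow -> pandas conversion of the datetime index.
table_int = timed("read_table (int index)", pq.read_table, "testdata.int.parquet")
table_dt = timed("read_table (datetime index)", pq.read_table, "testdata.dt.parquet")
timed("to_pandas (int index)", table_int.to_pandas)
timed("to_pandas (datetime index)", table_dt.to_pandas)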
> >>>>
> >>>> On Tue, 24 Sep 2019 at 22:28, Maarten Ballintijn <maart...@xs4all.nl> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> The code to show the performance issue with DateTimeIndex is at:
> >>>>>
> >>>>> https://gist.github.com/maartenb/256556bcd6d7c7636d400f3b464db18c
> >>>>>
> >>>>> It shows three cases: 0) int index, 1) datetime index, 2) datetime index created in a slightly roundabout way.
> >>>>>
> >>>>> I'm a little confused by the two datetime cases. Case 2) is much slower, but the df compares identical to case 1). (I originally used something like 2) to match our specific data. I don't see why it behaves differently??)
> >>>>>
> >>>>> The timings I find are:
> >>>>>
> >>>>> 1073741824 float64 8388608 16
> >>>>> 0: make_dataframe : 2390.830 msec, 428 MB/s
> >>>>> 0: write_arrow_parquet : 2486.463 msec, 412 MB/s
> >>>>> 0: read_arrow_parquet : 813.946 msec, 1258 MB/s <<<
> >>>>> 1: make_dataframe : 2579.815 msec, 397 MB/s
> >>>>> 1: write_arrow_parquet : 2708.151 msec, 378 MB/s
> >>>>> 1: read_arrow_parquet : 1413.999 msec, 724 MB/s <<<
> >>>>> 2: make_dataframe : 15126.520 msec, 68 MB/s
> >>>>> 2: write_arrow_parquet : 9205.815 msec, 111 MB/s
> >>>>> 2: read_arrow_parquet : 5929.346 msec, 173 MB/s <<<
> >>>>>
> >>>>> Case 0, int index: this is all great.
> >>>>> Case 1, datetime index: we lose almost half the speed. Given that a datetime only needs to be scaled for Pandas, IIRC, that seems like a lot?
> >>>>> Case 2, the other datetime index: no idea what is going on.
> >>>>>
> >>>>> Any insights are much appreciated.
> >>>>>
> >>>>> Cheers,
> >>>>> Maarten.
> >>>>>
> >>>>> On Sep 24, 2019, at 11:25 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> >>>>>
> >>>>>> hi
> >>>>>>
> >>>>>> On Tue, Sep 24, 2019 at 9:26 AM Maarten Ballintijn <maart...@xs4all.nl> wrote:
> >>>>>>
> >>>>>>> Hi Wes,
> >>>>>>>
> >>>>>>> Thanks for your quick response.
> >>>>>>>
> >>>>>>> Yes, we're using Python 3.7.4, from miniconda and conda-forge, and:
> >>>>>>>
> >>>>>>> numpy: 1.16.5
> >>>>>>> pandas: 0.25.1
> >>>>>>> pyarrow: 0.14.1
> >>>>>>>
> >>>>>>> It looks like 0.15 is close, so I can wait for that.
> >>>>>>>
> >>>>>>> Theoretically I see three components driving the performance:
> >>>>>>> 1) The cost of locating the column (directory overhead)
> >>>>>>> 2) The overhead of reading a single column (reading and processing metadata, setting up for reading)
> >>>>>>> 3) Bulk reading and unmarshalling/decoding the data.
> >>>>>>>
> >>>>>>> Only 1) would be impacted by the number of columns, but if you're reading everything, ideally this would not be a problem.
> >>>>>>
> >>>>>> The problem is more nuanced than that. Parquet's metadata is somewhat "heavy" at the column level. So when you're writing thousands of columns, the fixed overhead associated with reading a single column becomes problematic. There are several data structures associated with decoding a column that have a fixed setup and teardown cost. Even if there is 1 millisecond of fixed overhead related to reading a column (I don't know what the number is exactly), then reading 10,000 columns has 10 seconds of unavoidable overhead. It might be useful for us to quantify and communicate the expected overhead when metadata and decoding is taken into account. Simply put, having more than 1000 columns is not advisable.
> >>>>>>
> >>>>>>> Based on an initial cursory look at the Parquet format, I guess the index and the column meta-data might need to be read in full, so I can see how that might slow down reading only a few columns out of a large set. But that was not really the case here?
> >>>>>>>
> >>>>>>> What would you suggest for looking into the date index slow-down?
> >>>>>>
> >>>>>> Can you show a code example to make things easier for us to see what you're seeing?
> >>>>>>
> >>>>>>> Cheers,
> >>>>>>> Maarten.
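To illustrate the per-column overhead point in practice, one mitigation is to read only the columns that are actually needed; a sketch with hypothetical column names (parquet field names are strings, so integer pandas labels typically end up as "0", "1", ... in the file schema; check pq.read_schema() against your own file):

import pyarrow.parquet as pq

# Each parquet column carries fixed metadata/decoding overhead, so reading a
# 20,000-column file pays that cost 20,000 times. Selecting a subset up front
# scales the overhead with len(columns) instead.
print(pq.read_schema("testdata.dt.parquet"))   # inspect the actual field names first
wanted = [str(i) for i in range(16)]           # hypothetical: integer labels stored as strings
table = pq.read_table("testdata.dt.parquet", columns=wanted, use_pandas_metadata=True)
df = table.to_pandas()
print(df.shape)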
> >>>>>>>
> >>>>>>> On Sep 23, 2019, at 7:07 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> hi Maarten,
> >>>>>>>>
> >>>>>>>> Are you using the master branch or 0.14.1? There are a number of performance regressions in 0.14.0/0.14.1 that are addressed in the master branch, to appear as 0.15.0 relatively soon.
> >>>>>>>>
> >>>>>>>> As a file format, Parquet (and columnar formats in general) is not known to perform well with more than 1000 columns.
> >>>>>>>>
> >>>>>>>> On the other items, we'd be happy to work with you to dig through the performance issues you're seeing.
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> Wes
> >>>>>>>>
> >>>>>>>> On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <maart...@xs4all.nl> wrote:
> >>>>>>>>
> >>>>>>>>> Greetings,
> >>>>>>>>>
> >>>>>>>>> We have Pandas DataFrames with typically about 6,000 rows using a DateTimeIndex. They have about 20,000 columns with integer column labels, and data with a dtype of float32.
> >>>>>>>>>
> >>>>>>>>> We'd like to store these dataframes with parquet, using the ability to read a subset of columns and to store meta-data with the file.
> >>>>>>>>>
> >>>>>>>>> We've found the reading performance less than expected compared to the published benchmarks (e.g. Wes' blog post).
> >>>>>>>>>
> >>>>>>>>> Using a modified version of his script we did reproduce his results (~1 GB/s for high entropy, no dict, on a MacBook Pro).
> >>>>>>>>>
> >>>>>>>>> But there seem to be three factors that contribute to the slowdown for our datasets:
> >>>>>>>>>
> >>>>>>>>> - DateTimeIndex is much slower than an int index (we see about a factor 5).
> >>>>>>>>> - The number of columns impacts reading speed significantly (factor ~2 going from 16 to 16,000 columns).
> >>>>>>>>> - 'use_pandas_metadata=True' slows down reading significantly and appears unnecessary? (about 40%)
> >>>>>>>>>
> >>>>>>>>> Are there ways we could speed up the reading? Should we use a different layout?
> >>>>>>>>>
> >>>>>>>>> Thanks for your help and insights!
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>> Maarten
> >>>>>>>>>
> >>>>>>>>> ps. the routines we used:
> >>>>>>>>>
> >>>>>>>>> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
> >>>>>>>>>     table = pa.Table.from_pandas(df)
> >>>>>>>>>     pq.write_table(table, fname, use_dictionary=False, compression=None)
> >>>>>>>>>     return
> >>>>>>>>>
> >>>>>>>>> def read_arrow_parquet(fname: str) -> pd.DataFrame:
> >>>>>>>>>     table = pq.read_table(fname, use_pandas_metadata=False, use_threads=True)
> >>>>>>>>>     df = table.to_pandas()
> >>>>>>>>>     return df
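Finally, a sketch for checking the use_pandas_metadata observation against one of the benchmark files (the ~40% figure above is from Maarten's data; whether it reproduces will depend on the file and the pyarrow version):

import time
import pyarrow.parquet as pq

def bench(use_pandas_metadata: bool) -> None:
    # Time a full read plus Arrow -> pandas conversion for the given flag.
    t0 = time.perf_counter()
    table = pq.read_table("testdata.dt.parquet", use_threads=True,
                          use_pandas_metadata=use_pandas_metadata)
    df = table.to_pandas()
    elapsed = (time.perf_counter() - t0) * 1000
    print(f"use_pandas_metadata={use_pandas_metadata}: {elapsed:.1f} ms, shape={df.shape}")

bench(False)
bench(True)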