Re: Parquet file reading performance

2019-10-01 Thread Joris Van den Bossche
On Tue, 1 Oct 2019 at 21:03, Maarten Ballintijn wrote:
> I ran cProfile to understand better what is going on in Pandas. Using your code below I find that Pandas runs a loop over the generic datetime64 conversion in case the datetime64 is not in ’ns’. The conversion unpacks the time
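For readers who want to repeat this kind of profiling, here is a minimal sketch using only the standard library. The helper `profile_call` and the stand-in workload `slow_conversion` are hypothetical names made up for illustration. They are not taken from the thread or from Pandas. The workload only imitates a per-element conversion loop.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, text report).

    Hypothetical helper, not part of pandas or pyarrow.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so expensive call paths are listed first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def slow_conversion(values):
    # Hypothetical stand-in for a per-element conversion loop.
    return [str(v) for v in values]


if __name__ == "__main__":
    _, report = profile_call(slow_conversion, range(100_000))
    print(report)
```

To profile a real workload, pass the function under investigation (for example, a function that reads a Parquet file into a DataFrame) in place of `slow_conversion`.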

Re: Parquet file reading performance

2019-10-01 Thread Joris Van den Bossche
Some answers to the other questions:
On Sat, 28 Sep 2019 at 22:16, Maarten Ballintijn wrote:
> ...
> This leaves me with the following questions:
>
> - Who should I talk to to get this resolved in Pandas?
You can open an issue on their tracker: https://github.com/pandas-dev/pandas/issues/

Re: Parquet file reading performance

2019-09-30 Thread Wes McKinney
On Sat, Sep 28, 2019 at 3:16 PM Maarten Ballintijn wrote:
> Hi Joris,
>
> Thanks for your detailed analysis!
>
> We can leave the impact of the large DateTimeIndex on the performance for another time.
> (Notes: my laptop has sufficient memory to support it, no error is thrown, the

Re: Parquet file reading performance

2019-09-28 Thread Maarten Ballintijn
Hi Joris,
Thanks for your detailed analysis!
We can leave the impact of the large DateTimeIndex on the performance for another time.
(Notes: my laptop has sufficient memory to support it, no error is thrown, the resulting DateTimeIndex from the expression is identical to your version or the

Re: Parquet file reading performance

2019-09-25 Thread Joris Van den Bossche
Hi Maarten,
Thanks for the reproducible script. I ran it on my laptop on pyarrow master, and I am not seeing a difference between the two datetime indexes:
Versions:
Python: 3.7.3 | packaged by conda-forge | (default, Mar 27 2019, 23:01:00) [GCC 7.3.0] on linux
numpy: 1.16.4
pandas:

Re: Parquet file reading performance

2019-09-24 Thread Maarten Ballintijn
Hi,
The code to show the performance issue with DateTimeIndex is at: https://gist.github.com/maartenb/256556bcd6d7c7636d400f3b464db18c
It shows three cases: 0) int index, 1) datetime index, 2) datetime index created in a slightly roundabout way.
I’m a little confused by the two

Re: Parquet file reading performance

2019-09-24 Thread Wes McKinney
hi
On Tue, Sep 24, 2019 at 9:26 AM Maarten Ballintijn wrote:
> Hi Wes,
>
> Thanks for your quick response.
>
> Yes, we’re using Python 3.7.4, from miniconda and conda-forge, and:
>
> numpy: 1.16.5
> pandas: 0.25.1
> pyarrow: 0.14.1
>
> It looks like 0.15 is close, so

Re: Parquet file reading performance

2019-09-24 Thread Maarten Ballintijn
Hi Wes,
Thanks for your quick response.
Yes, we’re using Python 3.7.4, from miniconda and conda-forge, and:
numpy: 1.16.5
pandas: 0.25.1
pyarrow: 0.14.1
It looks like 0.15 is close, so I can wait for that.
Theoretically I see three components driving the

Re: Parquet file reading performance

2019-09-23 Thread Wes McKinney
hi Maarten,
Are you using the master branch or 0.14.1? There are a number of performance regressions in 0.14.0/0.14.1 that are addressed in the master branch, to appear as 0.15.0 relatively soon.
As a file format, Parquet (and columnar formats in general) is not known to perform well with more