I ran cProfile to better understand what is going on in Pandas. Using your code below I find that, when the datetime64 data is not in 'ns' resolution, Pandas runs a loop over the generic datetime64 conversion: each value is unpacked into a date-time struct and that struct is then converted back into a Timestamp. This is clearly slow and unnecessary.
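For anyone who wants to reproduce this, a minimal sketch of the profiling run (it reuses the same microsecond array as in Joris's timings quoted below; this is only an approximation of what I ran):

import cProfile
import numpy as np
import pandas as pd

# Same arrays as in Joris's timings below: a ns-resolution array,
# then downcast to microseconds
arr_ns = np.asarray(pd.date_range("2012", periods=100000, freq="S"))
arr_us = arr_ns.astype("datetime64[us]")

# Profile the Series construction for the non-ns array; per the analysis
# above, the time is dominated by the generic per-element
# datetime64 -> Timestamp conversion loop
cProfile.run("pd.Series(arr_us)", sort="cumulative")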
Adding a dedicated us->ns conversion (either in Cython or C) should give much better performance. I can open a request for this on the pandas issue tracker, unless someone thinks this is not a good idea?

Somewhat surprisingly, specifying the dtype in the pd.Series constructor triggers another code path which is a further factor of ~5 slower.

> On Oct 1, 2019, at 7:07 AM, Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
>
> Some answers to the other questions:
>
> On Sat, 28 Sep 2019 at 22:16, Maarten Ballintijn <maart...@xs4all.nl> wrote:
>
>> ...
>> This leaves me with the following questions:
>>
>> - Who should I talk to to get this resolved in Pandas?
>>
> You can open an issue on their tracker: https://github.com/pandas-dev/pandas/issues/
>
>> - Do you think it would be possible to take the DateTime column out of Arrow into numpy and transform it to make it more amenable to Pandas? And possibly even for the value columns?
>>
> The reason that this non-ns column takes more time is due to the conversion to nanoseconds.
> So one way could be to already cast your arrow table to ns before converting to pandas (this is implemented in arrow, but currently does not check for out-of-bounds values; I opened https://issues.apache.org/jira/browse/ARROW-6704 for this).
>
> But I am not sure it can be improved by a big factor on the pandas side. Let's look at some timings using:
>
> arr_int = np.arange(100000)
> arr_ns = np.asarray(pd.date_range("2012", periods=100000, freq='S'))
> arr_us = arr_ns.astype("datetime64[us]")
>
> Conversion from those arrays to a pandas Series only takes more time for the microsecond datetimes:
>
> In [6]: %timeit pd.Series(arr_int)
> 47.2 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>
> In [7]: %timeit pd.Series(arr_ns)
> 27.7 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>
> In [8]: %timeit pd.Series(arr_us)
> 6.22 ms ± 289 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>
> But just converting microseconds to nanoseconds also already takes almost a millisecond on this data (with numpy, but arrow is not faster):
>
> In [15]: %timeit arr_us.astype("datetime64[ns]")
> 742 µs ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>
> So it should certainly be possible to get those 6 ms down to something around 1 ms (which would be a nice improvement! You're welcome to open an issue on the pandas issue tracker for it), but it will still be a factor of 20 to 30 slower than the nanosecond case.
>
> Joris
>
>>> On Sep 25, 2019, at 11:57 AM, Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
>>>
>>> From looking a little bit further into this, it seems that it is mainly pandas that is slower in creating a Series from an array of datetime64 compared to an array of ints.
>>> And especially if it is not nanosecond resolution:
>>>
>>> In [29]: a_int = pa.array(np.arange(100000))
>>>
>>> In [30]: %timeit a_int.to_pandas()
>>> 56.7 µs ± 299 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>>
>>> In [31]: a_datetime = pa.array(pd.date_range("2012", periods=100000, freq='S'))
>>>
>>> In [32]: %timeit a_datetime.to_pandas()
>>> 1.94 ms ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>>
>>> In [33]: a_datetime_us = pa.array(pd.date_range("2012", periods=100000, freq='S'), pa.timestamp('us'))
>>>
>>> In [34]: %timeit a_datetime_us.to_pandas()
>>> 7.78 ms ± 46.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>>
>>> Creating the datetime64 array inside pyarrow is also a bit slower compared to int (causing the slower conversion of a_datetime), but the above difference between nanosecond and microsecond resolution is largely due to pandas, not pyarrow (because pandas needs to convert the microseconds to nanoseconds, and during that conversion will also check that no datetimes were out of bounds for this resolution).
>>>
>>> And in parquet, the datetime data of the index column will be stored in microsecond resolution (even if the original pandas data was nanosecond resolution). The slower reading of the parquet file with datetime index is thus almost entirely due to the above difference in timing when converting the int or datetime index column to pandas.
>>> Parquet nowadays actually supports storing nanosecond resolution, and this can be triggered in pyarrow by passing version="2.0" to pq.write_table (but last I heard this version is not yet considered production ready).
>>>
>>> Joris
>>>
>>> On Wed, 25 Sep 2019 at 16:03, Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
>>>
>>>> Hi Maarten,
>>>>
>>>> Thanks for the reproducible script. I ran it on my laptop on pyarrow master, and I am not seeing the difference between both datetime indexes:
>>>>
>>>> Versions:
>>>> Python: 3.7.3 | packaged by conda-forge | (default, Mar 27 2019, 23:01:00) [GCC 7.3.0] on linux
>>>> numpy: 1.16.4
>>>> pandas: 0.26.0.dev0+447.gc168ecf26
>>>> pyarrow: 0.14.1.dev642+g7f2d637db
>>>>
>>>> 1073741824 float64 8388608 16
>>>> 0: make_dataframe : 1443.483 msec, 709 MB/s
>>>> 0: write_arrow_parquet : 7685.426 msec, 133 MB/s
>>>> 0: read_arrow_parquet : 1262.741 msec, 811 MB/s <<<
>>>> 1: make_dataframe : 1412.575 msec, 725 MB/s
>>>> 1: write_arrow_parquet : 7869.145 msec, 130 MB/s
>>>> 1: read_arrow_parquet : 1947.896 msec, 526 MB/s <<<
>>>> 2: make_dataframe : 1490.165 msec, 687 MB/s
>>>> 2: write_arrow_parquet : 7040.507 msec, 145 MB/s
>>>> 2: read_arrow_parquet : 1888.316 msec, 542 MB/s <<<
>>>>
>>>> The only change I needed to make in the script to get it running (within my memory limits) was the creation of the second DatetimeIndex: pd.date_range('1970-01-01', '2019-09-01', freq='S') creates an index of 1.5 billion elements, while only the last part of it is used. So I changed that to index = pd.date_range('2018-01-01', '2019-09-01', freq='S').array[-rows:].
>>>>
>>>> The datetime index reading in general is still slower than the int index. But doing a bit more detailed timings, it seems this is not due to the reading of parquet, but to the conversion of arrow to pandas (using the files from the benchmark):
>>>>
>>>> In [1]: import pyarrow.parquet as pq
>>>>
>>>> In [4]: %timeit pq.read_table('testdata.int.parquet')
>>>> 41.5 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>>>
>>>> In [5]: %timeit pq.read_table('testdata.dt.parquet')
>>>> 43 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>>>
>>>> In [6]: table_int = pq.read_table('testdata.int.parquet')
>>>>
>>>> In [7]: table_datetime = pq.read_table('testdata.dt.parquet')
>>>>
>>>> In [8]: %timeit table_int.to_pandas()
>>>> 14.3 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>>>
>>>> In [9]: %timeit table_datetime.to_pandas()
>>>> 47.2 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>>>
>>>> So you can see that the parquet reading part is basically identical, but the conversion to pandas is much slower for the datetime-index case.
>>>> I will try to look into that code path to see what makes this so much slower.
>>>>
>>>> Joris
>>>>
>>>> On Tue, 24 Sep 2019 at 22:28, Maarten Ballintijn <maart...@xs4all.nl> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> The code to show the performance issue with DateTimeIndex is at:
>>>>>
>>>>> https://gist.github.com/maartenb/256556bcd6d7c7636d400f3b464db18c
>>>>>
>>>>> It shows three cases: 0) int index, 1) datetime index, 2) datetime index created in a slightly roundabout way.
>>>>>
>>>>> I'm a little confused by the two datetime cases. Case 2) is much slower, but the df compares identical to case 1).
>>>>> (I originally used something like 2) to match our specific data. I don't see why it behaves differently??)
>>>>>
>>>>> The timings I find are:
>>>>>
>>>>> 1073741824 float64 8388608 16
>>>>> 0: make_dataframe : 2390.830 msec, 428 MB/s
>>>>> 0: write_arrow_parquet : 2486.463 msec, 412 MB/s
>>>>> 0: read_arrow_parquet : 813.946 msec, 1258 MB/s <<<
>>>>> 1: make_dataframe : 2579.815 msec, 397 MB/s
>>>>> 1: write_arrow_parquet : 2708.151 msec, 378 MB/s
>>>>> 1: read_arrow_parquet : 1413.999 msec, 724 MB/s <<<
>>>>> 2: make_dataframe : 15126.520 msec, 68 MB/s
>>>>> 2: write_arrow_parquet : 9205.815 msec, 111 MB/s
>>>>> 2: read_arrow_parquet : 5929.346 msec, 173 MB/s <<<
>>>>>
>>>>> Case 0, int index: this is all great.
>>>>> Case 1, datetime index: we lose almost half the speed. Given that a datetime only needs to be scaled coming from Pandas, IIRC, that seems like a lot?
>>>>> Case 2, the other datetime index: no idea what is going on.
>>>>>
>>>>> Any insights are much appreciated.
>>>>>
>>>>> Cheers,
>>>>> Maarten.
>>>>>
>>>>>> On Sep 24, 2019, at 11:25 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>>
>>>>>> hi
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 9:26 AM Maarten Ballintijn <maart...@xs4all.nl> wrote:
>>>>>>>
>>>>>>> Hi Wes,
>>>>>>>
>>>>>>> Thanks for your quick response.
>>>>>>>
>>>>>>> Yes, we're using Python 3.7.4, from miniconda and conda-forge, and:
>>>>>>>
>>>>>>> numpy: 1.16.5
>>>>>>> pandas: 0.25.1
>>>>>>> pyarrow: 0.14.1
>>>>>>>
>>>>>>> It looks like 0.15 is close, so I can wait for that.
>>>>>>>
>>>>>>> Theoretically I see three components driving the performance:
>>>>>>> 1) The cost of locating the column (directory overhead)
>>>>>>> 2) The overhead of reading a single column (reading and processing meta data, setting up for reading)
>>>>>>> 3) Bulk reading and unmarshalling/decoding the data.
>>>>>>>
>>>>>>> Only 1) would be impacted by the number of columns, but if you're reading everything, ideally this would not be a problem.
>>>>>>
>>>>>> The problem is more nuanced than that. Parquet's metadata is somewhat "heavy" at the column level. So when you're writing thousands of columns, the fixed overhead associated with reading a single column becomes problematic. There are several data structures associated with decoding a column that have a fixed setup and teardown cost. Even if there is 1 millisecond of fixed overhead related to reading a column (I don't know what the number is exactly) then reading 10,000 columns has 10 seconds of unavoidable overhead. It might be useful for us to quantify and communicate the expected overhead when metadata and decoding is taken into account. Simply put, having more than 1000 columns is not advisable.
>>>>>>
>>>>>>> Based on an initial cursory look at the Parquet format I guess the index and the column meta-data might need to be read in full, so I can see how that might slow down reading only a few columns out of a large set. But that was not really the case here?
>>>>>>>
>>>>>>> What would you suggest for looking into the date index slow-down?
>>>>>>
>>>>>> Can you show a code example to make things easier for us to see what you're seeing?
>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Maarten.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Sep 23, 2019, at 7:07 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> hi Maarten,
>>>>>>>>
>>>>>>>> Are you using the master branch or 0.14.1? There are a number of performance regressions in 0.14.0/0.14.1 that are addressed in the master branch, to appear as 0.15.0 relatively soon.
>>>>>>>>
>>>>>>>> As a file format, Parquet (and columnar formats in general) is not known to perform well with more than 1000 columns.
>>>>>>>>
>>>>>>>> On the other items, we'd be happy to work with you to dig through the performance issues you're seeing.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Wes
>>>>>>>>
>>>>>>>> On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <maart...@xs4all.nl> wrote:
>>>>>>>>>
>>>>>>>>> Greetings,
>>>>>>>>>
>>>>>>>>> We have Pandas DataFrames with typically about 6,000 rows using DateTimeIndex.
>>>>>>>>> They have about 20,000 columns with integer column labels, and data with a dtype of float32.
>>>>>>>>>
>>>>>>>>> We'd like to store these dataframes with parquet, using the ability to read a subset of columns and to store meta-data with the file.
>>>>>>>>>
>>>>>>>>> We've found the reading performance less than expected compared to the published benchmarks (e.g. Wes' blog post).
>>>>>>>>>
>>>>>>>>> Using a modified version of his script we did reproduce his results (~1 GB/s for high entropy, no dict, on a MacBook Pro).
>>>>>>>>>
>>>>>>>>> But there seem to be three factors that contribute to the slowdown for our datasets:
>>>>>>>>>
>>>>>>>>> - DateTimeIndex is much slower than an Int index (we see about a factor 5).
>>>>>>>>> - The number of columns impacts reading speed significantly (factor ~2 going from 16 to 16,000 columns).
>>>>>>>>> - 'use_pandas_metadata=True' slows down reading significantly and appears unnecessary? (about 40%)
>>>>>>>>>
>>>>>>>>> Are there ways we could speed up the reading? Should we use a different layout?
>>>>>>>>>
>>>>>>>>> Thanks for your help and insights!
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Maarten
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ps. the routines we used:
>>>>>>>>>
>>>>>>>>> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
>>>>>>>>>     table = pa.Table.from_pandas(df)
>>>>>>>>>     pq.write_table(table, fname, use_dictionary=False, compression=None)
>>>>>>>>>     return
>>>>>>>>>
>>>>>>>>> def read_arrow_parquet(fname: str) -> pd.DataFrame:
>>>>>>>>>     table = pq.read_table(fname, use_pandas_metadata=False, use_threads=True)
>>>>>>>>>     df = table.to_pandas()
>>>>>>>>>     return df
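
ps. For anyone wanting to try the Arrow-side workaround Joris mentions above (casting to ns before converting to pandas), here is a minimal sketch; it reuses Joris's microsecond example and assumes Array.cast is available in your pyarrow version:

import pandas as pd
import pyarrow as pa

# Microsecond-resolution Arrow array, as in Joris's timing example
a_us = pa.array(pd.date_range("2012", periods=100000, freq='S'), pa.timestamp('us'))

# Cast to nanoseconds on the Arrow side first, then convert to pandas.
# Note: per ARROW-6704 the cast does not yet check for out-of-bounds values.
a_ns = a_us.cast(pa.timestamp('ns'))
s = a_ns.to_pandas()

The same idea should apply to a whole Table by casting its schema (Table.cast, where available) before calling to_pandas(), but I have not timed that.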