[jira] [Comment Edited] (ARROW-17913) feather.read_table 150x slower when reading columns in newer versions

Joris Van den Bossche (Jira) Mon, 03 Oct 2022 07:51:22 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-17913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612323#comment-17612323
 ]


Joris Van den Bossche edited comment on ARROW-17913 at 10/3/22 2:49 PM:
------------------------------------------------------------------------

I am not sure offhand what <=6.0 did differently, but looking at the current 
implementation this is somewhat expected (although it may well be possible to 
implement it in a better way): when columns are specified, it reads each 
column separately from the MemoryMappedFile (instead of doing a single ReadAt 
call) and copies each chunk it reads into a single output buffer. Because of 
this copy, the memory mapping basically has no effect in this case 
(https://github.com/apache/arrow/blob/ec579df631deaa8f6186208ed2a4ebec00581dfa/cpp/src/arrow/io/file.h#L182-L185)
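
To illustrate the effect described above, here is a minimal standard-library sketch. It is not Arrow's code: the file layout, the byte ranges and the helper {{gather_ranges}} are hypothetical and only mimic the pattern of "read each range, copy it into one output buffer". A slice of a memory-mapped view stays backed by the mapping, whereas gathering ranges into a single buffer has to copy every byte.

```python
import mmap
import os
import tempfile


def gather_ranges(mm, ranges):
    """Copy each (offset, length) range of the mapping into one output buffer."""
    out = bytearray(sum(length for _, length in ranges))
    pos = 0
    for offset, length in ranges:
        # Slicing an mmap object returns bytes, i.e. the data is copied here.
        out[pos:pos + length] = mm[offset:offset + length]
        pos += length
    return out


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file with three 4-byte "columns" laid out back to back.
    path = os.path.join(tmp, "columns.bin")
    with open(path, "wb") as f:
        f.write(b"AAAABBBBCCCC")

    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        with memoryview(mm) as whole:
            with whole[4:8] as view:
                # The slice is still backed by the mapping: nothing was copied.
                shares_mapping = view.obj is mm
                view_bytes = bytes(view)
        # Selecting the first and third "column" forces a copy of both ranges.
        gathered = gather_ranges(mm, [(0, 4), (8, 4)])
```

In the zero-copy case the cost is independent of the data size, while the gather step scales with the number of bytes selected, which would be consistent with the timings below.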

This can also be seen when you compare timings with and without memory mapping 
(with {{memory_map=False}}, it no longer makes a difference whether or not you 
manually select all columns):

{code}
In [5]: %timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=True)
29.4 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=False)
35.3 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit feather.read_table('test.feather', memory_map=True)
239 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [8]: %timeit feather.read_table('test.feather', memory_map=False)
35 ms ± 428 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
{code}
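
For reference, the ratios implied by these timings can be worked out directly. The snippet below only does arithmetic on the mean values reported above; the dictionary keys are labels chosen for this illustration.

```python
# Mean timings copied from the %timeit output above, in seconds.
# Keys are (column selection, memory_map).
timings = {
    ("columns", True): 29.4e-3,
    ("columns", False): 35.3e-3,
    ("all", True): 239e-6,
    ("all", False): 35e-3,
}

# How much slower is an explicit column selection than reading everything?
mmap_slowdown = timings[("columns", True)] / timings[("all", True)]
no_mmap_ratio = timings[("columns", False)] / timings[("all", False)]

print(f"memory_map=True:  selecting columns is {mmap_slowdown:.0f}x slower")
print(f"memory_map=False: selecting columns is {no_mmap_ratio:.2f}x slower")
# prints:
# memory_map=True:  selecting columns is 123x slower
# memory_map=False: selecting columns is 1.01x slower
```

So with memory mapping the column selection costs roughly two orders of magnitude, while without it the two code paths take essentially the same time.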

Now, I would have assumed that the buffers of all columns do not need to live 
in a single body, so I am not 100% sure why each field has to be copied into a 
single output.


> feather.read_table 150x slower when reading columns in newer versions
> ---------------------------------------------------------------------
>
>                 Key: ARROW-17913
>                 URL: https://issues.apache.org/jira/browse/ARROW-17913
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 7.0.0, 8.0.0, 9.0.0
>         Environment: python 3.9, ubuntu 20.04
>            Reporter: Håkon Magne Holmen
>            Priority: Major
>              Labels: feather, performance
>
> h3. Description
> Performance when reading columns using {{feather.read_table}} on Arrow 
> 7.0.0-9.0.0 is drastically slower than it was in 6.0.0.
> Profiling the code below shows that the bottleneck is somewhere in the 
> {{read_names}} function of {{pyarrow._feather.FeatherReader}}.
> h5. Example
> Setup code:
> {code}
> import pandas as pd
> from pyarrow import feather
> rows, cols = (1_000_000, 10)
> data = {f'c{c}': range(rows) for c in range(cols)}
> df = pd.DataFrame(data=data)
> feather.write_feather(df, 'test.feather', compression="uncompressed")
> {code}
> Benchmarks Arrow 9.0.0:
> {code}
> %timeit feather.read_table('test.feather', memory_map=True)
> %timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=True)
> > 178 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> 33.8 ms ± 964 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> Benchmarks Arrow 6.0.0:
> {code}
> %timeit feather.read_table('test.feather', memory_map=True)
> %timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=True)
> > 173 µs ± 2.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> 224 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
