[jira] [Commented] (ARROW-17913) feather.read_table 150x slower when reading columns in newer versions

Joris Van den Bossche (Jira) Mon, 03 Oct 2022 07:51:26 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-17913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612326#comment-17612326
 ]


Joris Van den Bossche commented on ARROW-17913:
-----------------------------------------------

The {{ReadFieldsSubset}} that does this copying of each field into a single 
body, was added in https://github.com/apache/arrow/pull/11486 / ARROW-12683 
(which would match the observation that this changed in 6.0 -> 7.0)

I suppose before that change, we always loaded the full record batch in memory 
even when only reading a subset of the fields. As long as you are memory 
mapping, that might actually be OK, but if you are accessing data eg from S3, 
you want to read only those buffers that you need to avoid too much IO.

cc @David Li

> feather.read_table 150x slower when reading columns in newer versions
> ---------------------------------------------------------------------
>
>                 Key: ARROW-17913
>                 URL: https://issues.apache.org/jira/browse/ARROW-17913
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 7.0.0, 8.0.0, 9.0.0
>         Environment: python 3.9, ubuntu 20.04
>            Reporter: Håkon Magne Holmen
>            Priority: Major
>              Labels: feather, performance
>
> h3. Description
> Performance when reading columns using {{feather.read_table}} on Arrow 
> 7.0.0-9.0.0 is drastically slower than it was in 6.0.0.
> Profiling the code below shows that the bottleneck is somewhere in the 
> {{read_names}} function of {{pyarrow._feather.FeatherReader}}.
> h5. Example
> Setup code:
> {code}
> import pandas as pd
> from pyarrow import feather
> rows, cols = (1_000_000, 10)
> data = {f'c{c}': range(rows) for c in range(cols)}
> df = pd.DataFrame(data=data)
> feather.write_feather(df, 'test.feather', compression="uncompressed"){code} 
> Benchmarks Arrow 9.0.0:
> {code}
> %timeit feather.read_table('test.feather', memory_map=True)
> %timeit feather.read_table('test.feather', columns=list(df.columns), 
> memory_map=True)
> > 178 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> 33.8 ms ± 964 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> Benchmarks Arrow 6.0.0:
> {code}
> %timeit feather.read_table('test.feather', memory_map=True)
> %timeit feather.read_table('test.feather', columns=list(df.columns), 
> memory_map=True)
> > 173 µs ± 2.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> 224 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (ARROW-17913) feather.read_table 150x slower when reading columns in newer versions

Reply via email to