[jira] [Updated] (ARROW-11469) [Python] Performance degradation parquet reading of wide dataframes
[ https://issues.apache.org/jira/browse/ARROW-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elena Henderson updated ARROW-11469:
    Attachment: image-2021-05-03-14-31-41-260.png

> [Python] Performance degradation parquet reading of wide dataframes
> ---
>
> Key: ARROW-11469
> URL: https://issues.apache.org/jira/browse/ARROW-11469
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0
> Reporter: Axel G
> Priority: Minor
> Attachments: image-2021-05-03-14-31-41-260.png, profile_wide300.svg
>
>
> I noticed a relatively big performance degradation in version 1.0.0+ when
> trying to load wide dataframes.
> For example you should be able to reproduce by doing:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame(np.random.rand(100, 1))
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "temp.parquet")
> %timeit pd.read_parquet("temp.parquet"){code}
> In version 0.17.0, this takes about 300-400 ms and for anything above and
> including 1.0.0, this suddenly takes around 2 seconds.
>
> Thanks for looking into this.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (ARROW-11469) [Python] Performance degradation parquet reading of wide dataframes
[ https://issues.apache.org/jira/browse/ARROW-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-11469:
--
    Summary: [Python] Performance degradation parquet reading of wide dataframes  (was: [Python] Performance degradation wide dataframes)

> [Python] Performance degradation parquet reading of wide dataframes
> ---
>
> Key: ARROW-11469
> URL: https://issues.apache.org/jira/browse/ARROW-11469
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0
> Reporter: Axel G
> Priority: Minor
> Attachments: profile_wide300.svg
>
>
> I noticed a relatively big performance degradation in version 1.0.0+ when
> trying to load wide dataframes.
> For example you should be able to reproduce by doing:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame(np.random.rand(100, 1))
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "temp.parquet")
> %timeit pd.read_parquet("temp.parquet"){code}
> In version 0.17.0, this takes about 300-400 ms and for anything above and
> including 1.0.0, this suddenly takes around 2 seconds.
>
> Thanks for looking into this.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
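
The reproduction quoted in the issue relies on `%timeit`, which is an IPython magic and does not run in a plain Python script. The following is a minimal sketch of the same measurement using the standard-library `timeit` module, so it can be run unchanged against different pyarrow versions. The helper names `write_wide_parquet` and `best_of`, and the `n_rows` / `n_cols` parameters, are hypothetical and not part of the report; the defaults mirror the shape that appears in the quoted snippet, and `n_cols` can be raised to try wider frames.

```python
import timeit


def write_wide_parquet(path, n_rows=100, n_cols=1):
    """Write a random float table of shape (n_rows, n_cols) to `path`.

    Hypothetical helper: the defaults mirror the shape in the quoted snippet.
    """
    # Imported lazily so that best_of() below is usable without these packages.
    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame(np.random.rand(n_rows, n_cols))
    table = pa.Table.from_pandas(df)
    pq.write_table(table, path)


def best_of(func, repeat=5, number=1):
    """Return the fastest per-call wall time, in seconds, over `repeat` runs.

    Each run calls `func` `number` times; the run total is divided by `number`.
    """
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


if __name__ == "__main__":
    import pandas as pd

    write_wide_parquet("temp.parquet")
    seconds = best_of(lambda: pd.read_parquet("temp.parquet"))
    print(f"pd.read_parquet: {seconds * 1000:.1f} ms (best of 5)")
```

Taking the minimum over several runs is one common way to reduce noise from other processes; it is a design choice of this sketch, and the figures it prints are not directly comparable with the statistics `%timeit` reports.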