[ https://issues.apache.org/jira/browse/ARROW-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551652#comment-17551652 ]
Weston Pace edited comment on ARROW-16775 at 6/8/22 2:38 PM:
-------------------------------------------------------------

What results do you get at a slightly smaller scale (e.g. {{10**8}})? I get an out-of-memory error at {{10**9}} (that is ~16 GB of data, which is the limit on my system). At {{10**8}} I get the following timings:

{noformat}
table_of_whole_file: 0.6154186725616455
table_of_batches: 0.8553369045257568
table_of_one_batch: 0.6191871166229248
{noformat}

I wonder if the problem is that we are hitting swap and {{table_of_whole_file}} performs poorly when using swap. I'm not sure how much we want to optimize for that case vs. suggesting the data be consumed iteratively (see the sketch at the end of this message).

> pyarrow's read_table is way slower than iter_batches
> ----------------------------------------------------
>
>                 Key: ARROW-16775
>                 URL: https://issues.apache.org/jira/browse/ARROW-16775
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, Python
>    Affects Versions: 8.0.0
>         Environment: pyarrow 8.0.0
> pandas 1.4.2
> numpy 1.22.4
> python 3.9
> I reproduced this behaviour on two machines:
> * MacBook Pro with M1 Max (32 GB) and CPython 3.9.12 from conda miniforge
> * PyTorch Docker container on a standard Linux machine
>            Reporter: Satoshi Nakamoto
>            Priority: Critical
>
> Hi!
> Loading a table created from a DataFrame with `pyarrow.parquet.read_table()` takes 3x as much time as loading it as batches with:
>
> {code:python}
> pyarrow.Table.from_batches(
>     list(pyarrow.parquet.ParquetFile("file.parquet").iter_batches())
> )
> {code}
>
> h4. Minimal example
>
> {code:python}
> import pandas as pd
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> df = pd.DataFrame(
>     {
>         "a": np.random.random(10**9),
>         "b": np.random.random(10**9),
>     }
> )
> df.to_parquet("file.parquet")
>
> table_of_whole_file = pq.read_table("file.parquet")
>
> table_of_batches = pa.Table.from_batches(
>     list(
>         pq.ParquetFile("file.parquet").iter_batches()
>     )
> )
>
> table_of_one_batch = pa.Table.from_batches(
>     [
>         next(pq.ParquetFile("file.parquet")
>              .iter_batches(batch_size=10**9))
>     ]
> )
> {code}
>
> _table_of_batches_ read time is 11.5 s, while _table_of_whole_file_ read time is 33.2 s. Loading the table as one batch (_table_of_one_batch_) is slightly faster still: 9.8 s.
>
> h4. Parquet file metadata
>
> {noformat}
> <pyarrow._parquet.FileMetaData object at 0x129ab83b0>
>   created_by: parquet-cpp-arrow version 8.0.0
>   num_columns: 2
>   num_rows: 1000000000
>   num_row_groups: 15
>   format_version: 1.0
>   serialized_size: 5680
> {noformat}
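The metadata dump quoted above can be reproduced directly; {{pq.read_metadata}} is an existing pyarrow API, and "file.parquet" is the file from the minimal example:

{code:python}
# Reproduce the FileMetaData summary quoted in the issue description.
import pyarrow.parquet as pq

meta = pq.read_metadata("file.parquet")
print(meta)  # created_by, num_columns, num_rows, num_row_groups, ...

# Per-row-group detail, e.g. to see how the 15 row groups are sized.
for i in range(meta.num_row_groups):
    print(i, meta.row_group(i).num_rows)
{code}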
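The ticket does not show the harness behind the 11.5 s / 33.2 s / 9.8 s numbers. A minimal sketch that would produce timings in the format quoted in the comment above, assuming plain wall-clock measurement:

{code:python}
# A sketch of a timing harness for the three read paths; the actual
# measurement code is not shown in the ticket, so this is an assumption.
import time
import pyarrow as pa
import pyarrow.parquet as pq

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start}")

timed("table_of_whole_file", lambda: pq.read_table("file.parquet"))
timed("table_of_batches", lambda: pa.Table.from_batches(
    list(pq.ParquetFile("file.parquet").iter_batches())))
timed("table_of_one_batch", lambda: pa.Table.from_batches(
    [next(pq.ParquetFile("file.parquet").iter_batches(batch_size=10**9))]))
{code}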
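The sketch referenced in the comment at the top: consuming the file batch by batch keeps roughly one batch resident at a time instead of materializing the whole table, which sidesteps the suspected swap pressure. The {{process}} callback and the batch size are hypothetical placeholders, not part of pyarrow:

{code:python}
# Iterative-consumption sketch: only ~one batch is in memory at a time,
# so peak usage stays near the batch size instead of the full table.
import pyarrow.parquet as pq

def process(batch):
    # Hypothetical placeholder: aggregate, filter, or write the batch out.
    print(batch.num_rows)

pf = pq.ParquetFile("file.parquet")
for batch in pf.iter_batches(batch_size=2**20):  # batch size is an assumption
    process(batch)
{code}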