[ https://issues.apache.org/jira/browse/ARROW-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551652#comment-17551652 ]
Weston Pace edited comment on ARROW-16775 at 6/8/22 2:38 PM:
-------------------------------------------------------------

What results do you get at a slightly smaller scale (e.g. {{10**8}})? I get an out-of-memory error at {{10**9}} (that is ~16 GB of data, which is the limit on my system). At {{10**8}} I get the following timings:

{noformat}
table_of_whole_file: 0.6154186725616455
table_of_batches: 0.8553369045257568
table_of_one_batch: 0.6191871166229248
{noformat}

I wonder if the problem is that we are hitting swap and {{table_of_whole_file}} performs poorly when using swap. I'm not sure how much we want to optimize for that case vs. suggesting the data be consumed iteratively (see the sketch at the end of this message).

> pyarrow's read_table is way slower than iter_batches
> ----------------------------------------------------
>
>                 Key: ARROW-16775
>                 URL: https://issues.apache.org/jira/browse/ARROW-16775
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, Python
>    Affects Versions: 8.0.0
>         Environment: pyarrow 8.0.0
> pandas 1.4.2
> numpy 1.22.4
> python 3.9
> I reproduced this behaviour on two machines:
> * MacBook Pro with M1 Max (32 GB) and CPython 3.9.12 from conda miniforge
> * PyTorch Docker container on a standard Linux machine
>            Reporter: Satoshi Nakamoto
>            Priority: Critical
>
> Hi!
> Loading a table created from a DataFrame with `pyarrow.parquet.read_table()` takes 3x as much time as loading it as batches with:
>
> {code:python}
> pyarrow.Table.from_batches(
>     list(pyarrow.parquet.ParquetFile("file.parquet").iter_batches())
> )
> {code}
>
> h4. Minimal example
>
> {code:python}
> import pandas as pd
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> df = pd.DataFrame(
>     {
>         "a": np.random.random(10**9),
>         "b": np.random.random(10**9),
>     }
> )
> df.to_parquet("file.parquet")
>
> table_of_whole_file = pq.read_table("file.parquet")
>
> table_of_batches = pa.Table.from_batches(
>     list(
>         pq.ParquetFile("file.parquet").iter_batches()
>     )
> )
>
> table_of_one_batch = pa.Table.from_batches(
>     [
>         next(pq.ParquetFile("file.parquet")
>              .iter_batches(batch_size=10**9))
>     ]
> )
> {code}
>
> _table_of_batches_ read time is 11.5 s, while _table_of_whole_file_ read time is 33.2 s. Loading the table as one batch (_table_of_one_batch_) is slightly faster still: 9.8 s.
>
> h4. Parquet file metadata
>
> {noformat}
> <pyarrow._parquet.FileMetaData object at 0x129ab83b0>
>   created_by: parquet-cpp-arrow version 8.0.0
>   num_columns: 2
>   num_rows: 1000000000
>   num_row_groups: 15
>   format_version: 1.0
>   serialized_size: 5680
> {noformat}
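The metadata dump quoted above can be reproduced directly; {{pq.read_metadata}} is an existing pyarrow API, and "file.parquet" is the file from the minimal example:

{code:python}
# Reproduce the FileMetaData summary quoted in the issue description.
import pyarrow.parquet as pq

meta = pq.read_metadata("file.parquet")
print(meta)  # created_by, num_columns, num_rows, num_row_groups, ...

# Per-row-group detail, e.g. to see how the 15 row groups are sized.
for i in range(meta.num_row_groups):
    print(i, meta.row_group(i).num_rows)
{code}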
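The ticket does not show the harness behind the 11.5 s / 33.2 s / 9.8 s numbers. A minimal sketch that would produce timings in the format quoted in the comment above, assuming plain wall-clock measurement:

{code:python}
# A sketch of a timing harness for the three read paths; the actual
# measurement code is not shown in the ticket, so this is an assumption.
import time
import pyarrow as pa
import pyarrow.parquet as pq

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start}")

timed("table_of_whole_file", lambda: pq.read_table("file.parquet"))
timed("table_of_batches", lambda: pa.Table.from_batches(
    list(pq.ParquetFile("file.parquet").iter_batches())))
timed("table_of_one_batch", lambda: pa.Table.from_batches(
    [next(pq.ParquetFile("file.parquet").iter_batches(batch_size=10**9))]))
{code}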
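The sketch referenced in the comment at the top: consuming the file batch by batch keeps roughly one batch resident at a time instead of materializing the whole table, which sidesteps the suspected swap pressure. The {{process}} callback and the batch size are hypothetical placeholders, not part of pyarrow:

{code:python}
# Iterative-consumption sketch: only ~one batch is in memory at a time,
# so peak usage stays near the batch size instead of the full table.
import pyarrow.parquet as pq

def process(batch):
    # Hypothetical placeholder: aggregate, filter, or write the batch out.
    print(batch.num_rows)

pf = pq.ParquetFile("file.parquet")
for batch in pf.iter_batches(batch_size=2**20):  # batch size is an assumption
    process(batch)
{code}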