[jira] [Commented] (ARROW-4470) [Python] Pyarrow using considerable more memory when reading partitioned Parquet file

Rok Mihevc (Jira) Tue, 10 Jan 2023 23:56:02 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-4470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17661491#comment-17661491
 ]


Rok Mihevc commented on ARROW-4470:
-----------------------------------

This issue has been migrated to [issue 
#21027|https://github.com/apache/arrow/issues/21027] on GitHub. Please see the 
[migration documentation|https://github.com/apache/arrow/issues/14542] for 
further details.

> [Python] Pyarrow using considerable more memory when reading partitioned 
> Parquet file
> -------------------------------------------------------------------------------------
>
>                 Key: ARROW-4470
>                 URL: https://issues.apache.org/jira/browse/ARROW-4470
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.12.0
>            Reporter: Ivan SPM
>            Priority: Major
>              Labels: dataset, datasets, parquet
>             Fix For: 0.16.0
>
>
> Hi,
> I have a partitioned Parquet table in Impala in HDFS, using Hive metastore, 
> with the following structure:
> {{/data/myparquettable/year=2016}}
> {{/data/myparquettable/year=2016/myfile_1.prt}}
> {{/data/myparquettable/year=2016/myfile_2.prt}}
> {{/data/myparquettable/year=2016/myfile_3.prt}}
> {{/data/myparquettable/year=2017}}
> {{/data/myparquettable/year=2017/myfile_1.prt}}
> {{/data/myparquettable/year=2017/myfile_2.prt}}
> {{/data/myparquettable/year=2017/myfile_3.prt}}
> and so on. I need to work with one partition, so I copied one partition to a 
> local filesystem:
> {{hdfs fs -get /data/myparquettable/year=2017 /local/}}
> so now I have some data on the local disk:
> {{/local/year=2017/myfile_1.prt}}
> {{/local/year=2017/myfile_2.prt}}
> etc. I tried to read it using Pyarrow:
> {{import pyarrow.parquet as pq}}
> {{pq.read_parquet('/local/year=2017')}}
> and it starts reading. The problem is that the local Parquet files are around 
> 15GB total, and I blew up my machine memory a couple of times because when 
> reading these files, Pyarrow is using more than 60GB of RAM, and I'm not sure 
> how much it will take because it never finishes. Is this expected? Is there a 
> workaround?
>  
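
The report asks whether a workaround exists. One possible approach, sketched here under stated assumptions and not taken from the issue or from the eventual fix, is to walk the copied partition directory and read each file in bounded record batches instead of loading everything in one call. The helper names (parse_partition_keys, iter_partition_files, iter_record_batches) and the default batch size are hypothetical. The sketch assumes a pyarrow release recent enough to provide ParquetFile.iter_batches, and that data files end in .prt as in the report. Peak memory still depends on the row group size of the files, so this may reduce memory use but is not guaranteed to.

```python
import os
from typing import Dict, Iterator, Tuple


def parse_partition_keys(path: str) -> Dict[str, str]:
    """Collect Hive-style key=value directory components from a path."""
    keys: Dict[str, str] = {}
    for part in os.path.normpath(path).split(os.sep):
        if "=" in part:
            key, _, value = part.partition("=")
            keys[key] = value
    return keys


def iter_partition_files(
    root: str, suffix: str = ".prt"
) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (file path, partition keys) for each data file under root.

    Directories and files are visited in sorted order so the result is
    deterministic. Files without the given suffix are skipped.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort controls os.walk traversal order
        for name in sorted(filenames):
            if name.endswith(suffix):
                full = os.path.join(dirpath, name)
                yield full, parse_partition_keys(os.path.dirname(full))


def iter_record_batches(root: str, batch_size: int = 64_000):
    """Yield (partition keys, RecordBatch) pairs, one batch at a time.

    Requires pyarrow (third-party), imported lazily so the helpers above
    remain usable without it. batch_size is an illustrative default.
    """
    import pyarrow.parquet as pq

    for path, keys in iter_partition_files(root):
        parquet_file = pq.ParquetFile(path)
        for batch in parquet_file.iter_batches(batch_size=batch_size):
            yield keys, batch
```

A caller would loop over iter_record_batches('/local/year=2017') and aggregate or write out each batch before moving to the next, so that no more than one batch per file needs to be held by the caller at a time.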



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
