[ 
https://issues.apache.org/jira/browse/ARROW-15724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-15724:
------------------------
    Summary: [C++] Reduce directory and file IO when reading partition parquet 
dataset with partition key filters  (was: [C++] Reduce directory and file IO 
when reading partition parquet dataset)

> [C++] Reduce directory and file IO when reading partition parquet dataset 
> with partition key filters
> ----------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-15724
>                 URL: https://issues.apache.org/jira/browse/ARROW-15724
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Yin
>            Priority: Major
>              Labels: dataset
>         Attachments: pq.py
>
>
> Hi,
> It seems that Arrow accesses all partitions directories (and even each 
> parquet files), including those clearly not matching with the partition key 
> values in the filter criteria. This may cause multiple time of difference 
> between accessing one partition directly vs accessing with partition key 
> filters, 
> specially on Network file system, and on local file system when there are 
> lots of partitions, e.g. 1/10th of second vs seconds.
> Attached some Python code to create example dataframe and save parquet 
> datasets with different hive partition structures (/y=/m=/d=, or /y=/m=, or 
> /dk=). And read the datasets with/without filters to reproduce the issue. 
> Observe the run time, and the directories and files which are accessed by the 
> process in Process Monitor on Windows.
> In the three partition structures, I saw in Process Monitor that all 
> directories are accessed regardless of use_legacy_dataset=True or False. 
> When use_legacy_dataset=False, the parquet files in all directories were 
> opened and closed.  
> The argument validate_schema=False made small time difference, but still 
> opens the partition directories, and it's only supported when 
> use_legacy_dataset=True, and not supported/passed in from pandas read_parquet 
> wrapper API. 
> The /y=/m= is faster because there is no daily partition so less directories 
> and files.
> There was a related another stackoverflow question and example 
> [https://stackoverflow.com/questions/66339381/pyarrow-read-single-file-from-partitioned-parquet-dataset-is-unexpectedly-slow]
> and there was a comment on the partition discovery:
> {quote}It should get discovered automatically. pd.read_parquet calls 
> pyarrow.parquet.read_table and the default partitioning behavior should be to 
> discover hive-style partitions (i.e. the ones you have). The fact that you 
> have to specify this means that discovery is failing. If you could create a 
> reproducible example and submit it to Arrow JIRA it would be helpful. 
> – Pace  Feb 24 2021 at 18:55"
> {quote}
> Wonder if there were some related Jira here already.
> I tried passing in partitioning argument, but it didn't help. 
> The version of pyarrow used were 1.01, 5, and 7.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to