[ https://issues.apache.org/jira/browse/ARROW-15724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yin updated ARROW-15724:
------------------------
    Summary: [C++] Reduce directory and file IO when reading partition parquet dataset with partition key filters  (was: [C++] Reduce directory and file IO when reading partition parquet dataset)

> [C++] Reduce directory and file IO when reading partition parquet dataset with partition key filters
> ----------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-15724
>                 URL: https://issues.apache.org/jira/browse/ARROW-15724
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Yin
>            Priority: Major
>              Labels: dataset
>         Attachments: pq.py
>
>
> Hi,
> It seems that Arrow accesses all partition directories (and even each
> Parquet file), including those that clearly do not match the partition key
> values in the filter criteria. This can make reading with partition key
> filters several times slower than reading one partition directly,
> especially on a network file system, and also on a local file system when
> there are lots of partitions, e.g. a tenth of a second vs. seconds.
>
> I attached some Python code (pq.py) that creates an example dataframe and
> saves Parquet datasets with different hive partition structures (/y=/m=/d=,
> or /y=/m=, or /dk=). It then reads the datasets with and without filters to
> reproduce the issue. Observe the run time, and the directories and files
> accessed by the process, in Process Monitor on Windows.
>
> For all three partition structures, I saw in Process Monitor that all
> directories are accessed regardless of whether use_legacy_dataset is True
> or False. When use_legacy_dataset=False, the Parquet files in all
> directories were opened and closed.
>
> Passing validate_schema=False made a small difference in run time, but the
> partition directories are still opened. That argument is only supported
> when use_legacy_dataset=True, and the pandas read_parquet wrapper API does
> not support it or pass it through.
>
> The /y=/m= layout is faster because it has no daily partitions, so there
> are fewer directories and files.
>
> There is a related Stack Overflow question with an example
> [https://stackoverflow.com/questions/66339381/pyarrow-read-single-file-from-partitioned-parquet-dataset-is-unexpectedly-slow]
> which has this comment on partition discovery:
> {quote}It should get discovered automatically. pd.read_parquet calls
> pyarrow.parquet.read_table and the default partitioning behavior should be
> to discover hive-style partitions (i.e. the ones you have). The fact that
> you have to specify this means that discovery is failing. If you could
> create a reproducible example and submit it to Arrow JIRA it would be
> helpful.
> – Pace Feb 24 2021 at 18:55
> {quote}
> I wonder whether there is already a related Jira issue here.
> I tried passing in the partitioning argument, but it didn't help.
> The pyarrow versions used were 1.0.1, 5, and 7.
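
To make the requested behavior concrete, here is a minimal standard-library sketch of pruning hive-style partition directories during discovery, so that directories whose key=value name cannot match the filter are never listed. It is not Arrow's implementation and does not use pyarrow. The names find_files and make_demo_tree, the part-0.parquet placeholder files, and the equality-only filter format are all hypothetical and chosen for illustration.

```python
import os
import tempfile


def find_files(root, wanted):
    """Return (files, listed_dirs) for a hive-partitioned tree under root.

    wanted maps a partition key to the set of accepted values (as strings).
    A directory named key=value is skipped, without being listed, when the
    key is in wanted and the value is not accepted. Keys that are absent
    from wanted are not filtered.
    """
    files, listed_dirs = [], []
    stack = [root]
    while stack:
        current = stack.pop()
        listed_dirs.append(current)  # record every directory actually listed
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir():
                    key, sep, value = entry.name.partition("=")
                    if sep and key in wanted and value not in wanted[key]:
                        continue  # pruned: never descend into this partition
                    stack.append(entry.path)
                elif entry.name.endswith(".parquet"):
                    files.append(entry.path)
    return files, listed_dirs


def make_demo_tree(root):
    """Create an empty /y=/m=/d= layout: 2 years x 2 months x 2 days."""
    for y in ("2020", "2021"):
        for m in ("1", "2"):
            for d in ("1", "2"):
                leaf = os.path.join(root, f"y={y}", f"m={m}", f"d={d}")
                os.makedirs(leaf)
                # Empty placeholder; a real dataset would hold Parquet data.
                open(os.path.join(leaf, "part-0.parquet"), "wb").close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        make_demo_tree(tmp)
        # Filter on all three partition keys.
        files, listed = find_files(tmp, {"y": {"2021"}, "m": {"2"}, "d": {"1"}})
        print(len(files), "file(s) found,", len(listed), "directories listed")
        # No filter: every directory has to be listed.
        files, listed = find_files(tmp, {})
        print(len(files), "file(s) found,", len(listed), "directories listed")
```

On the demo tree, the filtered call lists 4 directories (the root plus one per partition level) where the unfiltered call lists all 15. The same comparison can be made against a real dataset by watching which directories a reader opens, as described in the report.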