[ https://issues.apache.org/jira/browse/ARROW-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298172#comment-17298172 ]
Ben Kietzman commented on ARROW-7224:
-------------------------------------

Also worth noting: the case discussed in this thread (of a filter which references each partition field exactly once and specifies an equality condition for each) corresponds to a single subdirectory which needs to be scanned. This is not the case for all filters, but it would be possible to add a special case when such a prefix can be extracted. This would require that the {{Partitioning}} be explicitly constructed (so that we know without inspection of paths what partition fields are in play), but that's fairly straightforward.

> [C++][Dataset] Partition level filters should be able to provide filtering to file systems
> ------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7224
>                 URL: https://issues.apache.org/jira/browse/ARROW-7224
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Micah Kornfield
>            Priority: Major
>              Labels: dataset
>
> When providing a filter for partitions, it should be possible in some cases to use it to optimize file system list calls. This can greatly improve the speed of reading data from partitions because fewer directories/files need to be explored/expanded. I've fallen behind on the dataset code, but I want to make sure this issue is tracked someplace. This came up in the SO question linked below (feel free to correct my analysis if I missed the functionality someplace).
>
> Reference:
> [https://stackoverflow.com/questions/58868584/pyarrow-parquetdataset-read-is-slow-on-a-hive-partitioned-s3-dataset-despite-u/58951477#58951477]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
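
As a rough illustration of the special case described in the comment above, the following is a minimal sketch in plain Python, not Arrow's implementation. It assumes hive-style directory names ({{field=value}}), an explicitly supplied, ordered list of partition field names, and a filter expressed as a conjunction of hypothetical {{(field, op, value)}} tuples. The function name {{extract_partition_prefix}} and the tuple format are invented for this example. Ignoring terms on non-partition columns is also an assumption of the sketch; such terms would still have to be applied when reading the files.

```python
from typing import Any, Optional, Sequence, Tuple

# Hypothetical filter representation: a conjunction of (field, op, value) terms.
Term = Tuple[str, str, Any]


def extract_partition_prefix(
    partition_fields: Sequence[str], terms: Sequence[Term]
) -> Optional[str]:
    """Return the single hive-style subdirectory selected by the filter, or None.

    A prefix is returned only when every partition field is referenced
    exactly once and always with an equality condition.
    """
    equalities = {}
    for field, op, value in terms:
        if field not in partition_fields:
            # Assumption: terms on non-partition columns do not change
            # which directory has to be listed.
            continue
        if op != "==" or field in equalities:
            # A range condition or a repeated field: no single directory.
            return None
        equalities[field] = value

    if set(equalities) != set(partition_fields):
        # Some partition field is unconstrained, so several
        # subdirectories may match.
        return None

    # The directory order follows the partitioning, not the filter.
    return "/".join(f"{name}={equalities[name]}" for name in partition_fields)


if __name__ == "__main__":
    fields = ["year", "month"]
    print(extract_partition_prefix(fields, [("month", "==", 3), ("year", "==", 2020)]))
    print(extract_partition_prefix(fields, [("year", "==", 2020)]))
```

When a prefix is returned, only that subdirectory would need to be listed. When the function returns None, the caller would fall back to full discovery followed by partition filtering.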