[ https://issues.apache.org/jira/browse/ARROW-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298181#comment-17298181 ]

Micah Kornfield commented on ARROW-7224:
----------------------------------------

{quote}Also worth noting: the case discussed in this thread (of a filter which 
references each partition field exactly once and specifies an equality 
condition for each) corresponds to a single subdirectory which needs to be 
scanned. This is not the case for all filters, but it would be possible to add 
a special case when such a prefix can be extracted. This would require the 
{{Partitioning}} be explicitly constructed (so that we know without inspection 
of paths what partition fields are in play), but that's fairly straightforward.
{quote}
Agreed.  I'll note that there is actually an interaction with the file system 
here as well, but specializing for equality is a starting point.  For 
instance, a >= bound on partition keys is achievable for S3, as is a range 
lb <= column <= ub.  But a strict less-than (or less-than-or-equal) bound 
would not achieve the same efficiencies.  FWIW, Spark has APIs for push-down 
predicates that allow a source to tell it which predicates can be pushed down 
effectively and which need to be handled by the engine (i.e. using compute 
kernels).  A similar abstraction might be useful here.
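To make the two cases concrete, here is a minimal Python sketch (hypothetical helper names, not the dataset API): equality filters on a leading run of partition fields collapse to a single hive-style directory prefix, and a >= bound on the first partition field maps to a seek into a sorted key listing, analogous to S3's StartAfter.

```python
from bisect import bisect_left


def hive_prefix(partition_fields, equality_filters):
    """Longest hive-style directory prefix implied by equality filters.

    With partition fields ["year", "month", "day"] and filters
    {"year": 2019, "month": 11}, only the single subdirectory
    "year=2019/month=11/" needs to be listed.  Extraction stops at the
    first unconstrained field, since deeper fields cannot narrow the
    prefix further."""
    parts = []
    for field in partition_fields:
        if field not in equality_filters:
            break
        parts.append(f"{field}={equality_filters[field]}")
    return "/".join(parts) + "/" if parts else ""


def keys_at_or_above(sorted_keys, field, lower_bound):
    """Apply a '>=' bound on the leading partition field by seeking to a
    start position in a lexicographically sorted key listing, as S3's
    ListObjectsV2 StartAfter parameter allows server-side.  Note this
    relies on lexicographic order, so numeric partition values need a
    fixed-width (zero-padded) encoding to sort correctly."""
    i = bisect_left(sorted_keys, f"{field}={lower_bound}")
    return sorted_keys[i:]
```

A strict < bound has no analogous server-side seek into the listing, which is why a source would want a way to report which predicates it can push down and which must fall back to compute kernels.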

> [C++][Dataset] Partition level filters should be able to provide filtering to 
> file systems
> ------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7224
>                 URL: https://issues.apache.org/jira/browse/ARROW-7224
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Micah Kornfield
>            Priority: Major
>              Labels: dataset
>
> When providing a filter for partitions, it should be possible in some cases 
> to use it to optimize file system list calls.  This can greatly improve the 
> speed of reading data from partitions because fewer directories/files need 
> to be explored/expanded.  I've fallen behind on the dataset code, but I want 
> to make sure this issue is tracked someplace.  This came up in the SO 
> question linked below (feel free to correct my analysis if I missed the 
> functionality someplace).
> Reference: 
> [https://stackoverflow.com/questions/58868584/pyarrow-parquetdataset-read-is-slow-on-a-hive-partitioned-s3-dataset-despite-u/58951477#58951477]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)