[ https://issues.apache.org/jira/browse/ARROW-15726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weston Pace updated ARROW-15726:
--------------------------------
    Component/s: C++
                     (was: R)

> [R] Support push-down projection/filtering in datasets / dplyr
> --------------------------------------------------------------
>
>                 Key: ARROW-15726
>                 URL: https://issues.apache.org/jira/browse/ARROW-15726
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>
> The following query should read a single column from the target parquet file.
> {noformat}
> open_dataset("lineitem.parquet") %>% select(l_tax) %>% filter(l_tax < 0.01) %>% collect()
> {noformat}
> Furthermore, it should apply a pushdown filter to the source node, allowing parquet row groups to potentially filter out target data.
> At the moment it creates the following exec plan:
> {noformat}
> 3:SinkNode{}
> 2:ProjectNode{projection=[l_tax]}
> 1:FilterNode{filter=(l_tax < 0.01)}
> 0:SourceNode{}
> {noformat}
> There is no projection or filter in the source node. As a result, we end up reading much more data from disk (the entire file) than we need to (at most a single column).
> This _could_ be fixed via heuristics in the dplyr code. However, that may quickly get complex. For example, since the project comes after the filter, the pushed-down projection must include both the columns accessed by the filter and the columns accessed by the projection. Other questions arise too: can you push a projection down through a join (yes, you can), and how do you know which columns to apply to which source node?
> A more complete fix would be to call into some kind of 3rd-party optimizer (e.g. Calcite) after the plan has been created by dplyr but before it is passed to the execution engine.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)