Jonathan Keane created ARROW-13848: -------------------------------------- Summary: [C++] and() in a dataset filter Key: ARROW-13848 URL: https://issues.apache.org/jira/browse/ARROW-13848 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Jonathan Keane
Is it expected that a scanning a dataset that has a filter built with {{and()}} is much slower than a filter built with {{and_kleene()}}? Specifically, it seems that {{and()}} triggers a scan of the full dataset, where as {{and_kleene()}} takes advantage of the fact that only one directory of the larger dataset needs to be scanned: {code:r} > library(arrow) Attaching package: ‘arrow’ The following object is masked from ‘package:utils’: timestamp > library(dplyr) > > ds <- open_dataset("~/repos/ab_store/data/taxi_parquet/", partitioning = > c("year", "month")) > > system.time({ + out <- ds %>% + filter(arrow_and(total_amount > 100, year == 2015)) %>% + select(tip_amount, total_amount, passenger_count) %>% + collect() + }) user system elapsed 46.634 4.462 6.457 > > system.time({ + out <- ds %>% + filter(arrow_and_kleene(total_amount > 100, year == 2015)) %>% + select(tip_amount, total_amount, passenger_count) %>% + collect() + }) user system elapsed 4.633 0.421 0.754 > {code} I suspect that it's scanning the whole dataset because if I use a dataset that only has the 2015 folder, I get similar speeds: {code:r} > ds <- open_dataset("~/repos/ab_store/data/taxi_parquet_2015/", partitioning = > c("year", "month")) > > system.time({ + out <- ds %>% + filter(arrow_and(total_amount > 100, year == 2015)) %>% + select(tip_amount, total_amount, passenger_count) %>% + collect() + }) user system elapsed 4.549 0.404 0.576 > > system.time({ + out <- ds %>% + filter(arrow_and_kleene(total_amount > 100, year == 2015)) %>% + select(tip_amount, total_amount, passenger_count) %>% + collect() + }) user system elapsed 4.477 0.412 0.585 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)