[ https://issues.apache.org/jira/browse/ARROW-13848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Keane updated ARROW-13848: ----------------------------------- Description: Is it expected that a scanning a dataset that has a filter built with {{and()}} is much slower than a filter built with {{and_kleene()}}? Specifically, it seems that {{and()}} triggers a scan of the full dataset, where as {{and_kleene()}} takes advantage of the fact that only one directory of the larger dataset needs to be scanned: {code:r} > library(arrow) Attaching package: ‘arrow’ The following object is masked from ‘package:utils’: timestamp > library(dplyr) > > ds <- open_dataset("~/repos/ab_store/data/taxi_parquet/", partitioning = > c("year", "month")) > > system.time({ + out <- ds %>% + filter(arrow_and(total_amount > 100, year == 2015)) %>% + select(tip_amount, total_amount, passenger_count) %>% + collect() + }) user system elapsed 46.634 4.462 6.457 > > system.time({ + out <- ds %>% + filter(arrow_and_kleene(total_amount > 100, year == 2015)) %>% + select(tip_amount, total_amount, passenger_count) %>% + collect() + }) user system elapsed 4.633 0.421 0.754 > {code} I suspect that it's scanning the whole dataset because if I use a dataset that only has the 2015 folder, I get similar speeds: {code:r} > ds <- open_dataset("~/repos/ab_store/data/taxi_parquet_2015/", partitioning = > c("year", "month")) > > system.time({ + out <- ds %>% + filter(arrow_and(total_amount > 100, year == 2015)) %>% + select(tip_amount, total_amount, passenger_count) %>% + collect() + }) user system elapsed 4.549 0.404 0.576 > > system.time({ + out <- ds %>% + filter(arrow_and_kleene(total_amount > 100, year == 2015)) %>% + select(tip_amount, total_amount, passenger_count) %>% + collect() + }) user system elapsed 4.477 0.412 0.585 {code} This does not impact anyone who uses our default collapsing mechanism in the R package, but I bumped into it with a filter that was constructed by duckdb using `and()` instead of `and_kleene()`. was: Is it expected that a scanning a dataset that has a filter built with {{and()}} is much slower than a filter built with {{and_kleene()}}? Specifically, it seems that {{and()}} triggers a scan of the full dataset, where as {{and_kleene()}} takes advantage of the fact that only one directory of the larger dataset needs to be scanned: {code:r} > library(arrow) Attaching package: ‘arrow’ The following object is masked from ‘package:utils’: timestamp > library(dplyr) > > ds <- open_dataset("~/repos/ab_store/data/taxi_parquet/", partitioning = > c("year", "month")) > > system.time({ + out <- ds %>% + filter(arrow_and(total_amount > 100, year == 2015)) %>% + select(tip_amount, total_amount, passenger_count) %>% + collect() + }) user system elapsed 46.634 4.462 6.457 > > system.time({ + out <- ds %>% + filter(arrow_and_kleene(total_amount > 100, year == 2015)) %>% + select(tip_amount, total_amount, passenger_count) %>% + collect() + }) user system elapsed 4.633 0.421 0.754 > {code} I suspect that it's scanning the whole dataset because if I use a dataset that only has the 2015 folder, I get similar speeds: {code:r} > ds <- open_dataset("~/repos/ab_store/data/taxi_parquet_2015/", partitioning = > c("year", "month")) > > system.time({ + out <- ds %>% + filter(arrow_and(total_amount > 100, year == 2015)) %>% + select(tip_amount, total_amount, passenger_count) %>% + collect() + }) user system elapsed 4.549 0.404 0.576 > > system.time({ + out <- ds %>% + filter(arrow_and_kleene(total_amount > 100, year == 2015)) %>% + select(tip_amount, total_amount, passenger_count) %>% + collect() + }) user system elapsed 4.477 0.412 0.585 {code} > [C++] and() in a dataset filter > ------------------------------- > > Key: ARROW-13848 > URL: https://issues.apache.org/jira/browse/ARROW-13848 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ > Reporter: Jonathan Keane > Priority: Major > > Is it expected that a scanning a dataset that has a filter built with > {{and()}} is much slower than a filter built with {{and_kleene()}}? > Specifically, it seems that {{and()}} triggers a scan of the full dataset, > where as {{and_kleene()}} takes advantage of the fact that only one directory > of the larger dataset needs to be scanned: > {code:r} > > library(arrow) > Attaching package: ‘arrow’ > The following object is masked from ‘package:utils’: > timestamp > > library(dplyr) > > > > ds <- open_dataset("~/repos/ab_store/data/taxi_parquet/", partitioning = > > c("year", "month")) > > > > system.time({ > + out <- ds %>% > + filter(arrow_and(total_amount > 100, year == 2015)) %>% > + select(tip_amount, total_amount, passenger_count) %>% > + collect() > + }) > user system elapsed > 46.634 4.462 6.457 > > > > system.time({ > + out <- ds %>% > + filter(arrow_and_kleene(total_amount > 100, year == 2015)) %>% > + select(tip_amount, total_amount, passenger_count) %>% > + collect() > + }) > user system elapsed > 4.633 0.421 0.754 > > > {code} > I suspect that it's scanning the whole dataset because if I use a dataset > that only has the 2015 folder, I get similar speeds: > {code:r} > > ds <- open_dataset("~/repos/ab_store/data/taxi_parquet_2015/", partitioning > > = c("year", "month")) > > > > system.time({ > + out <- ds %>% > + filter(arrow_and(total_amount > 100, year == 2015)) %>% > + select(tip_amount, total_amount, passenger_count) %>% > + collect() > + }) > user system elapsed > 4.549 0.404 0.576 > > > > system.time({ > + out <- ds %>% > + filter(arrow_and_kleene(total_amount > 100, year == 2015)) %>% > + select(tip_amount, total_amount, passenger_count) %>% > + collect() > + }) > user system elapsed > 4.477 0.412 0.585 > {code} > This does not impact anyone who uses our default collapsing mechanism in the > R package, but I bumped into it with a filter that was constructed by duckdb > using `and()` instead of `and_kleene()`. -- This message was sent by Atlassian Jira (v8.3.4#803005)