[ https://issues.apache.org/jira/browse/ARROW-15312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated ARROW-15312:
-----------------------------------
    Priority: Major  (was: Blocker)

> [R][C++] filtering a Parquet dataset with is.na() misses some rows
> ------------------------------------------------------------------
>
>                 Key: ARROW-15312
>                 URL: https://issues.apache.org/jira/browse/ARROW-15312
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 6.0.1
>         Environment: R 4.1.2 on Windows
>                      arrow 6.0.1
>                      dplyr 1.0.7
>            Reporter: Pierre Gramme
>            Priority: Major
>             Fix For: 7.0.1, 8.0.0
>
>
> Hi !
> I just found an issue when querying an Arrow dataset with dplyr, filtering on is.na(...)
> It seems linked to columns containing only one distinct value and some NA's.
> Can you also reproduce the following?
>
> {code:java}
> library(arrow)
> library(dplyr)
>
> ds_path = "test-arrow-na"
> df = tibble(x=1:3, y=c(0L, 0L, NA_integer_), z=c(0L, 1L, NA_integer_))
>
> df %>% arrow::write_dataset(ds_path)
>
> # OK: Collect then filter: returns row 3, as expected
> arrow::open_dataset(ds_path) %>% collect() %>% filter(is.na(y))
> # ERROR: Filter then collect (on y) returns a tibble with no row
> arrow::open_dataset(ds_path) %>% filter(is.na(y)) %>% collect()
>
> # OK: Filter then collect (on z) returns row 3, as expected
> arrow::open_dataset(ds_path) %>% filter(is.na(z)) %>% collect() {code}
>
> Thanks
> Pierre



--
This message was sent by Atlassian Jira
(v8.20.1#820001)