Neal Richardson created ARROW-18286:
---------------------------------------

Summary: [R] Troubles with using augmented columns
Key: ARROW-18286
URL: https://issues.apache.org/jira/browse/ARROW-18286
Project: Apache Arrow
Issue Type: Bug
Components: C++, R
Reporter: Neal Richardson

We can project to add augmented fields like {{__filename}}, but there are a few catches. Given:

{code}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
ds <- InMemoryDataset$create(mtcars) %>%
  mutate(f = add_filename())
show_query(ds)
#> ExecPlan with 3 nodes:
#> 2:SinkNode{}
#>   1:ProjectNode{projection=[mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb, "f": __filename]}
#>     0:SourceNode{}
collect(ds)
#>    mpg cyl  disp  hp drat    wt  qsec vs am gear carb         f
#> 1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 in-memory
#> 2 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4 in-memory
#> 3 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1 in-memory
...
{code}

Issue #1: you can't filter on that column because (my theory, based on the evidence) the ScanNode takes a projection and filter, but the filter is not evaluated with the augmented schema, so it doesn't find {{__filename}}. This seems fixable in C++.

{code}
ds %>%
  filter(f == "in-memory") %>%
  collect()
#> Error in `collect()`:
#> ! Invalid: No match for FieldRef.Name(__filename) in mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> ℹ `add_filename()` or use of the `__filename` augmented field can only be used with with Dataset objects, and can only be added before doing an aggregation or a join.
#> Backtrace:
#>      ▆
#>   1. ├─ds %>% filter(f == "in-memory") %>% collect()
#>   2. ├─dplyr::collect(.)
#>   3. └─arrow:::collect.arrow_dplyr_query(.)
#>   4.   └─base::tryCatch(...)
#>   5.     └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>   6.       └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>   7.         └─value[[3L]](cond)
#>   8.           └─arrow:::augment_io_error_msg(e, call, schema = x$.data$schema)
#>   9.             └─arrow:::handle_augmented_field_misuse(msg, call)
#>  10.               └─rlang::abort(msg, call = call)
{code}

Proof that it is in the ScanNode: If we {{collapse()}} the query after projecting to include filename but before the filter, the filter doesn't get included in the ScanNode, it's only applied after, as a FilterNode. This works:

{code}
ds %>%
  collapse() %>%
  filter(f == "in-memory") %>%
  collect()
#>    mpg cyl  disp  hp drat    wt  qsec vs am gear carb         f
#> 1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 in-memory
#> 2 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4 in-memory
#> 3 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1 in-memory
...
{code}

A related failure mode: you have to first project to include the augmented column, you can't just include it in a filter:

{code}
InMemoryDataset$create(mtcars) %>%
  filter(add_filename() == "in-memory") %>%
  collect()
#> Error in `collect()`:
#> ! Invalid: No match for FieldRef.Name(__filename) in mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> ℹ `add_filename()` or use of the `__filename` augmented field can only be used with with Dataset objects, and can only be added before doing an aggregation or a join.
#> Backtrace:
#>      ▆
#>   1. ├─... %>% collect()
#>   2. ├─dplyr::collect(.)
#>   3. └─arrow:::collect.arrow_dplyr_query(.)
#>   4.   └─base::tryCatch(...)
#>   5.     └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>   6.       └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>   7.         └─value[[3L]](cond)
#>   8.           └─arrow:::augment_io_error_msg(e, call, schema = x$.data$schema)
#>   9.             └─arrow:::handle_augmented_field_misuse(msg, call)
#>  10.               └─rlang::abort(msg, call = call)
{code}

Issue #2, following on that: you can only add the augmented fields at the start of the query, something that goes in the ScanNode.
This seems like something we would have to catch in R and error at the time add_filename() is called. That could probably be covered in ARROW-17356.

{code}
InMemoryDataset$create(mtcars) %>%
  collapse() %>%
  collapse() %>%
  filter(add_filename() == "in-memory") %>%
  collect()
#> Error in `collect()`:
#> ! Invalid: No match for FieldRef.Name(__filename) in mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> ℹ `add_filename()` or use of the `__filename` augmented field can only be used with with Dataset objects, and can only be added before doing an aggregation or a join.
#> Backtrace:
#>      ▆
#>   1. ├─... %>% collect()
#>   2. ├─dplyr::collect(.)
#>   3. └─arrow:::collect.arrow_dplyr_query(.)
#>   4.   └─base::tryCatch(...)
#>   5.     └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>   6.       └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>   7.         └─value[[3L]](cond)
#>   8.           └─arrow:::augment_io_error_msg(e, call, schema = x$.data$schema)
#>   9.             └─arrow:::handle_augmented_field_misuse(msg, call)
#>  10.               └─rlang::abort(msg, call = call)
{code}