Neal Richardson created ARROW-18286:
---------------------------------------

             Summary: [R] Troubles with using augmented columns
                 Key: ARROW-18286
                 URL: https://issues.apache.org/jira/browse/ARROW-18286
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, R
            Reporter: Neal Richardson


We can project to add augmented fields like {{__filename}}, but there are a few 
catches. Given:

{code}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

ds <- InMemoryDataset$create(mtcars) %>%
  mutate(f = add_filename())

show_query(ds)
#> ExecPlan with 3 nodes:
#> 2:SinkNode{}
#>   1:ProjectNode{projection=[mpg, cyl, disp, hp, drat, wt, qsec, vs, am, 
gear, carb, "f": __filename]}
#>     0:SourceNode{}

collect(ds)
#>     mpg cyl  disp  hp drat    wt  qsec vs am gear carb         f
#> 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 in-memory
#> 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4 in-memory
#> 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1 in-memory
...
{code}

Issue #1: you can't filter on that column because (my theory, based on the 
evidence) the ScanNode takes a projection and filter, but the filter is not 
evaluated with the augmented schema, so it doesn't find {{__filename}}. This 
seems fixable in C++.

{code}
ds %>%
  filter(f == "in-memory") %>%
  collect()
#> Error in `collect()`:
#> ! Invalid: No match for FieldRef.Name(__filename) in mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> ℹ `add_filename()` or use of the `__filename` augmented field can only be 
used with with Dataset objects, and can only be added before doing an 
aggregation or a join.

#> Backtrace:
#>      ▆
#>   1. ├─ds %>% filter(f == "in-memory") %>% collect()
#>   2. ├─dplyr::collect(.)
#>   3. └─arrow:::collect.arrow_dplyr_query(.)
#>   4.   └─base::tryCatch(...)
#>   5.     └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>   6.       └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>   7.         └─value[[3L]](cond)
#>   8.           └─arrow:::augment_io_error_msg(e, call, schema = 
x$.data$schema)
#>   9.             └─arrow:::handle_augmented_field_misuse(msg, call)
#>  10.               └─rlang::abort(msg, call = call)
{code}

Proof that it is in the ScanNode: If we {{collapse()}} the query after 
projecting to include filename but before the filter, the filter doesn't get 
included in the ScanNode, it's only applied after, as a FilterNode. This works:

{code}
ds %>%
  collapse() %>%
  filter(f == "in-memory") %>%
  collect()
#>     mpg cyl  disp  hp drat    wt  qsec vs am gear carb         f
#> 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 in-memory
#> 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4 in-memory
#> 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1 in-memory
...
{code}

A related failure mode: you have to first project to include the augmented 
column, you can't just include it in a filter:

{code}
InMemoryDataset$create(mtcars) %>%
  filter(add_filename() == "in-memory") %>%
  collect()
#> Error in `collect()`:
#> ! Invalid: No match for FieldRef.Name(__filename) in mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> ℹ `add_filename()` or use of the `__filename` augmented field can only be 
used with with Dataset objects, and can only be added before doing an 
aggregation or a join.

#> Backtrace:
#>      ▆
#>   1. ├─... %>% collect()
#>   2. ├─dplyr::collect(.)
#>   3. └─arrow:::collect.arrow_dplyr_query(.)
#>   4.   └─base::tryCatch(...)
#>   5.     └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>   6.       └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>   7.         └─value[[3L]](cond)
#>   8.           └─arrow:::augment_io_error_msg(e, call, schema = 
x$.data$schema)
#>   9.             └─arrow:::handle_augmented_field_misuse(msg, call)
#>  10.               └─rlang::abort(msg, call = call)
{code}

Issue #2, following on that: you can only add the augmented fields at the start 
of the query, something that goes in the ScanNode. This seems like something we 
would have to catch in R and error at the time add_filename() is called. That 
could probably be covered in ARROW-17356.

{code}
InMemoryDataset$create(mtcars) %>%
  collapse() %>%
  collapse() %>%
  filter(add_filename() == "in-memory") %>%
  collect()
#> Error in `collect()`:
#> ! Invalid: No match for FieldRef.Name(__filename) in mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> ℹ `add_filename()` or use of the `__filename` augmented field can only be 
used with with Dataset objects, and can only be added before doing an 
aggregation or a join.

#> Backtrace:
#>      ▆
#>   1. ├─... %>% collect()
#>   2. ├─dplyr::collect(.)
#>   3. └─arrow:::collect.arrow_dplyr_query(.)
#>   4.   └─base::tryCatch(...)
#>   5.     └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>   6.       └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>   7.         └─value[[3L]](cond)
#>   8.           └─arrow:::augment_io_error_msg(e, call, schema = 
x$.data$schema)
#>   9.             └─arrow:::handle_augmented_field_misuse(msg, call)
#>  10.               └─rlang::abort(msg, call = call)
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to