[ https://issues.apache.org/jira/browse/ARROW-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-11415:
-----------------------------------
    Labels: pull-request-available  (was: )

> [R] experimental map_batches cannot find columns
> ------------------------------------------------
>
> Key: ARROW-11415
> URL: https://issues.apache.org/jira/browse/ARROW-11415
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 2.0.0
> Reporter: Gabriel Bassett
> Assignee: Will Jones
> Priority: Minor
> Labels: pull-request-available
> Fix For: 9.0.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> With dataset:
>
> {code:java}
> Schema
> X3: timestamp[us]
> user_id: dictionary<values=string, indices=int32>
> classification_name: dictionary<values=string, indices=int32>
> X2: string
> X1: dictionary<values=string, indices=int32>
> X4: string
> X5: dictionary<values=string, indices=int32>
> X6: dictionary<values=string, indices=int32>
> {code}
> The following succeeds:
> {code:java}
> chunk <- ds %>%
>   select(user_id) %>%
>   collect() %>%
>   count(user_id) %>%
>   as_tibble() %>%
>   count(user_id, wt=n)
> {code}
> While the following fails:
> {code:java}
> chunk <- ds %>%
>   select(user_id) %>%
>   arrow::map_batches(~count(., user_id)) %>%
>   as_tibble() %>%
>   count(user_id, wt=x)
> {code}
> With error:
> {code:java}
> Error: Can't subset columns that don't exist.
> ✖ Column `.drop` doesn't exist.
> Traceback:
> 1. ds %>% select(user_id) %>% arrow::map_batches(~count(.,
>  . user_id)) %>% as_tibble() %>% count(user_id, wt = x)
> 2. count(., user_id, wt = x)
> 3. group_by(x, ..., .add = TRUE, .drop = .drop)
> 4. as_tibble(.)
> 5. arrow::map_batches(., ~count(., user_id))
> 6. lapply(scanner$Scan(), function(scan_task) {
>  . lapply(scan_task$Execute(), function(batch) {
>  . FUN(batch, ...)
>  . })
>  . })
> 7. map(.x, .f, ...)
> 8. .f(.x[[i]], ...)
> 9. lapply(scan_task$Execute(), function(batch) {
>  . FUN(batch, ...)
>  . })
> 10. map(.x, .f, ...)
> 11. .f(.x[[i]], ...)
> 12. FUN(batch, ...)
> 13. count(., user_id)
> 14. tally(out, wt = !!enquo(wt), sort = sort, name = name)
> 15. (function() {
>  . old.options <- options(dplyr.summarise.inform = FALSE)
>  . on.exit(options(old.options))
>  . summarise(x, `:=`(!!name, !!n))
>  . })()
> 16. summarise(x, `:=`(!!name, !!n))
> 17. summarise.arrow_dplyr_query(x, `:=`(!!name, !!n))
> 18. dplyr::select(.data, vars_to_keep)
> 19. select.arrow_dplyr_query(.data, vars_to_keep)
> 20. column_select(arrow_dplyr_query(.data), !!!enquos(...))
> 21. .FUN(names(.data), !!!enquos(...))
> 22. eval_select_impl(NULL, .vars, expr(c(!!!dots)), include = .include,
>  . exclude = .exclude, strict = .strict, name_spec = unique_name_spec,
>  . uniquely_named = TRUE)
> 23. with_subscript_errors(vars_select_eval(vars, expr, strict, data = x,
>  . name_spec = name_spec, uniquely_named = uniquely_named, allow_rename = allow_rename,
>  . type = type), type = type)
> 24. tryCatch(instrument_base_errors(expr), vctrs_error_subscript = function(cnd) {
>  . cnd$subscript_action <- subscript_action(type)
>  . cnd$subscript_elt <- "column"
>  . cnd_signal(cnd)
>  . })
> 25. tryCatchList(expr, classes, parentenv, handlers)
> 26. tryCatchOne(expr, names, parentenv, handlers[[1L]])
> 27. value[[3L]](cond)
> 28. cnd_signal(cnd)
> 29. rlang:::signal_abort(x)
> {code}
> The dataset is 8 parquet files with no hive partitioning.
>
> sessionInfo():
> {code:java}
> R version 4.0.3 (2020-10-10)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 18.04.3 LTS
>
> Matrix products: default
> BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
> LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4
> [5] readr_1.4.0 tidyr_1.1.2 tibble_3.0.4 ggplot2_3.3.2
> [9] tidyverse_1.3.0 dtplyr_1.0.1 data.table_1.13.2
>
> loaded via a namespace (and not attached):
> [1] Rcpp_1.0.5 lubridate_1.7.9.2 aws.ec2metadata_0.2.0
> [4] ps_1.5.0 arrow_2.0.0 assertthat_0.2.1
> [7] digest_0.6.27 utf8_1.1.4 aws.signature_0.6.0
> [10] mime_0.9 IRdisplay_0.7.0 R6_2.5.0
> [13] cellranger_1.1.0 repr_1.1.0 backports_1.2.0
> [16] reprex_0.3.0 evaluate_0.14 httr_1.4.2
> [19] pillar_1.4.7 rlang_0.4.9 curl_4.3
> [22] uuid_0.1-4 readxl_1.3.1 rstudioapi_0.13
> [25] bit_4.0.4 munsell_0.5.0 broom_0.7.2
> [28] compiler_4.0.3 modelr_0.1.8 pkgconfig_2.0.3
> [31] base64enc_0.1-3 htmltools_0.5.0 tidyselect_1.1.0
> [34] fansi_0.4.1 crayon_1.3.4 dbplyr_2.0.0
> [37] withr_2.3.0 grid_4.0.3 jsonlite_1.7.1
> [40] gtable_0.3.0 lifecycle_0.2.0 DBI_1.1.0
> [43] magrittr_2.0.1 scales_1.1.1 cli_2.2.0
> [46] stringi_1.5.3 fs_1.5.0 xml2_1.3.2
> [49] ellipsis_0.3.1 generics_0.1.0 vctrs_0.3.5
> [52] IRkernel_1.1.1 tools_4.0.3 bit64_4.0.5
> [55] glue_1.4.2 hms_0.5.3 aws.s3_0.3.22
> [58] colorspace_2.0-0 rvest_0.3.6 pbdZMQ_0.3-3.1
> {code}
>

--
This message was sent by Atlassian Jira (v8.20.7#820007)