[ https://issues.apache.org/jira/browse/ARROW-18195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653169#comment-17653169 ]
Will Jones commented on ARROW-18195:
------------------------------------

Thank you for all the reproductions. I zeroed in on one simple one and was able to reproduce in C++. Additional observations:

{code:R}
library(dplyr, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)

# Condition has NA and more than 64 values
# Expression generated internally:
#   case_when({1=x}, 1)
test_df4 = tibble::tibble(x = c(NA, rep(TRUE, 64)))
test_arrow4 = arrow_table(test_df4)
test_arrow4 %>%
  mutate(y = case_when(x ~ 1L)) %>%
  collect() %>%
  tail()
#> # A tibble: 6 × 2
#>   x         y
#>   <lgl> <int>
#> 1 TRUE      1
#> 2 TRUE      1
#> 3 TRUE      1
#> 4 TRUE      1
#> 5 TRUE      1
#> 6 TRUE     NA

# It seems to be coming from the next clause, which defaults to NA
# Expression generated internally:
#   case_when({1=x, 2=true}, 1, 2)
test_df4 = tibble::tibble(x = c(NA, rep(TRUE, 64)))
test_arrow4 = arrow_table(test_df4)
test_arrow4 %>%
  mutate(y = case_when(x ~ 1L, TRUE ~ 2L)) %>%
  collect() %>%
  tail()
#> # A tibble: 6 × 2
#>   x         y
#>   <lgl> <int>
#> 1 TRUE      1
#> 2 TRUE      1
#> 3 TRUE      1
#> 4 TRUE      1
#> 5 TRUE      1
#> 6 TRUE      2

# Applies also to vectors
test_df4 = tibble::tibble(x = c(NA, rep(TRUE, 64)), left = rep(1L, 65), right = rep(2L, 65))
test_arrow4 = arrow_table(test_df4)
test_arrow4 %>%
  mutate(y = case_when(x ~ left, TRUE ~ right)) %>%
  collect() %>%
  tail()
#> # A tibble: 6 × 4
#>   x      left right     y
#>   <lgl> <int> <int> <int>
#> 1 TRUE      1     2     1
#> 2 TRUE      1     2     1
#> 3 TRUE      1     2     1
#> 4 TRUE      1     2     1
#> 5 TRUE      1     2     1
#> 6 TRUE      1     2     2

# It does seem the 65th and onward element become the else value for no reason
lapply(c(65, 68, 127, 140), function(len) {
  test_df4 = tibble::tibble(x = c(NA, rep(TRUE, len - 1)))
  test_arrow4 = arrow_table(test_df4)
  y <- test_arrow4 %>%
    mutate(y = case_when(x ~ 1L)) %>%
    collect() %>%
    .$y
  which(is.na(y))
})
#> [[1]]
#> [1]  1 65
#>
#> [[2]]
#> [1]  1 65 66 67 68
#>
#> [[3]]
#>  [1]   1  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82
#> [20]  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101
#> [39] 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
#> [58] 121 122 123 124 125 126 127
#>
#> [[4]]
#>  [1]  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83
#> [20]  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102
#> [39] 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121
#> [58] 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
{code}

<sup>Created on 2022-12-30 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>

> [R][C++] Final value returned by case_when is NA when input has 64 or more
> values and 1 or more NAs
> ---------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-18195
>                 URL: https://issues.apache.org/jira/browse/ARROW-18195
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>    Affects Versions: 10.0.0
>            Reporter: Lee Mendelowitz
>            Assignee: Will Jones
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 11.0.0
>
>         Attachments: test_issue.R
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> There appears to be a bug when processing an Arrow table with NA values and
> using `dplyr::case_when`. A reproducible example is below: the output from
> arrow table processing does not match the output when processing a tibble. If
> the NA's are removed from the dataframe, then the outputs match.
> {noformat}
> ``` r
> library(dplyr)
> #>
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #>
> #>     filter, lag
> #> The following objects are masked from 'package:base':
> #>
> #>     intersect, setdiff, setequal, union
> library(arrow)
> #>
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #>
> #>     timestamp
> library(assertthat)
> play_results = c('single', 'double', 'triple', 'home_run')
> nrows = 1000
> # Change frac_na to 0, and the result error disappears.
> frac_na = 0.05
> # Create a test dataframe with NA values
> test_df = tibble(
>     play_result = sample(play_results, nrows, replace = TRUE)
> ) %>%
>     mutate(
>         play_result = ifelse(runif(nrows) < frac_na, NA_character_, play_result)
>     )
>
> test_arrow = arrow_table(test_df)
> process_plays = function(df) {
>     df %>%
>         mutate(
>             avg = case_when(
>                 play_result == 'single' ~ 1,
>                 play_result == 'double' ~ 1,
>                 play_result == 'triple' ~ 1,
>                 play_result == 'home_run' ~ 1,
>                 is.na(play_result) ~ NA_real_,
>                 TRUE ~ 0
>             )
>         ) %>%
>         count(play_result, avg) %>%
>         arrange(play_result)
> }
> # Compare arrow_table result to tibble result
> result_tibble = process_plays(test_df)
> result_arrow = process_plays(test_arrow) %>% collect()
> assertthat::assert_that(identical(result_tibble, result_arrow))
> #> Error: result_tibble not identical to result_arrow
> ```
> <sup>Created on 2022-10-29 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
> {noformat}
> I have reproduced this issue both on Mac OS and Ubuntu 20.04.
>
> {noformat}
> ```
> r$> sessionInfo()
> R version 4.2.1 (2022-06-23)
> Platform: aarch64-apple-darwin21.5.0 (64-bit)
> Running under: macOS Monterey 12.5.1
> Matrix products: default
> BLAS:   /opt/homebrew/Cellar/openblas/0.3.20/lib/libopenblasp-r0.3.20.dylib
> LAPACK: /opt/homebrew/Cellar/r/4.2.1/lib/R/lib/libRlapack.dylib
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> attached base packages:
> [1] stats     graphics  grDevices datasets  utils     methods   base
> other attached packages:
> [1] assertthat_0.2.1 arrow_10.0.0     dplyr_1.0.10
> loaded via a namespace (and not attached):
>  [1] compiler_4.2.1    pillar_1.8.1      highr_0.9         R.methodsS3_1.8.2 R.utils_2.12.0    tools_4.2.1       bit_4.0.4         digest_0.6.29
>  [9] evaluate_0.15     lifecycle_1.0.1   tibble_3.1.8      R.cache_0.16.0    pkgconfig_2.0.3   rlang_1.0.5       reprex_2.0.2      DBI_1.1.2
> [17] cli_3.3.0         rstudioapi_0.13   yaml_2.3.5        xfun_0.31         fastmap_1.1.0     withr_2.5.0       styler_1.8.0      knitr_1.39
> [25] generics_0.1.3    fs_1.5.2          vctrs_0.4.1       bit64_4.0.5       tidyselect_1.1.2  glue_1.6.2        R6_2.5.1          processx_3.5.3
> [33] fansi_1.0.3       rmarkdown_2.14    purrr_0.3.4       callr_3.7.0       clipr_0.8.0       magrittr_2.0.3    ellipsis_0.3.2    ps_1.7.0
> [41] htmltools_0.5.3   renv_0.16.0       utf8_1.2.2        R.oo_1.25.0
> ```
> {noformat}