[ https://issues.apache.org/jira/browse/ARROW-15838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566909#comment-17566909 ]
Ian Cook commented on ARROW-15838: ---------------------------------- IIUC, I think it should be possible to resolve this with a change only to the R bindings, by using the `coalesce` function. I don't think we need changes in Acero to resolve this, so it does not really depend on ARROW-15957. > [C++] Key column behavior in joins > ---------------------------------- > > Key: ARROW-15838 > URL: https://issues.apache.org/jira/browse/ARROW-15838 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ > Reporter: Jonathan Keane > Priority: Major > Fix For: 10.0.0 > > > By default in dplyr (and possibly in pandas too?) coalesces the key column > for full joins to be the (non-null) values from both key columns: > {code} > > left <- tibble::tibble( > key = c(1, 2), > A = c(0, 1), > ) > left_tab <- Table$create(left) > > right <- tibble::tibble( > key = c(2, 3), > B = c(0, 1), > ) > right_tab <- Table$create(right) > > left %>% full_join(right) > Joining, by = "key" > # A tibble: 3 × 3 > key A B > <dbl> <dbl> <dbl> > 1 1 0 NA > 2 2 1 0 > 3 3 NA 1 > > left_tab %>% full_join(right_tab) %>% collect() > # A tibble: 3 × 3 > key A B > <dbl> <dbl> <dbl> > 1 2 1 0 > 2 1 0 NA > 3 NA NA 1 > {code} > And right join, we would expect the key from the right table to be in the > result, but we get the key from the left instead: > {code} > > left <- tibble::tibble( > key = c(1, 2), > A = c(0, 1), > ) > left_tab <- Table$create(left) > > right <- tibble::tibble( > key = c(2, 3), > B = c(0, 1), > ) > right_tab <- Table$create(right) > > left %>% right_join(right) > Joining, by = "key" > # A tibble: 2 × 3 > key A B > <dbl> <dbl> <dbl> > 1 2 1 0 > 2 3 NA 1 > > left_tab %>% right_join(right_tab) %>% collect() > # A tibble: 2 × 3 > key A B > <dbl> <dbl> <dbl> > 1 2 1 0 > 2 NA NA 1 > {code} > Additionally, we should be able to keep both key columns with an option (cf > https://github.com/apache/arrow/blob/9719eae66dcf38c966ae769215d27020a6dd5550/r/R/dplyr-join.R#L32) > -- This message was sent by Atlassian Jira (v8.20.10#820010)