[ 
https://issues.apache.org/jira/browse/ARROW-15838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566909#comment-17566909
 ] 

Ian Cook edited comment on ARROW-15838 at 7/14/22 4:17 PM:
-----------------------------------------------------------

IIUC, I think it should be possible to resolve this with a change only to the R 
bindings, by using the {{coalesce}} function. I don't think we need changes in 
Acero to resolve this, so it does not really depend on ARROW-15957.


was (Author: icook):
IIUC, I think it should be possible to resolve this with a change only to the R 
bindings, by using the `coalesce` function. I don't think we need changes in 
Acero to resolve this, so it does not really depend on ARROW-15957.

> [C++] Key column behavior in joins
> ----------------------------------
>
>                 Key: ARROW-15838
>                 URL: https://issues.apache.org/jira/browse/ARROW-15838
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Jonathan Keane
>            Priority: Major
>             Fix For: 10.0.0
>
>
> By default in dplyr (and possibly in pandas too?) coalesces the key column 
> for full joins to be the (non-null) values from both key columns:
> {code}
> > left <- tibble::tibble(
>   key = c(1, 2),
>   A = c(0, 1),                  
> )                      
> left_tab <- Table$create(left)
> > right <- tibble::tibble(
>   key = c(2, 3),
>   B = c(0, 1),
> )                      
> right_tab <- Table$create(right)
> > left %>% full_join(right) 
> Joining, by = "key"
> # A tibble: 3 × 3
>     key     A     B
>   <dbl> <dbl> <dbl>
> 1     1     0    NA
> 2     2     1     0
> 3     3    NA     1
> > left_tab %>% full_join(right_tab) %>% collect()
> # A tibble: 3 × 3
>     key     A     B
>   <dbl> <dbl> <dbl>
> 1     2     1     0
> 2     1     0    NA
> 3    NA    NA     1
> {code}
> And right join, we would expect the key from the right table to be in the 
> result, but we get the key from the left instead:
> {code}
> > left <- tibble::tibble(
>   key = c(1, 2),
>   A = c(0, 1),                  
> )                      
> left_tab <- Table$create(left)
> > right <- tibble::tibble(
>   key = c(2, 3),
>   B = c(0, 1),
> )                      
> right_tab <- Table$create(right)
> > left %>% right_join(right)
> Joining, by = "key"
> # A tibble: 2 × 3
>     key     A     B
>   <dbl> <dbl> <dbl>
> 1     2     1     0
> 2     3    NA     1
> > left_tab %>% right_join(right_tab) %>% collect()
> # A tibble: 2 × 3
>     key     A     B
>   <dbl> <dbl> <dbl>
> 1     2     1     0
> 2    NA    NA     1
> {code}
> Additionally, we should be able to keep both key columns with an option (cf 
> https://github.com/apache/arrow/blob/9719eae66dcf38c966ae769215d27020a6dd5550/r/R/dplyr-join.R#L32)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to