[jira] [Commented] (ARROW-15838) [C++] Key column behavior in joins

Weston Pace (Jira) Mon, 18 Jul 2022 11:15:04 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-15838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17568150#comment-17568150
 ]


Weston Pace commented on ARROW-15838:
-------------------------------------

It appears that pyarrow is actually doing the coalesce (which is the correct 
thing) while R isn't (hence ARROW-16897).  I agree that it would be nice to 
integrate this into the join itself.  [~zagto] had a PR to do this but it was 
built on top of the newer swiss join which, at the time, was not merged, so it 
was put on hold.  It's probably too late to try and pull that in for 9.0.0 but 
I think this is something we can tackle as part of 10.0.0

> [C++] Key column behavior in joins
> ----------------------------------
>
>                 Key: ARROW-15838
>                 URL: https://issues.apache.org/jira/browse/ARROW-15838
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Jonathan Keane
>            Priority: Major
>             Fix For: 10.0.0
>
>
> By default in dplyr (and possibly in pandas too?) coalesces the key column 
> for full joins to be the (non-null) values from both key columns:
> {code}
> > left <- tibble::tibble(
>   key = c(1, 2),
>   A = c(0, 1),                  
> )                      
> left_tab <- Table$create(left)
> > right <- tibble::tibble(
>   key = c(2, 3),
>   B = c(0, 1),
> )                      
> right_tab <- Table$create(right)
> > left %>% full_join(right) 
> Joining, by = "key"
> # A tibble: 3 × 3
>     key     A     B
>   <dbl> <dbl> <dbl>
> 1     1     0    NA
> 2     2     1     0
> 3     3    NA     1
> > left_tab %>% full_join(right_tab) %>% collect()
> # A tibble: 3 × 3
>     key     A     B
>   <dbl> <dbl> <dbl>
> 1     2     1     0
> 2     1     0    NA
> 3    NA    NA     1
> {code}
> And right join, we would expect the key from the right table to be in the 
> result, but we get the key from the left instead:
> {code}
> > left <- tibble::tibble(
>   key = c(1, 2),
>   A = c(0, 1),                  
> )                      
> left_tab <- Table$create(left)
> > right <- tibble::tibble(
>   key = c(2, 3),
>   B = c(0, 1),
> )                      
> right_tab <- Table$create(right)
> > left %>% right_join(right)
> Joining, by = "key"
> # A tibble: 2 × 3
>     key     A     B
>   <dbl> <dbl> <dbl>
> 1     2     1     0
> 2     3    NA     1
> > left_tab %>% right_join(right_tab) %>% collect()
> # A tibble: 2 × 3
>     key     A     B
>   <dbl> <dbl> <dbl>
> 1     2     1     0
> 2    NA    NA     1
> {code}
> Additionally, we should be able to keep both key columns with an option (cf 
> https://github.com/apache/arrow/blob/9719eae66dcf38c966ae769215d27020a6dd5550/r/R/dplyr-join.R#L32)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (ARROW-15838) [C++] Key column behavior in joins

Reply via email to