nealrichardson commented on code in PR #14286:
URL: https://github.com/apache/arrow/pull/14286#discussion_r995765800


##########
r/tests/testthat/test-dplyr-join.R:
##########
@@ -338,3 +346,30 @@ test_that("arrow dplyr query can join two datasets", {
     }
   )
 })
+
+test_that("full joins handle keep", {
+  full_data_df <- tibble::tibble(
+    x = rep(c("a", "b"), each = 5),
+    y = rep(1:5, 2),
+    z = rep("zzz", 10),
+    index = 1:10
+  )
+  small_dataset_df <- tibble::tibble(
+    value = c(0.1, 0.2, 0.3, 0.4, 0.5),
+    x = c(rep("a", 3), rep("b", 2)),
+    y = 1:5,
+    z = 6:10
+  )
+  full_data <- Table$create(full_data_df)
+  small_dataset <- Table$create(small_dataset_df)
+
+  for (keep in c(TRUE, FALSE)) {
+    result <- full_join(small_dataset, full_data, by = c("y", "x"), keep = 
!!keep) %>%
+      arrange(index) %>%
+      collect()
+    expected <- full_join(small_dataset_df, full_data_df, by = c("y", "x"), 
keep = !!keep) %>%
+      arrange(index) %>%
+      collect()
+    expect_equal(result, expected)
+  }

Review Comment:
   You should be able to use compare_dplyr_binding here too. The data you're 
joining will get converted to Table automatically
   
   ```suggestion
   
     for (keep in c(TRUE, FALSE)) {
       compare_dplyr_binding(
         .input %>%
           full_join(full_data_df, by = c("y", "x"), keep = !!keep) %>%
           arrange(index) %>%
           collect(),
         small_dataset_df
       )
     }
   ```



##########
r/R/dplyr-join.R:
##########
@@ -83,7 +94,17 @@ full_join.arrow_dplyr_query <- function(x,
                                         suffix = c(".x", ".y"),
                                         ...,
                                         keep = FALSE) {
-  do_join(x, y, by, copy, suffix, ..., keep = keep, join_type = "FULL_OUTER")
+  query <- do_join(x, y, by, copy, suffix, ..., keep = keep, join_type = 
"FULL_OUTER")
+
+  # If we are doing a full outer join and not keeping the join keys of
+  # both sides, we need to coalesce. Otherwise, rows that exist in the
+  # RHS will have NAs for the join keys.
+  if (!keep) {
+    query$selected_columns <- post_join_projection(names(x), names(y), 
handle_join_by(by, x, y), suffix)
+    collapse.arrow_dplyr_query(query)

Review Comment:
   I don't think you need this collapse() here since do_join() collapses for 
you and you're just projecting after. (Moreover, since you aren't assigning the 
result here and return `query` below, the result of this collapse isn't being 
used.)
   
   ```suggestion
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to