[ https://issues.apache.org/jira/browse/ARROW-16897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571887#comment-17571887 ]
Krisztian Szucs commented on ARROW-16897:
-----------------------------------------

Postponing to 10.0 since it depends on several other unresolved issues.

> [R][C++] Full join on Arrow objects is incorrect
> ------------------------------------------------
>
>                 Key: ARROW-16897
>                 URL: https://issues.apache.org/jira/browse/ARROW-16897
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>    Affects Versions: 8.0.0
>         Environment: Linux
>            Reporter: Oliver Reiter
>            Assignee: Weston Pace
>            Priority: Critical
>              Labels: joins, query-engine
>             Fix For: 9.0.0
>
>
> Hello,
> I am trying to do a full join on a dataset. It produces the correct number of
> observations, but not the correct result (the resulting data.frame is just
> filled up with NA-rows).
> My use case: I want to include the 'full' year range for every factor value:
> {code:java}
> library(data.table)
> library(arrow)
> library(dplyr)
> year_range <- 2000:2019
> group_n <- 100
> N <- 1000 ## the resulting data should have 100 groups * 20 years
> dt <- data.table(value = rnorm(N),
>                  group = rep(paste0("g", 1:group_n), length.out = N))
> ## there are only observations for some years in every group
> dt[, year := sample(year_range, size = N / group_n), by = .(group)]
> dt[group == "g1", ]
> ## this would be the 'full' data.table
> group_years <- data.table(group = rep(unique(dt$group), each = 20),
>                           year = rep(year_range, times = 10))
> group_years[group == "g1", ]
> write_dataset(dt, path = "parquet_db")
> db <- open_dataset(sources = "parquet_db")
> ## full_join using data.table -> expected result
> db_full <- merge(dt, group_years,
>                  by = c("group", "year"),
>                  all = TRUE)
> setorder(db_full, group, year)
> db_full[group == "g1", ]
> ## try to do the full_join with arrow -> incorrect result
> db_full_arrow <- db |>
>   full_join(group_years, by = c("group", "year")) |>
>   collect() |>
>   setDT()
> setorder(db_full_arrow, group, year)
> db_full_arrow[group == "g1", ]
> ## or: convert data.table to arrow_table beforehand -> incorrect result
> group_years_arrow <- group_years |>
>   as_arrow_table()
> db_full_arrow <- db |>
>   full_join(group_years_arrow, by = c("group", "year")) |>
>   collect() |>
>   setDT()
> setorder(db_full_arrow, group, year)
> db_full_arrow[group == "g1", ]{code}
> The [documentation|https://arrow.apache.org/docs/r/] says equality joins are
> supported, which should hold also for `full_join` I guess?
> Thanks for your time and work!
>
> Oliver



--
This message was sent by Atlassian Jira
(v8.20.10#820010)