Will Jones created ARROW-16085:
----------------------------------

             Summary: [R] Support unifying schemas for InMemoryDatasets
                 Key: ARROW-16085
                 URL: https://issues.apache.org/jira/browse/ARROW-16085
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
    Affects Versions: 7.0.0
            Reporter: Will Jones
             Fix For: 8.0.0


 

Combining two InMemoryDatasets whose schemas differ currently fails:

{code:R}
library(arrow)
library(dplyr)  # provides %>% and collect()

sub_df1 <- Table$create(
  x = Array$create(c(1, 2, 3)),
  y = Array$create(c("a", "b", "c"))
)
sub_df2 <- Table$create(
  x = Array$create(c(4, 5)),
  z = Array$create(c("d", "e"))
)

ds1 <- InMemoryDataset$create(sub_df1)
ds2 <- InMemoryDataset$create(sub_df2)
ds <- c(ds1, ds2)
actual <- ds %>% collect()
{code}

{code}
Type error: yielded batch had schema x: double
y: string which did not match InMemorySource's: x: double
y: string
z: string
/Users/willjones/Documents/arrows/arrow-quick/cpp/src/arrow/util/iterator.h:541 
 child_.Next()
/Users/willjones/Documents/arrows/arrow-quick/cpp/src/arrow/util/iterator.h:152 
 value_.status()
/Users/willjones/Documents/arrows/arrow-quick/cpp/src/arrow/util/iterator.h:180 
 maybe_element
/Users/willjones/Documents/arrows/arrow-quick/cpp/src/arrow/dataset/scanner.cc:840
  fragments_it.ToVector()
{code}

If this were fixed, we could implement a function that does for Tables what {{dplyr::bind_rows}} does for tibbles:

{code:R}
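concat_tables <- function(..., schema = NULL) {
  # Gather the Tables passed via ... (list2() comes from rlang)
  tables <- rlang::list2(...)

  # Wrap each Table in an InMemoryDataset and combine them under one schema
  # (map() comes from purrr)
  dataset <- open_dataset(purrr::map(tables, InMemoryDataset$create), schema = schema)

  # Return an Arrow Table rather than a data frame
  dplyr::collect(dataset, as_data_frame = FALSE)
}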
{code}
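
For reference, the row-binding semantics being asked for can be illustrated without Arrow. The following is only a base-R sketch on plain data frames, not a proposed implementation; {{bind_rows_sketch}} is a hypothetical helper name that exists in neither arrow nor dplyr. It takes the union of the column names as the unified "schema", fills any column an input lacks with NA, and then row-binds the padded inputs:

{code:R}
# Hypothetical helper (not part of arrow or dplyr): row-bind data frames
# whose columns differ, filling columns absent from an input with NA.
bind_rows_sketch <- function(...) {
  dfs <- list(...)

  # Unified "schema": union of column names, in first-seen order
  all_cols <- unique(unlist(lapply(dfs, names)))

  padded <- lapply(dfs, function(df) {
    # Add each missing column as NA, then put columns in a common order
    for (col in setdiff(all_cols, names(df))) {
      df[[col]] <- NA
    }
    df[all_cols]
  })

  do.call(rbind, padded)
}

# Same shapes as sub_df1 and sub_df2 above, as plain data frames
df1 <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"), stringsAsFactors = FALSE)
df2 <- data.frame(x = c(4, 5), z = c("d", "e"), stringsAsFactors = FALSE)

result <- bind_rows_sketch(df1, df2)
{code}

Here {{result}} has five rows and the columns {{x}}, {{y}}, {{z}}, with {{y}} missing for the rows from {{df2}} and {{z}} missing for the rows from {{df1}}. An Arrow version would fill with nulls of the unified field type instead of NA.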
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
