[ 
https://issues.apache.org/jira/browse/ARROW-15627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-15627:
-------------------------------
    Priority: Minor  (was: Major)

> [R] Support unify_schemas for union datasets
> --------------------------------------------
>
>                 Key: ARROW-15627
>                 URL: https://issues.apache.org/jira/browse/ARROW-15627
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 7.0.0
>            Reporter: Will Jones
>            Priority: Minor
>              Labels: dataset
>             Fix For: 8.0.0
>
>
> This also came out of the discussion on [https://github.com/apache/arrow/issues/12371].
> You can unify schemas across different Parquet files, but it seems you 
> can't union two (or more) datasets that have different schemas. This 
> is odd, because we do compute the unified schema on [this 
> line|https://github.com/apache/arrow/blob/ba0814e60a451525dd5492b68059aad8a4bdaf4f/r/R/dataset.R#L189],
>  only to later assert that all the schemas are the same.
> {code:R}
> library(arrow)
> library(dplyr)
> df1 <- arrow_table(x = array(c(1, 2, 3)),
>                    y = array(c("a", "b", "c")))
> df2 <- arrow_table(x = array(c(4, 5)),
>                    z = array(c("d", "e")))
> df1 %>% write_dataset("example1", format="parquet")
> df2 %>% write_dataset("example2", format="parquet")
> ds1 <- open_dataset("example1", format="parquet")
> ds2 <- open_dataset("example2", format="parquet")
> # These don't work:
> ds <- c(ds1, ds2) # c() does the same thing as the next line
> ds <- open_dataset(list(ds1, ds2)) # This fails due to a mismatch in schemas
> ds <- open_dataset(c("example1", "example2"), format="parquet",
>                    unify_schemas = TRUE)
> # This does:
> ds <- open_dataset(c("example2/part-0.parquet", "example1/part-0.parquet"),
>                    format="parquet", unify_schemas = TRUE)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
