[ https://issues.apache.org/jira/browse/ARROW-16783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andy Teucher updated ARROW-16783:
---------------------------------
Description:

{{write_dataset()}} fails when the object being written has duplicated column names. This is probably reasonable behaviour, but the error message is misleading:

{code:r}
library(arrow, warn.conflicts = FALSE)

df <- data.frame(
  id = c("a", "b", "c"),
  x = 1:3,
  x = 4:6,
  check.names = FALSE
)

write_dataset(df, "df")
#> Error: 'dataset' must be a Dataset, RecordBatch, Table, arrow_dplyr_query, or data.frame, not "data.frame"
{code}

[{{write_dataset()}} calls {{as_adq()}} inside a {{tryCatch()}} statement|https://github.com/apache/arrow/blob/0d5cf1882228624271062e6c19583c8b0c361a20/r/R/dataset-write.R#L146-L160], so any error from {{as_adq()}} is swallowed and the error emitted is about the class of the object. The real error comes from here:

{code:r}
arrow:::as_adq(df)
#> Error in `arrow_dplyr_query()`:
#> ! Duplicated field names
#> ✖ The following field names were found more than once in the data: "x"
{code}

I'm not sure what your preferred fix is here; two options that come to mind are:

1. Explicitly check for compatible classes before calling {{as_adq()}} instead of using {{tryCatch()}}, allowing {{as_adq()}} to emit its own errors.
2. Check for duplicate column names before the {{tryCatch()}} block.

My thought is that option 1 is better, as option 2 means that checking for duplicates would happen twice (once inside {{write_dataset()}} and once again inside {{as_adq()}}).

I'm happy to work on a fix if you like!
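For illustration, a minimal sketch of what option 1 might look like. This is hypothetical: the real {{write_dataset()}} takes many more arguments, and the exact class list and error wording would need to match the existing code at the link above.

{code:r}
# Hypothetical sketch of option 1: validate the class up front, then let
# as_adq() raise its own (informative) errors instead of swallowing them
# in a tryCatch().
write_dataset <- function(dataset, path, ...) {
  supported <- c("Dataset", "RecordBatch", "Table",
                 "arrow_dplyr_query", "data.frame")
  if (!inherits(dataset, supported)) {
    stop(
      "'dataset' must be a Dataset, RecordBatch, Table, arrow_dplyr_query, ",
      "or data.frame, not ", deparse(class(dataset)),
      call. = FALSE
    )
  }
  # No tryCatch() here: if the input has duplicated column names,
  # as_adq() now surfaces "Duplicated field names" directly.
  dataset <- as_adq(dataset)
  # ... rest of write_dataset() unchanged ...
}
{code}

With this shape, the misleading "must be a Dataset, ... not \"data.frame\"" error can only fire when the class really is unsupported, and the duplicate-name check stays in one place ({{as_adq()}}).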
> [R] write_dataset fails with an uninformative message when there are duplicated column names
> --------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16783
>                 URL: https://issues.apache.org/jira/browse/ARROW-16783
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 8.0.0
>            Reporter: Andy Teucher
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.20.7#820007)