Andy Teucher created ARROW-16783:
------------------------------------

             Summary: [R] write_dataset fails with an uninformative message when duplicated column names
                 Key: ARROW-16783
                 URL: https://issues.apache.org/jira/browse/ARROW-16783
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
    Affects Versions: 8.0.0
            Reporter: Andy Teucher
{{write_dataset()}} fails when the object being written has duplicated column names. This is probably reasonable behaviour, but the error message is misleading:

{code:r}
library(arrow, warn.conflicts = FALSE)

df <- data.frame(
  id = c("a", "b", "c"),
  x = 1:3,
  x = 4:6,
  check.names = FALSE
)

write_dataset(df, "df.parquet")
#> Error: 'dataset' must be a Dataset, RecordBatch, Table, arrow_dplyr_query, or data.frame, not "data.frame"
{code}

[{{write_dataset()}} calls {{as_adq()}} inside a {{tryCatch()}} statement|https://github.com/apache/arrow/blob/0d5cf1882228624271062e6c19583c8b0c361a20/r/R/dataset-write.R#L146-L160], so any error from {{as_adq()}} is swallowed and the error emitted is about the class of the object. The real error comes from here:

{code:r}
arrow:::as_adq(df)
#> Error in `arrow_dplyr_query()`:
#> ! Duplicated field names
#> ✖ The following field names were found more than once in the data: "x"
{code}

I'm not sure what your preferred fix is here... two options that come to mind are:

1. Explicitly check for compatible classes before calling {{as_adq()}}, instead of using {{tryCatch()}}, OR
2. Check for duplicate column names before the {{tryCatch()}} block

My thought is that option 1 is better, as option 2 means that checking for duplicates would happen twice (once inside {{write_dataset()}} and once again inside {{as_adq()}}). I'm happy to work on a fix if you like!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)