[ https://issues.apache.org/jira/browse/ARROW-16783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andy Teucher updated ARROW-16783:
---------------------------------
Description:

{{write_dataset()}} fails when the object being written has duplicated column names. This is probably reasonable behaviour, but the error message is misleading:

{code:r}
library(arrow, warn.conflicts = FALSE)

df <- data.frame(
  id = c("a", "b", "c"),
  x = 1:3,
  x = 4:6,
  check.names = FALSE
)

write_dataset(df, "df")
#> Error: 'dataset' must be a Dataset, RecordBatch, Table, arrow_dplyr_query, or data.frame, not "data.frame"
{code}

[{{write_dataset()}} calls {{as_adq()}} inside a {{tryCatch()}} statement|https://github.com/apache/arrow/blob/0d5cf1882228624271062e6c19583c8b0c361a20/r/R/dataset-write.R#L146-L160], so any error from {{as_adq()}} is swallowed and the error emitted is about the class of the object. The real error comes from here:

{code:r}
arrow:::as_adq(df)
#> Error in `arrow_dplyr_query()`:
#> ! Duplicated field names
#> ✖ The following field names were found more than once in the data: "x"
{code}

I'm not sure what your preferred fix is here; two options that come to mind are:

1. Explicitly check for compatible classes before calling {{as_adq()}} instead of using {{tryCatch()}}, allowing {{as_adq()}} to emit its own errors.
2. Check for duplicate column names before the {{tryCatch()}} block.

My thought is that option 1 is better, as option 2 means that checking for duplicates would happen twice (once inside {{write_dataset()}} and once again inside {{as_adq()}}).

I'm happy to work on a fix if you like!
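For illustration, a minimal sketch of what option 1 might look like. This is hypothetical: the real {{write_dataset()}} takes many more arguments, and the exact class list and error wording would need to match the existing code at the link above.

{code:r}
# Hypothetical sketch of option 1: validate the class up front, then let
# as_adq() raise its own (informative) errors instead of swallowing them
# in a tryCatch().
write_dataset <- function(dataset, path, ...) {
  supported <- c("Dataset", "RecordBatch", "Table",
                 "arrow_dplyr_query", "data.frame")
  if (!inherits(dataset, supported)) {
    stop(
      "'dataset' must be a Dataset, RecordBatch, Table, arrow_dplyr_query, ",
      "or data.frame, not ", deparse(class(dataset)),
      call. = FALSE
    )
  }
  # No tryCatch() here: if the input has duplicated column names,
  # as_adq() now surfaces "Duplicated field names" directly.
  dataset <- as_adq(dataset)
  # ... rest of write_dataset() unchanged ...
}
{code}

With this shape, the misleading "must be a Dataset, ... not \"data.frame\"" error can only fire when the class really is unsupported, and the duplicate-name check stays in one place ({{as_adq()}}).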
> [R] write_dataset fails with an uninformative message when there are duplicated column names
> --------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16783
>                 URL: https://issues.apache.org/jira/browse/ARROW-16783
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 8.0.0
>            Reporter: Andy Teucher
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.20.7#820007)