[jira] [Comment Edited] (ARROW-13887) [R] Capture error produced when reading in CSV file with headers and using a schema, and add suggestion

Weston Pace (Jira) Thu, 14 Oct 2021 11:16:06 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-13887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428958#comment-17428958
 ]


Weston Pace edited comment on ARROW-13887 at 10/14/21, 6:15 PM:
----------------------------------------------------------------

> This implies we cannot rely on capturing the C++ error message and offering a 
> more informative option in R as sometimes the error might not be triggered 
> (in the case of a CSV where all the columns are strings / characters).
>
> The solution might be to somehow assess whether the CSV file has headers or 
> not, which I do not think is possible. 

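I agree this case is probably unavoidable, and I'm not sure we should stress 
too much about it.  FYI, pandas has the same ambiguity: when column names are 
passed explicitly, it reads the header row as data without complaint: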

{noformat}
>>> import io
>>> import pandas as pd
>>> x = "books,authors\ncirce,miller"
>>> pd.read_csv(io.StringIO(x))
   books authors
0  circe  miller
>>> pd.read_csv(io.StringIO(x), names=['books', 'authors'])
   books  authors
0  books  authors
1  circe   miller
{noformat}
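
To illustrate why detecting a header can only ever be a heuristic, here is a small sketch using Python's standard-library csv.Sniffer. This is not something Arrow or pandas does; it is only meant to show the all-strings case from the quote above. The two sample strings are the data from this issue and from the pandas session.

```python
import csv

# Mixed column types: the header row stands out from the numeric data.
typed = "company,price\nAMZN,3463.12\nGOOG,2884.38\nBKNG,2300.46\nTSLA,732.39\n"

# All-string columns: nothing reliably distinguishes the header from the data.
strings = "books,authors\ncirce,miller\n"

sniffer = csv.Sniffer()
typed_has_header = sniffer.has_header(typed)      # header is detected
strings_has_header = sniffer.has_header(strings)  # header is missed

print(typed_has_header, strings_has_header)
```

The heuristic compares the first row against the types and string lengths of the rows below it, so it finds the header in the first sample but not in the second, even though both files have one.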

Pandas' read_csv method offers up this long, complicated parameter as a 
solution:

{noformat}
header - int, list of int, default ‘infer’

    Row number(s) to use as the column names, and the start of the data.
    Default behavior is to infer the column names: if no names are passed the
    behavior is identical to header=0 and column names are inferred from the
    first line of the file, if column names are passed explicitly then the
    behavior is identical to header=None. Explicitly pass header=0 to be able
    to replace existing names. The header can be a list of integers that
    specify row locations for a multi-index on the columns e.g. [0,1,3].
    Intervening rows that are not specified will be skipped (e.g. 2 in this
    example is skipped). Note that this parameter ignores commented lines and
    empty lines if skip_blank_lines=True, so header=0 denotes the first line
    of data rather than the first line of the file.
{noformat}
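
As a concrete reading of "Explicitly pass header=0 to be able to replace existing names", a minimal sketch with the same two-line input as above. The replacement names `title` and `writer` are made up for the example.

```python
import io

import pandas as pd

x = "books,authors\ncirce,miller"

# header=0 tells pandas that row 0 is a header row; names= then replaces
# that header instead of treating it as a row of data.
df = pd.read_csv(io.StringIO(x), names=["title", "writer"], header=0)

print(list(df.columns))  # the replacement names
print(len(df))           # only the one data row remains
```

This is roughly the pandas counterpart of passing both a schema and skip=1 to read_csv_arrow(): the caller has to say that the first line is a header, because the reader cannot know.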


> [R] Capture error produced when reading in CSV file with headers and using a 
> schema, and add suggestion
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-13887
>                 URL: https://issues.apache.org/jira/browse/ARROW-13887
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Nicola Crane
>            Assignee: Dragoș Moldovan-Grünfeld
>            Priority: Major
>              Labels: good-first-issue
>             Fix For: 7.0.0
>
>
> When reading in a CSV with headers, and also using a schema, we get an error 
> as the code tries to read in the header as a line of data.
> {code:java}
> library(arrow)
> share_data <- tibble::tibble(
>   company = c("AMZN", "GOOG", "BKNG", "TSLA"),
>   price = c(3463.12, 2884.38, 2300.46, 732.39)
> )
> readr::write_csv(share_data, file = "share_data.csv")
> share_schema <- schema(
>   company = utf8(),
>   price = float64()
> )
> read_csv_arrow("share_data.csv", schema = share_schema)
> {code}
> {code:java}
> Error: Invalid: In CSV column #1: CSV conversion error to double: invalid value 'price'
> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:492 decoder_.Decode(data, size, quoted, &value)
> /home/nic2/arrow/cpp/src/arrow/csv/parser.h:84 status
> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:496 parser.VisitColumn(col_index, visit)
> {code}
> The correct thing here would have been for the user to supply the argument 
> {{skip=1}} to {{read_csv_arrow()}} but this is not immediately obvious from 
> the error message returned from C++.  We should capture the error and instead 
> supply our own error message using {{rlang::abort}} which informs the user of 
> the error and then suggests what they can do to prevent it.
>  
> For similar examples (and their associated PRs) see ARROW-11766 and 
> ARROW-12791.
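
A rough sketch of the capture-and-suggest approach described in the issue, written in Python for illustration only; the real change would live in the R package and use rlang::abort. Everything named here (`CsvConversionError`, `read_with_schema`, `read_with_hint`, the wording of the hint) is hypothetical and not part of Arrow.

```python
import csv
import io


class CsvConversionError(ValueError):
    """Hypothetical stand-in for the low-level conversion error."""


def read_with_schema(text, schema, skip=0):
    """Parse CSV text, converting each column with the callables in `schema`.

    `schema` maps column name -> converter, in column order (hypothetical API).
    """
    names = list(schema)
    rows = list(csv.reader(io.StringIO(text)))[skip:]
    records = []
    for row in rows:
        record = {}
        for name, value in zip(names, row):
            try:
                record[name] = schema[name](value)
            except ValueError as err:
                raise CsvConversionError(
                    f"conversion error in column {name!r}: invalid value {value!r}"
                ) from err
        records.append(record)
    return records


def read_with_hint(text, schema, skip=0):
    """Re-raise conversion errors with a suggestion when row 0 looks like a header."""
    try:
        return read_with_schema(text, schema, skip=skip)
    except CsvConversionError as err:
        first_row = next(csv.reader(io.StringIO(text)), [])
        # Only add the hint when the first row repeats the schema's names.
        if skip == 0 and first_row == list(schema):
            raise CsvConversionError(
                f"{err}\nThe first row matches the schema's column names. "
                "If the file has a header row, try skip=1."
            ) from err
        raise
```

With a schema of `{"company": str, "price": float}` and the share data above, `read_with_hint` raises an error that mentions skip=1, and the same call with `skip=1` succeeds. With an all-string schema no error is raised at all and the header comes back as the first record, which is the limitation discussed in the comments.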


