[jira] [Comment Edited] (ARROW-13887) [R] Capture error produced when reading in CSV file with headers and using a schema, and add suggestion

Jira Thu, 14 Oct 2021 06:37:04 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-13887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428796#comment-17428796
 ]


Dragoș Moldovan-Grünfeld edited comment on ARROW-13887 at 10/14/21, 1:36 PM:
-----------------------------------------------------------------------------

Another option might be to detect whether the user is passing col_names and, 
if so, print a message asking them to check that the CSV does not have a 
header row. 
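
A minimal sketch of what such a check might look like, using base R only. 
The names first_row_looks_like_header() and check_col_names_vs_header() are 
hypothetical (they are not part of arrow or readr), and the heuristic is an 
assumption: it only flags a header when the first row has text in a column 
where the second row has a number, and it splits on commas without handling 
quoted fields.

{code:r}
# Hypothetical helper, not part of arrow or readr.
# TRUE when the first line holds a non-numeric value in a column where the
# second line holds a number, which suggests the first line is a header.
first_row_looks_like_header <- function(file, sep = ",") {
  first_two <- readLines(file, n = 2)
  if (length(first_two) < 2) {
    return(FALSE)
  }
  # Naive split: quoted fields containing the separator are not handled.
  fields <- strsplit(first_two, sep, fixed = TRUE)
  if (length(fields[[1]]) != length(fields[[2]])) {
    return(FALSE)
  }
  is_number <- function(x) !is.na(suppressWarnings(as.numeric(x)))
  any(!is_number(fields[[1]]) & is_number(fields[[2]]))
}

# Hypothetical wrapper: only speaks up when the user supplied col_names.
check_col_names_vs_header <- function(file, col_names = NULL) {
  if (!is.null(col_names) && first_row_looks_like_header(file)) {
    message(
      "The first row of the file looks like a header. ",
      "Check that the CSV does not already contain column names."
    )
  }
  invisible(NULL)
}
{code}

When every column is a string, this heuristic cannot tell a header from data 
and returns FALSE, so it would only cover the mixed-type case.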

readr::read_csv() has a similar issue; the difference is that, in the case of 
a mismatch, it coerces the output column to string. 

{code:r}
read_csv("~/Desktop/share_data2.csv", 
         col_names = c("col1", "col2"))
{code}
{code:r}
Rows: 5 Columns: 2
── Column specification ──────────────────────────────
Delimiter: ","
chr (2): col1, col2

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this 
message.
# A tibble: 5 × 2
  col1    col2          
  <chr>   <chr>         
1 company another_string
2 AMZN    AMZN          
3 GOOG    GOOG          
4 BKNG    BKNG          
5 TSLA    TSLA   
{code}

{code:r}
read_csv("~/Desktop/share_data.csv",
         col_names = c("col1", "col2"))
{code}

{code:r}
Rows: 5 Columns: 2
── Column specification ──────────────────────────────
Delimiter: ","
chr (2): col1, col2

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this 
message.
# A tibble: 5 × 2
  col1    col2   
  <chr>   <chr>  
1 company price  
2 AMZN    3463.12
3 GOOG    2884.38
4 BKNG    2300.46
5 TSLA    732.39 
{code}

When we specifically ask for a numeric column, but the file has headers, the 
cell that doesn't match the indicated type is read in as NA and a _warning_ is 
displayed.

{code:r}
read_csv("~/Desktop/share_data.csv",
         col_names = c("col1", "col2"),
         col_types = "cn")
{code}

{code:r}
# A tibble: 5 × 2
  col1     col2
  <chr>   <dbl>
1 company   NA 
2 AMZN    3463.
3 GOOG    2884.
4 BKNG    2300.
5 TSLA     732.
Warning message:
One or more parsing issues, see `problems()` for details 
{code}


> [R] Capture error produced when reading in CSV file with headers and using a 
> schema, and add suggestion
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-13887
>                 URL: https://issues.apache.org/jira/browse/ARROW-13887
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Nicola Crane
>            Assignee: Dragoș Moldovan-Grünfeld
>            Priority: Major
>              Labels: good-first-issue
>             Fix For: 6.0.0
>
>
> When reading in a CSV with headers, and also using a schema, we get an error 
> as the code tries to read in the header as a line of data.
> {code:r}
> library(arrow)
>
> # Write a small CSV that includes a header row
> share_data <- tibble::tibble(
>   company = c("AMZN", "GOOG", "BKNG", "TSLA"),
>   price = c(3463.12, 2884.38, 2300.46, 732.39)
> )
> readr::write_csv(share_data, file = "share_data.csv")
> share_schema <- schema(
>   company = utf8(),
>   price = float64()
> )
> # Fails: the header row is parsed as data against the schema
> read_csv_arrow("share_data.csv", schema = share_schema)
> {code}
> {code:java}
> Error: Invalid: In CSV column #1: CSV conversion error to double: invalid 
> value 'price'
> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:492 decoder_.Decode(data, 
> size, quoted, &value)
> /home/nic2/arrow/cpp/src/arrow/csv/parser.h:84 status
> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:496 
> parser.VisitColumn(col_index, visit) {code}
> The correct thing here would have been for the user to supply the argument 
> {{skip=1}} to {{read_csv_arrow()}} but this is not immediately obvious from 
> the error message returned from C++.  We should capture the error and instead 
> supply our own error message using {{rlang::abort}} which informs the user of 
> the error and then suggests what they can do to prevent it.
>  
> For similar examples (and their associated PRs), see ARROW-11766 and 
> ARROW-12791.
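
The error capture described in the issue might be sketched as follows. This 
is an illustration under stated assumptions, not the fix that was merged: 
read_with_header_hint() is a hypothetical wrapper, read_fun stands in for a 
reader such as read_csv_arrow(), and matching on the text "CSV conversion 
error" is based only on the error message quoted above.

{code:r}
library(rlang)

# Hypothetical wrapper, not part of arrow. `read_fun` is any reader function;
# the remaining arguments are passed through to it unchanged.
read_with_header_hint <- function(read_fun, ...) {
  tryCatch(
    read_fun(...),
    error = function(e) {
      msg <- conditionMessage(e)
      # Assumption: conversion failures contain this text, as in the
      # C++ error message shown in the issue.
      if (grepl("CSV conversion error", msg, fixed = TRUE)) {
        # Keep only the first line and drop the C++ source locations.
        first_line <- strsplit(msg, "\n", fixed = TRUE)[[1]][1]
        abort(c(
          first_line,
          i = "If the file has a header row, supply `skip = 1` with the schema."
        ))
      }
      # Any other error is re-raised untouched.
      stop(e)
    }
  )
}
{code}

With arrow attached, a call might look like 
read_with_header_hint(read_csv_arrow, "share_data.csv", schema = share_schema).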



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
