[ https://issues.apache.org/jira/browse/ARROW-13887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428796#comment-17428796 ]
Dragoș Moldovan-Grünfeld edited comment on ARROW-13887 at 10/14/21, 1:36 PM:
-----------------------------------------------------------------------------

Another option might be to detect if the user is somehow passing col_names and print a message letting them know they should check that the CSV does not have headers.

readr::read_csv() has a similar issue; the difference is that, in the case of a mismatch, it coerces the output column to string.
{code:r}
read_csv("~/Desktop/share_data2.csv", col_names = c("col1", "col2"))
{code}
{code:r}
Rows: 5 Columns: 2
── Column specification ──────────────────────────────
Delimiter: ","
chr (2): col1, col2

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 5 × 2
  col1    col2
  <chr>   <chr>
1 company another_string
2 AMZN    AMZN
3 GOOG    GOOG
4 BKNG    BKNG
5 TSLA    TSLA
{code}
{code:r}
read_csv("~/Desktop/share_data.csv", col_names = c("col1", "col2"))
{code}
{code:r}
Rows: 5 Columns: 2
── Column specification ──────────────────────────────
Delimiter: ","
chr (2): col1, col2

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 5 × 2
  col1    col2
  <chr>   <chr>
1 company price
2 AMZN    3463.12
3 GOOG    2884.38
4 BKNG    2300.46
5 TSLA    732.39
{code}
When we specifically ask for a numeric column but the file has headers, the cell that doesn't match the indicated type is read in as NA and a _warning_ is displayed.
{code:r}
read_csv("~/Desktop/share_data.csv", col_names = c("col1", "col2"), col_types = "cn")
{code}
{code:r}
# A tibble: 5 × 2
  col1     col2
  <chr>   <dbl>
1 company   NA
2 AMZN    3463.
3 GOOG    2884.
4 BKNG    2300.
5 TSLA     732.
Warning message:
One or more parsing issues, see `problems()` for details
{code}

> [R] Capture error produced when reading in CSV file with headers and using a
> schema, and add suggestion
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-13887
>                 URL: https://issues.apache.org/jira/browse/ARROW-13887
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Nicola Crane
>            Assignee: Dragoș Moldovan-Grünfeld
>            Priority: Major
>              Labels: good-first-issue
>             Fix For: 6.0.0
>
>
> When reading in a CSV with headers, and also using a schema, we get an error
> as the code tries to read in the header as a line of data.
> {code:java}
> share_data <- tibble::tibble(
>   company = c("AMZN", "GOOG", "BKNG", "TSLA"),
>   price = c(3463.12, 2884.38, 2300.46, 732.39)
> )
> readr::write_csv(share_data, file = "share_data.csv")
> share_schema <- schema(
>   company = utf8(),
>   price = float64()
> )
> read_csv_arrow("share_data.csv", schema = share_schema)
> {code}
> {code:java}
> Error: Invalid: In CSV column #1: CSV conversion error to double: invalid
> value 'price'
> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:492 decoder_.Decode(data,
> size, quoted, &value)
> /home/nic2/arrow/cpp/src/arrow/csv/parser.h:84 status
> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:496
> parser.VisitColumn(col_index, visit) {code}
> The correct thing here would have been for the user to supply the argument
> {{skip=1}} to {{read_csv_arrow()}} but this is not immediately obvious from
> the error message returned from C++. We should capture the error and instead
> supply our own error message using {{rlang::abort}} which informs the user of
> the error and then suggests what they can do to prevent it.
>
> For similar examples (and their associated PRs) see
> ARROW-11766 and ARROW-12791
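
The approach described in the issue (capture the C++ conversion error and re-raise it with a suggestion) could be sketched roughly as below. This is only an illustration under stated assumptions, not the actual arrow implementation: {{read_with_header_hint()}} and {{fake_reader()}} are hypothetical names, {{fake_reader()}} stands in for {{read_csv_arrow()}} so the sketch runs without arrow installed, and base {{stop()}} is used in place of {{rlang::abort}} to keep it dependency-free.

```r
# Hypothetical wrapper: call a reader function and, if it fails with the
# CSV conversion error shown in the issue, re-raise with a hint appended.
read_with_header_hint <- function(read_fun, ...) {
  tryCatch(
    read_fun(...),
    error = function(e) {
      msg <- conditionMessage(e)
      # Only rewrite errors that look like the C++ conversion failure.
      if (grepl("CSV conversion error", msg, fixed = TRUE)) {
        stop(
          paste0(
            msg, "\n",
            "If the file has a header row and you supplied a schema, ",
            "try `skip = 1` so the header is not read in as data."
          ),
          call. = FALSE
        )
      }
      # Any other error is re-raised unchanged.
      stop(e)
    }
  )
}

# Stand-in for read_csv_arrow() (assumption, for illustration only):
# fails like the example in the issue unless the header row is skipped.
fake_reader <- function(file, skip = 0) {
  if (skip == 0) {
    stop("Invalid: In CSV column #1: CSV conversion error to double: invalid value 'price'")
  }
  data.frame(company = "AMZN", price = 3463.12)
}

# Without skip, the error message now carries the suggestion.
res <- tryCatch(
  read_with_header_hint(fake_reader, "share_data.csv"),
  error = function(e) conditionMessage(e)
)
cat(res, "\n")

# With skip = 1, the call goes through untouched.
ok <- read_with_header_hint(fake_reader, "share_data.csv", skip = 1)
```

Matching on the message text is brittle if the C++ wording changes, so a real fix would probably want a narrower check, for example only adding the hint when a schema was actually supplied.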