[ https://issues.apache.org/jira/browse/ARROW-13887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428958#comment-17428958 ]
Weston Pace edited comment on ARROW-13887 at 10/14/21, 6:15 PM:
----------------------------------------------------------------

> This implies we cannot rely on capturing the C++ error message and offering a
> more informative option in R as sometimes the error might not be triggered
> (in the case of a CSV where all the columns are strings / characters).
>
> The solution might be to somehow assess whether the CSV file has headers or
> not - which I do not think it is possible.

I agree this case is probably inevitable. I'm not sure we should stress too much about it. FYI, in pandas:

{noformat}
>>> x = "books,authors\ncirce,miller"
>>> pd.read_csv(io.StringIO(x))
   books authors
0  circe  miller
>>> pd.read_csv(io.StringIO(x), names=['books', 'authors'])
   books  authors
0  books  authors
1  circe   miller
{noformat}

Pandas' read_csv method offers up this long, complicated parameter as a solution:

{noformat}
header - int, list of int, default ‘infer’
    Row number(s) to use as the column names, and the start of the data.
    Default behavior is to infer the column names: if no names are passed the
    behavior is identical to header=0 and column names are inferred from the
    first line of the file, if column names are passed explicitly then the
    behavior is identical to header=None. Explicitly pass header=0 to be able
    to replace existing names. The header can be a list of integers that
    specify row locations for a multi-index on the columns e.g. [0,1,3].
    Intervening rows that are not specified will be skipped (e.g. 2 in this
    example is skipped). Note that this parameter ignores commented lines and
    empty lines if skip_blank_lines=True, so header=0 denotes the first line
    of data rather than the first line of the file.
{noformat}

> [R] Capture error produced when reading in CSV file with headers and using a
> schema, and add suggestion
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-13887
>                 URL: https://issues.apache.org/jira/browse/ARROW-13887
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Nicola Crane
>            Assignee: Dragoș Moldovan-Grünfeld
>            Priority: Major
>              Labels: good-first-issue
>             Fix For: 7.0.0
>
>
> When reading in a CSV with headers, and also using a schema, we get an error
> as the code tries to read in the header as a line of data.
> {code:java}
> share_data <- tibble::tibble(
>   company = c("AMZN", "GOOG", "BKNG", "TSLA"),
>   price = c(3463.12, 2884.38, 2300.46, 732.39)
> )
> readr::write_csv(share_data, file = "share_data.csv")
> share_schema <- schema(
>   company = utf8(),
>   price = float64()
> )
> read_csv_arrow("share_data.csv", schema = share_schema)
> {code}
> {code:java}
> Error: Invalid: In CSV column #1: CSV conversion error to double: invalid value 'price'
> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:492  decoder_.Decode(data, size, quoted, &value)
> /home/nic2/arrow/cpp/src/arrow/csv/parser.h:84  status
> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:496  parser.VisitColumn(col_index, visit)
> {code}
> The correct thing here would have been for the user to supply the argument
> {{skip=1}} to {{read_csv_arrow()}} but this is not immediately obvious from
> the error message returned from C++. We should capture the error and instead
> supply our own error message using {{rlang::abort}} which informs the user of
> the error and then suggests what they can do to prevent it.
>
> For similar examples (and their associated PRs) see
> ARROW-11766, and ARROW-12791

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
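
For illustration only, here is a rough sketch of the "capture the error and add a suggestion" idea from the issue, and of why it cannot catch the all-strings case raised in the comment. It is plain standard-library Python, not the Arrow R or C++ code; read_with_schema and CsvSchemaError are hypothetical names, and the schema is assumed to be a simple mapping of column name to converter.

```python
import csv
import io


class CsvSchemaError(ValueError):
    """Hypothetical error type that carries a user-facing suggestion."""


def read_with_schema(text, schema, skip=0):
    """Parse CSV text, converting each column with the callable in `schema`.

    `schema` maps column name -> converter (for example str or float).
    This is an assumed stand-in for a typed schema, not Arrow's API.
    """
    rows = list(csv.reader(io.StringIO(text)))[skip:]
    names = list(schema)
    out = []
    for row_number, row in enumerate(rows):
        try:
            out.append({n: schema[n](v) for n, v in zip(names, row)})
        except ValueError as err:
            # Only add the hint when the failing row is the first one read
            # and its values are exactly the schema's field names.
            if row_number == 0 and row == names:
                raise CsvSchemaError(
                    f"{err}. The first row matches the schema's field names; "
                    "if the file has a header row, try skip=1."
                ) from err
            raise
    return out


data = "company,price\nAMZN,3463.12\nGOOG,2884.38\n"
schema = {"company": str, "price": float}

# Header read as data: the conversion fails and the hint is attached.
try:
    read_with_schema(data, schema)
except CsvSchemaError as err:
    message = str(err)

# Skipping the header row makes the same call succeed.
fixed = read_with_schema(data, schema, skip=1)

# All-string schema: nothing fails to convert, so no error is raised and
# the header row is silently returned as the first row of data.
silent = read_with_schema(
    "books,authors\ncirce,miller\n", {"books": str, "authors": str}
)
```

The last call is the case discussed in the comment: with only string columns there is no conversion error to capture, so a hint of this kind never fires.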