[ 
https://issues.apache.org/jira/browse/ARROW-5747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-5747:
-----------------------------------
    Description: 
While working on ARROW-5500, I found a number of issues around the CSV parse 
option {{header_rows}}:
 * If header_rows is 0, [the reader 
errors|https://github.com/apache/arrow/blob/8b0318a11bba2aa2cf39bff245ff916a3283d372/cpp/src/arrow/csv/reader.cc#L150]
 * It's not possible to supply your own column names, as [this 
TODO|https://github.com/apache/arrow/blob/8b0318a11bba2aa2cf39bff245ff916a3283d372/cpp/src/arrow/csv/reader.cc#L149]
 notes. ARROW-4912 allows renaming columns after reading in, which _maybe_ is 
enough as long as header_rows == 0 doesn't error, but then you can't naturally 
specify column types in the convert options because that takes a map of column 
name to type.
 * If header_rows is > 1, every cell gets turned into a column name, so if 
header_rows == 2, you get twice the number of column names as columns. This 
doesn't error, but it leads to unexpected results.
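
To make the last bullet concrete, here is a toy model of the behavior described. It is only an illustration under that description, not the reader's actual code; {{NaiveColumnNames}} and {{Row}} are hypothetical names made up for this sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

using Row = std::vector<std::string>;

// Toy model only (hypothetical helper, not Arrow code): every cell in the
// first `header_rows` rows is appended to the list of column names.
std::vector<std::string> NaiveColumnNames(const std::vector<Row>& rows,
                                          std::size_t header_rows) {
  std::vector<std::string> names;
  for (std::size_t i = 0; i < header_rows && i < rows.size(); ++i) {
    names.insert(names.end(), rows[i].begin(), rows[i].end());
  }
  return names;
}
```

With a three-column file and header_rows == 2, this model yields six names for three columns, which is the mismatch described above.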

IMO a better interface would be to have a {{skip_rows}} argument to let you 
ignore a large header, and a {{column_names}} argument that, if provided, gives 
the column names. If not provided, the first row after {{skip_rows}} is taken 
to be the column names. If it were also possible for {{column_names}} to take a 
{{false}} or {{null}} argument, then we could support the case of 
autogenerating names when none are provided and there's no header row. 
Alternatively, we could use a boolean {{header}} argument to govern whether the 
first (non-skipped) row should be interpreted as column names. (For reference, 
R's 
[readr|https://github.com/tidyverse/readr/blob/master/R/read_delim.R#L14-L27] 
takes TRUE/FALSE/array of strings in one arg; the base 
[read.csv|https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html]
 uses separate args for header and col.names. Both have a {{skip}} argument.)
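
A minimal sketch of how the proposed options could resolve column names, assuming the variant with a separate boolean {{header}} argument. Everything here is hypothetical: {{HeaderOptions}}, {{ResolvedHeader}} and {{ResolveHeader}} are not existing Arrow APIs, and the autogenerated names ("f0", "f1", ...) are just a placeholder scheme.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

using Row = std::vector<std::string>;

// Hypothetical options mirroring the proposal; not Arrow's actual API.
struct HeaderOptions {
  std::size_t skip_rows = 0;  // rows to ignore at the top of the file
  // Explicit column names; std::nullopt means "not provided".
  std::optional<std::vector<std::string>> column_names;
  // Only consulted when column_names is not provided: whether the first
  // non-skipped row should be interpreted as column names.
  bool header = true;
};

struct ResolvedHeader {
  std::vector<std::string> names;
  std::size_t first_data_row = 0;  // index of the first row parsed as data
};

ResolvedHeader ResolveHeader(const std::vector<Row>& rows,
                             const HeaderOptions& opts) {
  ResolvedHeader out;
  out.first_data_row = opts.skip_rows;
  if (opts.column_names) {
    // Names supplied by the user: every non-skipped row is data.
    out.names = *opts.column_names;
    return out;
  }
  if (opts.skip_rows >= rows.size()) {
    return out;  // nothing left to read, so no names can be derived
  }
  const Row& first = rows[opts.skip_rows];
  if (opts.header) {
    // First non-skipped row holds the names; data starts on the next row.
    out.names = first;
    out.first_data_row = opts.skip_rows + 1;
  } else {
    // No header row: autogenerate one placeholder name per column.
    for (std::size_t i = 0; i < first.size(); ++i) {
      out.names.push_back("f" + std::to_string(i));
    }
  }
  return out;
}
```

Because the names are known before any data row is converted in all three cases, a map of column name to type in the convert options would still be usable.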

I don't think there's value in trying to be clever about multirow headers and 
converting those to column names; if there's meaningful information in a tall 
header, let the user parse it themselves.

  was:
While working on ARROW-5500, I found a number of issues around the CSV parse 
options {{header_rows}}:
 * If header_rows is 0, [the reader 
errors|https://github.com/apache/arrow/blob/8b0318a11bba2aa2cf39bff245ff916a3283d372/cpp/src/arrow/csv/reader.cc#L150]
 * It's not possible to supply your own column names, as [this 
TODO|https://github.com/apache/arrow/blob/8b0318a11bba2aa2cf39bff245ff916a3283d372/cpp/src/arrow/csv/reader.cc#L149]
 notes. ARROW-4912 allows renaming columns after reading in, which _maybe_ is 
enough as long as header_rows == 0 doesn't error, but then you can't naturally 
specify column types in the convert options because that takes a map of column 
name to type.
 * If header_rows is > 1, every cell gets turned into a column name, so if 
header_rows == 2, you get twice the number of column names as columns. This 
doesn't error, but it leads to unexpected results.

IMO a better interface would be to have a {{skip_rows}} argument to let you 
ignore a large header, and a {{column_names}} argument that, if provided, gives 
the column names. If not provided, the first row after {{skip_rows}} is taken 
to be the column names. I don't think there's value in trying to be clever 
about multirow headers and converting those to column names; if there's 
meaningful information in a tall header, let the user parse it themselves.


> [C++] Better column name and header support in CSV reader
> ---------------------------------------------------------
>
>                 Key: ARROW-5747
>                 URL: https://issues.apache.org/jira/browse/ARROW-5747
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Neal Richardson
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
