[ https://issues.apache.org/jira/browse/ARROW-5747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neal Richardson updated ARROW-5747:
-----------------------------------
    Description:
While working on ARROW-5500, I found a number of issues around the CSV parse options {{header_rows}}:

* If header_rows is 0, [the reader errors|https://github.com/apache/arrow/blob/8b0318a11bba2aa2cf39bff245ff916a3283d372/cpp/src/arrow/csv/reader.cc#L150]
* It's not possible to supply your own column names, as [this TODO|https://github.com/apache/arrow/blob/8b0318a11bba2aa2cf39bff245ff916a3283d372/cpp/src/arrow/csv/reader.cc#L149] notes. ARROW-4912 allows renaming columns after reading in, which _maybe_ is enough as long as header_rows == 0 doesn't error, but then you can't naturally specify column types in the convert options because that takes a map of column name to type.
* If header_rows is > 1, every cell gets turned into a column name, so if header_rows == 2, you get twice the number of column names as columns. This doesn't error, but it leads to unexpected results.

IMO a better interface would be to have a {{skip_rows}} argument to let you ignore a large header, and a {{column_names}} argument that, if provided, gives the column names. If not provided, the first row after {{skip_rows}} is taken to be the column names. If it were also possible for {{column_names}} to take a {{false}} or {{null}} argument, then we could support the case of autogenerating names when none are provided and there's no header row. Alternatively, we could use a boolean {{header}} argument to govern whether the first (non-skipped) row should be interpreted as column names.

(For reference, R's [readr|https://github.com/tidyverse/readr/blob/master/R/read_delim.R#L14-L27] takes TRUE/FALSE/array of strings in one arg; the base [read.csv|https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html] uses separate args for header and col.names. Both have a {{skip}} argument.)
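
To make the proposed semantics concrete, here is a minimal sketch in plain Python using only the standard library. It is an illustration, not Arrow's API: the function name {{read_csv_sketch}}, the choice that a supplied {{column_names}} means every non-skipped row is data, and the {{f0}}, {{f1}}, ... pattern for autogenerated names are all assumptions made for this example. It follows the "boolean {{header}}" variant described above.

```python
import csv
import io


def read_csv_sketch(text, skip_rows=0, column_names=None, header=True):
    """Hypothetical reader illustrating the proposed options (not Arrow's API)."""
    # Drop the first `skip_rows` rows, e.g. a tall free-form header.
    rows = list(csv.reader(io.StringIO(text)))[skip_rows:]

    if column_names is not None:
        # Caller-supplied names: every remaining row is treated as data
        # (an assumption of this sketch).
        return list(column_names), rows

    if header:
        # The first non-skipped row supplies the column names.
        if not rows:
            raise ValueError("no header row left after skipping")
        return rows[0], rows[1:]

    # No header row and no names given: autogenerate placeholder names.
    # The "f<i>" pattern is an assumption of this sketch.
    width = len(rows[0]) if rows else 0
    return [f"f{i}" for i in range(width)], rows


sample = "generated by tool\nexported 2019\na,b\n1,2\n3,4\n"

# Skip the two junk lines, then take names from the next row.
names, data = read_csv_sketch(sample, skip_rows=2)

# Skip the junk and the header row, and supply names explicitly.
names2, data2 = read_csv_sketch(sample, skip_rows=3, column_names=["x", "y"])

# Skip everything above the data and let names be autogenerated.
names3, data3 = read_csv_sketch(sample, skip_rows=3, header=False)
```

Because the names are known before any data row is converted in all three cases, a map of column name to type in the convert options could still be applied naturally, which is the concern raised in the second bullet above.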

I don't think there's value in trying to be clever about multirow headers and converting those to column names; if there's meaningful information in a tall header, let the user parse it themselves.

  was:
While working on ARROW-5500, I found a number of issues around the CSV parse options {{header_rows}}:

* If header_rows is 0, [the reader errors|https://github.com/apache/arrow/blob/8b0318a11bba2aa2cf39bff245ff916a3283d372/cpp/src/arrow/csv/reader.cc#L150]
* It's not possible to supply your own column names, as [this TODO|https://github.com/apache/arrow/blob/8b0318a11bba2aa2cf39bff245ff916a3283d372/cpp/src/arrow/csv/reader.cc#L149] notes. ARROW-4912 allows renaming columns after reading in, which _maybe_ is enough as long as header_rows == 0 doesn't error, but then you can't naturally specify column types in the convert options because that takes a map of column name to type.
* If header_rows is > 1, every cell gets turned into a column name, so if header_rows == 2, you get twice the number of column names as columns. This doesn't error, but it leads to unexpected results.

IMO a better interface would be to have a {{skip_rows}} argument to let you ignore a large header, and a {{column_names}} argument that, if provided, gives the column names. If not provided, the first row after {{skip_rows}} is taken to be the column names.

I don't think there's value in trying to be clever about multirow headers and converting those to column names; if there's meaningful information in a tall header, let the user parse it themselves.

> [C++] Better column name and header support in CSV reader
> ---------------------------------------------------------
>
>                 Key: ARROW-5747
>                 URL: https://issues.apache.org/jira/browse/ARROW-5747
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Neal Richardson
>            Priority: Major