[jira] [Commented] (DRILL-5498) CSV text reader does not properly handle duplicate header names

Paul Rogers (JIRA) Wed, 10 May 2017 14:30:29 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-5498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005476#comment-16005476
 ]


Paul Rogers commented on DRILL-5498:
------------------------------------

Proposed solution:

* Avoid failing queries unless absolutely necessary. Try to "fix up" headers 
where possible.
* Ignore leading and trailing whitespace in a header.
* If the header line contains no headers at all, fail the query.
* If any header is empty, make up a header of the form "column_x" where x is 
the column position.
* If any header contains invalid SQL symbol characters, replace the character 
with "_".
* If the first character of a header is invalid, replace that character with 
"col_" (since underscore is not valid as the first character).
* If header j duplicates header i, i < j, append "_x" to header j, where x is 
2, 3, 4, ... until a unique name is found.
* Allow Unicode characters in headers.

For example:

{code}
Headers: a, b, c
Column names: a, b, c

Headers: (none)
Produce an error

Headers:  ,  , (blank headers)
Column names: column_1, column_2, column_3

Headers: _a, 99, h!
Column names: col_a, col_99, h_

Headers: a, a, a
Column names: a, a_2, a_3
{code}

Headers that worked in the prior version continue to work. Headers that failed 
in the prior version now work.
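
The rules above can be sketched in a few lines. This is only an illustration in Python, not Drill's actual reader code; the function name `fix_up_headers` is hypothetical. It assumes that "valid SQL symbol characters" means Unicode letters, digits and underscore, that duplicate detection is case-sensitive, and that a leading digit gets "col_" prefixed rather than replaced (which is what the `99` -> `col_99` example implies).

```python
def fix_up_headers(raw_headers):
    """Map raw CSV header fields to unique, SQL-friendly column names.

    Hypothetical helper illustrating the proposed rules; not Drill code.
    """
    # No headers at all: the one case where the query fails.
    if not raw_headers:
        raise ValueError("CSV file has no headers")

    names = []
    seen = set()
    for position, raw in enumerate(raw_headers, start=1):
        # Ignore leading and trailing whitespace.
        name = raw.strip()
        if not name:
            # Empty header: make up a name from the 1-based position.
            name = f"column_{position}"
        else:
            # Assumption: letters (including Unicode), digits and "_"
            # are valid; anything else becomes "_".
            name = "".join(
                ch if ch.isalnum() or ch == "_" else "_" for ch in name
            )
            if name[0] == "_":
                # Leading underscore (original or substituted) is
                # replaced with "col_".
                name = "col_" + name[1:]
            elif not name[0].isalpha():
                # Assumption: a leading digit keeps its value and gets
                # a "col_" prefix, matching "99" -> "col_99".
                name = "col_" + name

        # Duplicate of an earlier name: append _2, _3, ... until unique.
        base, suffix = name, 2
        while name in seen:
            name = f"{base}_{suffix}"
            suffix += 1

        seen.add(name)
        names.append(name)
    return names
```

Under these assumptions, the `h,h,h` header line from the issue description would map to `h`, `h_2`, `h_3`, so all three columns survive.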

> CSV text reader does not properly handle duplicate header names
> ---------------------------------------------------------------
>
>                 Key: DRILL-5498
>                 URL: https://issues.apache.org/jira/browse/DRILL-5498
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.8.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> Consider the following CSV file:
> {code}
> h,h,h
> a,b,c
> d,e,f
> {code}
> Parse this with the CSV storage plugin configured to parse headers. The result:
> {code}
> 2 row(s):
> h
> c
> f
> {code}
> Expected a runtime error for the duplicate column names, or automatic 
> "uniqification" of the names. Certainly did not expect the first two columns 
> to be dropped.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
