[jira] [Updated] (DRILL-5662) Compliant text reader (CSV) opens, closes, reopens file with headers

Paul Rogers (JIRA) Thu, 06 Jul 2017 09:50:24 -0700

     [ https://issues.apache.org/jira/browse/DRILL-5662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Paul Rogers updated DRILL-5662:
-------------------------------
    Description: 
The "compliant" (CSV) reader can optionally read headers from a file. To do so, 
the reader:

* Opens the input stream
* Reads headers
* Closes the input stream
* Opens the input stream
* Reads data (skipping headers)
* Closes the input stream

While the above certainly works, the close/open cycle is unnecessary. Many 
CSV readers simply read the header and then continue reading data from the 
same stream. Drill should do the same.
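
For illustration, a minimal sketch of the single-open pattern using only the JDK. The class and method names are hypothetical and are not Drill's reader classes; the comma split is deliberately naive and ignores quoting and escapes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not Drill's actual reader classes.
class SingleOpenCsvSketch {

    /** Reads the header, then the data rows, from one already-open stream. */
    static String[] readAll(Reader source, List<String[]> rowsOut) throws IOException {
        // try-with-resources closes the stream exactly once, after the data.
        try (BufferedReader in = new BufferedReader(source)) {
            String headerLine = in.readLine();
            if (headerLine == null) {
                return new String[0]; // empty file: no header, no data
            }
            // Naive split: real CSV needs quote and escape handling.
            String[] header = headerLine.split(",", -1);
            String line;
            // The same stream is now positioned just past the header.
            while ((line = in.readLine()) != null) {
                rowsOut.add(line.split(",", -1));
            }
            return header;
        }
    }

    public static void main(String[] args) throws IOException {
        List<String[]> rows = new ArrayList<>();
        String[] header = readAll(new StringReader("a,b\n1,2\n3,4\n"), rows);
        System.out.println(String.join("|", header) + " rows=" + rows.size());
    }
}
```

The point is only the ordering: one open, header, data, one close.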

In fact, Drill has historically coded its own header scanner. The first was 
badly broken, but DRILL-5498 improved the parsing (though not the file handling).

Given that Drill's "compliant" text reader is based on the UniVocity library, 
and that library can parse headers, we should probably just reuse that existing 
code, which has very likely evolved to handle the header variations seen in the 
wild.

Text files allow "splits", so there are two cases here:

* A small file (or the first split) in which the header is contiguous with the 
data. This is the case we should modify the code to support.
* A large file where the reader reads the second or a subsequent split. Here 
the header is not contiguous with the data, so the current behavior of opening 
the file twice is perhaps the best solution (since the splits are probably on 
separate nodes).
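
The two cases might be sketched as follows. This is only an illustration under stated assumptions: the "file" is an in-memory byte array, splits are assumed to start on a line boundary, and every name here (SplitHeaderSketch, Result, read) is hypothetical rather than taken from Drill.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; none of these names are Drill classes.
class SplitHeaderSketch {

    static class Result {
        String header;                         // header line, or null if the file is empty
        List<String> rows = new ArrayList<>(); // data lines that start inside the split
        int opens;                             // how many times the "file" was opened
    }

    /** Reads one split [start, start + length); assumes start is on a line boundary. */
    static Result read(byte[] file, long start, long length) throws IOException {
        Result r = new Result();
        long end = start + length;
        if (start == 0) {
            // First split: header and data are contiguous, so one open suffices.
            try (InputStream in = open(file, r)) {
                long[] pos = {0};
                r.header = readLine(in, pos);
                readRows(in, pos, end, r.rows);
            }
        } else {
            // Later split: the header sits at offset 0, away from the data,
            // so a separate open for the header is kept.
            try (InputStream in = open(file, r)) {
                r.header = readLine(in, new long[] {0});
            }
            try (InputStream in = open(file, r)) {
                long[] pos = {in.skip(start)};
                readRows(in, pos, end, r.rows);
            }
        }
        return r;
    }

    static InputStream open(byte[] file, Result r) {
        r.opens++;
        return new ByteArrayInputStream(file);
    }

    /** Reads lines that begin before 'end'. */
    static void readRows(InputStream in, long[] pos, long end, List<String> out)
            throws IOException {
        String line;
        while (pos[0] < end && (line = readLine(in, pos)) != null) {
            out.add(line);
        }
    }

    /** Reads bytes up to '\n', advancing pos[0]; returns null at end of input. */
    static String readLine(InputStream in, long[] pos) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        boolean any = false;
        int b;
        while ((b = in.read()) != -1) {
            pos[0]++;
            any = true;
            if (b == '\n') {
                break;
            }
            buf.write(b);
        }
        return any ? new String(buf.toByteArray(), StandardCharsets.UTF_8) : null;
    }

    public static void main(String[] args) throws IOException {
        byte[] file = "a,b\n1,2\n3,4\n5,6\n".getBytes(StandardCharsets.UTF_8);
        Result first = read(file, 0, 8);
        Result second = read(file, 8, 8);
        System.out.println(first.rows + " opens=" + first.opens);
        System.out.println(second.rows + " opens=" + second.opens);
    }
}
```

Only the first-split branch changes relative to the current behavior; the later-split branch still opens the file twice.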

  was:
The "compliant" (CSV) reader can optionally read headers from a file. To do so, 
the reader:

* Opens the input stream
* Reads headers
* Closes the input stream
* Opens the input stream
* Reads data (skipping headers)
* Closes the input stream

While the above certainly works, it has an unnecessary close/open cycle. Many 
CSV readers simply read the header and use the same stream to read data. Drill 
should do so also.

In fact, Drill has historically coded its own headers scanner. The first was 
badly broken, but DRILL-5498 improved the parsing (though not file handling.)

Given that Drill's "compliant" text reader is based on the UniVocity library, 
and that library can parse headers, we should probably just reuse that existing 
code which has, very likely, evolved to handle the header usages seen in the 
wild.


> Compliant text reader (CSV) opens, closes, reopens file with headers
> --------------------------------------------------------------------
>
>                 Key: DRILL-5662
>                 URL: https://issues.apache.org/jira/browse/DRILL-5662
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Minor
>             Fix For: Future
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
