Paul Rogers created DRILL-5914:
----------------------------------
Summary: CSV (text) reader fails to parse quoted newlines in
trailing fields
Key: DRILL-5914
URL: https://issues.apache.org/jira/browse/DRILL-5914
Project: Apache Drill
Issue Type: Bug
Affects Versions: 1.11.0
Reporter: Paul Rogers
Assignee: Paul Rogers
Consider the existing `TestCsvHeader.testCountOnCsvWithHeader()` unit test. The
input file is as follows:
```
Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
```
Note the newline inside the description in the last record.
If we do a `SELECT *` query, the file is parsed fine; we get 4 records.
If we do a `SELECT Year, Model` query, the CSV reader uses a special trick: it
short-circuits reads on the three columns that are not wanted:
```
TextReader.parseRecord() {
  ...
  if (earlyTerm) {
    if (ch != newLine) {
      input.skipLines(1); // <-- skip lines
    }
    break;
  }
```
The `skipLines()` method skips forward in the file, discarding characters until
it hits a newline:
```
do {
  nextChar();
} while (lineCount < expectedLineCount);
```
Note that this code handles individual characters; it is not aware of per-field
semantics. That is, unlike the higher-level parser methods, the `nextChar()`
method does not consider newlines inside of quoted fields to be special.
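To illustrate the difference, here is a minimal, self-contained sketch that contrasts a newline-only skip with a quote-aware one. It is not Drill's code and not a proposed patch: the class and method names (`QuoteAwareSkip`, `skipLineNaive`, `skipRecord`) are hypothetical, and it assumes `"` is the quote character, `\n` the line delimiter, and that embedded quotes are escaped by doubling.

```java
final class QuoteAwareSkip {
    /** Naive skip (hypothetical): stops after the first newline, quoted or not. */
    static int skipLineNaive(String input, int pos) {
        int nl = input.indexOf('\n', pos);
        return nl < 0 ? input.length() : nl + 1;
    }

    /**
     * Quote-aware skip (hypothetical): a newline ends the record only when it
     * appears outside a quoted field. A doubled quote ("") toggles the flag
     * twice, so it leaves the quoted state unchanged.
     */
    static int skipRecord(String input, int pos) {
        boolean inQuotes = false;
        while (pos < input.length()) {
            char ch = input.charAt(pos++);
            if (ch == '"') {
                inQuotes = !inQuotes;
            } else if (ch == '\n' && !inQuotes) {
                break;
            }
        }
        return pos;
    }

    public static void main(String[] args) {
        // The last record of the test file, including its embedded newline.
        String record =
            "1996,Jeep,Grand Cherokee,\"MUST SELL!\nair, moon roof, loaded\",4799.00\n";
        // The naive skip stops in the middle of the Description field.
        System.out.println(skipLineNaive(record, 0) == record.length()); // false
        // The quote-aware skip consumes the whole record.
        System.out.println(skipRecord(record, 0) == record.length());    // true
    }
}
```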
The problem is most visible in a `SELECT COUNT(*)` style query, which skips all
fields: the reader counts the input as five records, not four.
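The miscount can be reproduced outside Drill with a small sketch that counts the data rows of the file above in two ways. The class `RecordCountDemo` and its methods are hypothetical illustrations rather than Drill APIs, and the header line is left out for simplicity.

```java
final class RecordCountDemo {
    // Data rows of the test file (header omitted), with the quoted newline intact.
    static final String DATA =
          "1997,Ford,E350,\"ac, abs, moon\",3000.00\n"
        + "1999,Chevy,\"Venture \"\"Extended Edition\"\"\",\"\",4900.00\n"
        + "1999,Chevy,\"Venture \"\"Extended Edition, Very Large\"\"\",,5000.00\n"
        + "1996,Jeep,Grand Cherokee,\"MUST SELL!\n"
        + "air, moon roof, loaded\",4799.00\n";

    /** Counts every newline, as a character-level skip effectively does. */
    static int countLines(String input) {
        int count = 0;
        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) == '\n') {
                count++;
            }
        }
        return count;
    }

    /** Counts only newlines that fall outside double-quoted fields. */
    static int countRecords(String input) {
        int count = 0;
        boolean inQuotes = false;
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (ch == '"') {
                inQuotes = !inQuotes;
            } else if (ch == '\n' && !inQuotes) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countLines(DATA));   // prints 5
        System.out.println(countRecords(DATA)); // prints 4
    }
}
```

The first count matches the incorrect result described above; the second matches what `SELECT *` returns.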