Paul Rogers created DRILL-5914:
----------------------------------

             Summary: CSV (text) reader fails to parse quoted newlines in trailing fields
                 Key: DRILL-5914
                 URL: https://issues.apache.org/jira/browse/DRILL-5914
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.11.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers

Consider the existing `TestCsvHeader.testCountOnCsvWithHeader()` unit test. The input file is as follows:

```
Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
```

Note the newline inside the description in the last record.

If we do a `SELECT *` query, the file is parsed fine; we get 4 records.

If we do a `SELECT Year, Model` query, the CSV reader uses a special trick: it short-circuits reads on the three columns that are not wanted:

```
TextReader.parseRecord() {
  ...
  if (earlyTerm) {
    if (ch != newLine) {
      input.skipLines(1); // <-- skip lines
    }
    break;
  }
  ...
}
```

The `skipLines()` method skips forward in the file, discarding characters until it hits a newline:

```
do {
  nextChar();
} while (lineCount < expectedLineCount);
```

Note that this code handles individual characters; it is not aware of per-field semantics. That is, unlike the higher-level parser methods, the `nextChar()` method does not treat newlines inside quoted fields as special, so a quoted newline ends the skip early and the rest of the record is read as a new line.

This problem shows up acutely in a `SELECT COUNT(*)` style query that skips all fields; as a result, the input is counted as five lines, not four.
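
The difference between the two counts can be reproduced outside Drill with a minimal sketch. This is not Drill's code: the class `QuotedLineSkipDemo` and the methods `skipRecord()` and `countDataRows()` are hypothetical names used only for illustration. The sketch assumes `"` is the quote character, `""` is the escaped quote, and `\n` is the record terminator. With `honorQuotes` set to `false` the skip stops at every newline, as the character-level skip described above does; with `true` it ignores newlines that fall inside a quoted field.

```java
// Hypothetical stand-alone illustration; none of these names exist in Drill.
class QuotedLineSkipDemo {

  // The test file from this report, including the quoted newline.
  static final String SAMPLE =
      "Year,Make,Model,Description,Price\n"
      + "1997,Ford,E350,\"ac, abs, moon\",3000.00\n"
      + "1999,Chevy,\"Venture \"\"Extended Edition\"\"\",\"\",4900.00\n"
      + "1999,Chevy,\"Venture \"\"Extended Edition, Very Large\"\"\",,5000.00\n"
      + "1996,Jeep,Grand Cherokee,\"MUST SELL!\nair, moon roof, loaded\",4799.00\n";

  /**
   * Skips forward from pos to just past the next record terminator and
   * returns the new position. When honorQuotes is false, every newline
   * ends the skip, even one inside a quoted field.
   */
  static int skipRecord(String text, int pos, boolean honorQuotes) {
    boolean inQuotes = false;
    while (pos < text.length()) {
      char ch = text.charAt(pos++);
      if (ch == '"') {
        // An escaped quote ("") toggles twice, leaving the state unchanged.
        inQuotes = !inQuotes;
      } else if (ch == '\n' && !(honorQuotes && inQuotes)) {
        break;
      }
    }
    return pos;
  }

  /** Counts rows after the header by skipping one record at a time. */
  static int countDataRows(String text, boolean honorQuotes) {
    int pos = 0;
    int records = 0;
    while (pos < text.length()) {
      pos = skipRecord(text, pos, honorQuotes);
      records++;
    }
    return Math.max(0, records - 1); // the first record is the header
  }

  public static void main(String[] args) {
    System.out.println("character-level skip: " + countDataRows(SAMPLE, false)); // 5
    System.out.println("quote-aware skip:     " + countDataRows(SAMPLE, true));  // 4
  }
}
```

The sketch tracks only a single in-quotes flag, which is enough to decide whether a newline terminates the record. It does not handle `\r\n` terminators, configurable quote or escape characters, or malformed quoting, all of which a real fix would need to consider.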