[ https://issues.apache.org/jira/browse/DRILL-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Rogers resolved DRILL-5914.
--------------------------------
    Resolution: Fixed

This issue was fixed as part of the "Compliant text reader V3" project. The test cited in the description now correctly reports 4 lines for the {{COUNT(*)}} query.

> CSV (text) reader fails to parse quoted newlines in trailing fields
> -------------------------------------------------------------------
>
>                 Key: DRILL-5914
>                 URL: https://issues.apache.org/jira/browse/DRILL-5914
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.11.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>
> Consider the existing `TestCsvHeader.testCountOnCsvWithHeader()` unit test.
> The input file is as follows:
> {noformat}
> Year,Make,Model,Description,Price
> 1997,Ford,E350,"ac, abs, moon",3000.00
> 1999,Chevy,"Venture ""Extended Edition""","",4900.00
> 1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
> 1996,Jeep,Grand Cherokee,"MUST SELL!
> air, moon roof, loaded",4799.00
> {noformat}
> Note the newline inside the description in the last record.
> If we do a `SELECT *` query, the file is parsed fine; we get 4 records.
> If we do a `SELECT Year, Model` query, the CSV reader uses a special trick:
> it short-circuits reads on the three columns that are not wanted:
> {code}
> TextReader.parseRecord() {
>   ...
>     if (earlyTerm) {
>       if (ch != newLine) {
>         input.skipLines(1); // <-- skip lines
>       }
>       break;
>     }
> {code}
> The `skipLines()` method skips forward in the file, discarding characters
> until it hits a newline:
> {code}
>   do {
>     nextChar();
>   } while (lineCount < expectedLineCount);
> {code}
> Note that this code handles individual characters; it is not aware of
> per-field semantics. That is, unlike the higher-level parser methods, the
> `nextChar()` method does not consider newlines inside of quoted fields to be
> special.
> This problem shows up acutely in a `SELECT COUNT(*)` style query that skips
> all fields; the result is we count the input as five lines, not four.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
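
The difference between a character-level line skip and a quote-aware record skip can be illustrated with a small standalone sketch. This is not Drill's code: the class `QuoteAwareSkip` and its methods are hypothetical names chosen for illustration, and the sketch assumes `"` as the quote character, doubled quotes as the escape, and `\n` as the line terminator. It uses the sample file from the description.

```java
public class QuoteAwareSkip {

    // The sample file from the description; the last record has a newline
    // inside its quoted Description field.
    static final String SAMPLE =
        "Year,Make,Model,Description,Price\n"
        + "1997,Ford,E350,\"ac, abs, moon\",3000.00\n"
        + "1999,Chevy,\"Venture \"\"Extended Edition\"\"\",\"\",4900.00\n"
        + "1999,Chevy,\"Venture \"\"Extended Edition, Very Large\"\"\",,5000.00\n"
        + "1996,Jeep,Grand Cherokee,\"MUST SELL!\n"
        + "air, moon roof, loaded\",4799.00\n";

    /** Character-level skip: every newline ends a line, quoted or not. */
    static int skipLineNaive(String text, int pos) {
        while (pos < text.length()) {
            if (text.charAt(pos++) == '\n') {
                break;
            }
        }
        return pos;
    }

    /** Quote-aware skip: a newline ends the record only outside quotes. */
    static int skipRecordQuoteAware(String text, int pos) {
        boolean inQuotes = false;
        while (pos < text.length()) {
            char ch = text.charAt(pos++);
            if (ch == '"') {
                // An escaped quote ("") toggles twice, leaving the state unchanged.
                inQuotes = !inQuotes;
            } else if (ch == '\n' && !inQuotes) {
                break;
            }
        }
        return pos;
    }

    /** Counts data records after the header using the chosen skip routine. */
    static int countRecords(String text, boolean quoteAware) {
        int pos = skipLineNaive(text, 0); // the header contains no quoted newline
        int count = 0;
        while (pos < text.length()) {
            pos = quoteAware
                ? skipRecordQuoteAware(text, pos)
                : skipLineNaive(text, pos);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("naive:       " + countRecords(SAMPLE, false)); // 5
        System.out.println("quote-aware: " + countRecords(SAMPLE, true));  // 4
    }
}
```

The naive skip reproduces the five-line count reported above, while tracking a single in-quotes flag is enough to recover the expected four records for this input.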