Paul Rogers created DRILL-5487:
----------------------------------

             Summary: Vector corruption in CSV with headers and truncated last row
                 Key: DRILL-5487
                 URL: https://issues.apache.org/jira/browse/DRILL-5487
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.10.0
            Reporter: Paul Rogers

The CSV format plugin allows two ways of reading data:

* As named columns
* As a single array, called {{columns}}, that holds all columns for a row

The named columns feature will corrupt the offset vectors if the last row of the file is truncated, that is, if it leaves off one or more columns.

To illustrate the CSV data corruption, I created a CSV file, test4.csv, of the following form:

{code}
h,u
abc,def
ghi
{code}

Note that the file is truncated: the comma and second field are missing on the last line.

Then, I created a simple test using the "cluster fixture" framework:

{code}
  @Test
  public void readerTest() throws Exception {
    FixtureBuilder builder = ClusterFixture.builder()
        .maxParallelization(1);

    try (ClusterFixture cluster = builder.build();
         ClientFixture client = cluster.clientFixture()) {
      TextFormatConfig csvFormat = new TextFormatConfig();
      csvFormat.fieldDelimiter = ',';
      csvFormat.skipFirstLine = false;
      csvFormat.extractHeader = true;
      cluster.defineWorkspace("dfs", "data", "/tmp/data", "csv", csvFormat);
      String sql = "SELECT * FROM `dfs.data`.`csv/test4.csv` LIMIT 10";
      client.queryBuilder().sql(sql).printCsv();
    }
  }
{code}

The results show we've got a problem:

{code}
Exception (no rows returned): org.apache.drill.common.exceptions.UserRemoteException:
SYSTEM ERROR: IllegalArgumentException: length: -3 (expected: >= 0)
{code}

If the last line were:

{code}
efg,
{code}

Then the offset vector for the second column should look like this:

{code}
[0, 3, 3]
{code}

Very likely we have an offset vector that looks like this instead:

{code}
[0, 3, 0]
{code}

When we compute the length of the second column of the second row, we should compute:

{code}
length = offset[2] - offset[1] = 3 - 3 = 0
{code}

Instead we get:

{code}
length = offset[2] - offset[1] = 0 - 3 = -3
{code}

The summary is that a premature EOF appears to cause the "missing" columns to be skipped; they are not filled with a blank value to "bump" the offset vectors to fill in the last row. Instead, the final offsets are left at 0, causing havoc downstream in the query.

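The offset arithmetic above can be reproduced outside Drill with a small standalone sketch. Everything in it is hypothetical and for illustration only: {{OffsetVectorSketch}}, {{buildOffsets}} and {{valueLength}} are not Drill classes or methods, and the plain {{int[]}} merely stands in for the offset vector of a variable-width column. It assumes ASCII data, so character counts equal byte counts.

```java
import java.util.Arrays;

// Hypothetical illustration only: a plain int[] stands in for the offset
// vector of one variable-width column. None of these names exist in Drill.
public class OffsetVectorSketch {

  // Builds the offsets for one column. `values` holds only the rows that
  // were actually written; `rowCount` is the number of rows in the batch.
  // With backFill = false, rows that were never written keep the default
  // offset of 0, which is the corruption suspected in this report.
  static int[] buildOffsets(String[] values, int rowCount, boolean backFill) {
    int[] offsets = new int[rowCount + 1];
    int end = 0;
    for (int i = 0; i < values.length; i++) {
      end += values[i].length(); // ASCII assumed: chars == bytes
      offsets[i + 1] = end;
    }
    if (backFill) {
      // Treat each missing value as empty: repeat the last end offset.
      for (int i = values.length; i < rowCount; i++) {
        offsets[i + 1] = end;
      }
    }
    return offsets;
  }

  // Length of the value in `row`, computed the way a reader would.
  static int valueLength(int[] offsets, int row) {
    return offsets[row + 1] - offsets[row];
  }

  public static void main(String[] args) {
    // Second column of the example: "def" in row 0, nothing in row 1.
    String[] secondColumn = {"def"};

    int[] corrupt = buildOffsets(secondColumn, 2, false);
    int[] filled = buildOffsets(secondColumn, 2, true);

    // Prints: [0, 3, 0] -> length -3
    System.out.println(Arrays.toString(corrupt) + " -> length " + valueLength(corrupt, 1));
    // Prints: [0, 3, 3] -> length 0
    System.out.println(Arrays.toString(filled) + " -> length " + valueLength(filled, 1));
  }
}
```

The back-fill loop is one possible shape for a fix, assuming the reader knows the batch row count when it hits EOF; where such a step would belong in the actual text reader is not established here.
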
-- This message was sent by Atlassian JIRA (v6.3.15#6346)