[ https://issues.apache.org/jira/browse/DRILL-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Khurram Faraaz updated DRILL-5487:
----------------------------------
    Component/s: Storage - Text & CSV

> Vector corruption in CSV with headers and truncated last row
> ------------------------------------------------------------
>
>                 Key: DRILL-5487
>                 URL: https://issues.apache.org/jira/browse/DRILL-5487
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>
> The CSV format plugin allows two ways of reading data:
> * As named columns
> * As a single array, called {{columns}}, that holds all columns for a row
> The named columns feature will corrupt the offset vectors if the last row of
> the file is truncated, that is, if it leaves off one or more columns.
> To illustrate the CSV data corruption, I created a CSV file, test4.csv, of
> the following form:
> {code}
> h,u
> abc,def
> ghi
> {code}
> Note that the file is truncated: the comma and second field are missing on
> the last line.
> Then, I created a simple test using the "cluster fixture" framework:
> {code}
>   @Test
>   public void readerTest() throws Exception {
>     FixtureBuilder builder = ClusterFixture.builder()
>         .maxParallelization(1);
>     try (ClusterFixture cluster = builder.build();
>          ClientFixture client = cluster.clientFixture()) {
>       TextFormatConfig csvFormat = new TextFormatConfig();
>       csvFormat.fieldDelimiter = ',';
>       csvFormat.skipFirstLine = false;
>       csvFormat.extractHeader = true;
>       cluster.defineWorkspace("dfs", "data", "/tmp/data", "csv", csvFormat);
>       String sql = "SELECT * FROM `dfs.data`.`csv/test4.csv` LIMIT 10";
>       client.queryBuilder().sql(sql).printCsv();
>     }
>   }
> {code}
> The results show we've got a problem:
> {code}
> Exception (no rows returned):
> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> IllegalArgumentException: length: -3 (expected: >= 0)
> {code}
> If the last line instead ended with a comma (an empty second field):
> {code}
> efg,
> {code}
> Then the offset vector for the second column should look like this:
> {code}
> [0, 3, 3]
> {code}
> Very likely we have an offset vector that looks like this instead:
> {code}
> [0, 3, 0]
> {code}
> When we compute the length of the second column of the second row, we
> should compute:
> {code}
> length = offset[2] - offset[1] = 3 - 3 = 0
> {code}
> Instead we get:
> {code}
> length = offset[2] - offset[1] = 0 - 3 = -3
> {code}
> In summary, a premature EOF appears to cause the "missing" columns to be
> skipped; they are not filled with a blank value to "bump" the offset
> vectors for the last row. Instead, the offsets are left at 0, causing havoc
> downstream in the query.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
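
The offset arithmetic in the report can be reproduced with a small standalone sketch. It does not use Drill's vector classes; the class name `OffsetVectorSketch`, its methods, and the `fillMissing` flag are hypothetical and exist only to model the suspected behavior. It assumes, as the report does, that an unwritten offset entry keeps its initial value of 0.

```java
import java.util.Arrays;

// Hypothetical model of a variable-width column's offset vector.
// This is NOT Drill's implementation; it only mirrors the arithmetic in the report.
class OffsetVectorSketch {

    // Builds offsets for rowCount rows when only written.length values were
    // actually written to the column. If fillMissing is true, the trailing rows
    // are back-filled as empty values; otherwise their entries stay at 0.
    static int[] buildOffsets(String[] written, int rowCount, boolean fillMissing) {
        int[] offsets = new int[rowCount + 1]; // zero-initialized, offsets[0] == 0
        for (int row = 0; row < written.length; row++) {
            offsets[row + 1] = offsets[row] + written[row].length();
        }
        if (fillMissing) {
            // "Bump" the offsets: an empty value repeats the previous end offset.
            for (int row = written.length; row < rowCount; row++) {
                offsets[row + 1] = offsets[row];
            }
        }
        return offsets;
    }

    // Length of the value at the given row: end offset minus start offset.
    static int valueLength(int[] offsets, int row) {
        return offsets[row + 1] - offsets[row];
    }

    public static void main(String[] args) {
        // Column "u" from test4.csv: row 0 holds "def", row 1 was never written.
        String[] written = {"def"};

        int[] filled = buildOffsets(written, 2, true);
        System.out.println(Arrays.toString(filled) + " -> length " + valueLength(filled, 1));

        int[] skipped = buildOffsets(written, 2, false);
        System.out.println(Arrays.toString(skipped) + " -> length " + valueLength(skipped, 1));
    }
}
```

With the back-fill enabled, the second row of the column has length 0; without it, the stale 0 entry yields the same negative length (-3) that appears in the exception message. Whether the actual reader behaves this way is the report's hypothesis, not something this sketch confirms.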