[jira] [Created] (DRILL-5487) Vector corruption in CSV with headers and truncated last row

Paul Rogers (JIRA) Mon, 08 May 2017 14:11:35 -0700

Paul Rogers created DRILL-5487:
----------------------------------

             Summary: Vector corruption in CSV with headers and truncated last row
                 Key: DRILL-5487
                 URL: https://issues.apache.org/jira/browse/DRILL-5487
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.10.0
            Reporter: Paul Rogers



The CSV format plugin allows two ways of reading data:

* As named columns
* As a single array, called {{columns}}, that holds all columns for a row

The named-columns feature corrupts the offset vectors if the last row of 
the file is truncated, that is, if it leaves off one or more columns.

To illustrate the CSV data corruption, I created a CSV file, test4.csv, of the 
following form:

{code}
h,u
abc,def
ghi
{code}

Note that the file is truncated: the comma and second field are missing on the 
last line.
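
For anyone reproducing this, the file can be generated with a few lines of plain Java. This is only a sketch: the class name {{TruncatedCsvFile}} is made up for illustration, the directory {{/tmp/data/csv}} is inferred from the workspace root and query path used in the test, and the trailing newline after the last row is an assumption.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Drill: writes the truncated sample file.
final class TruncatedCsvFile {

  // Creates test4.csv in the given directory and returns its path.
  static Path write(Path dir) throws IOException {
    Files.createDirectories(dir);
    Path file = dir.resolve("test4.csv");
    // Header, one complete row, then a last row with no comma and no second field.
    // The final "\n" is an assumption; the report does not say how the file ends.
    String content = "h,u\nabc,def\nghi\n";
    Files.write(file, content.getBytes(StandardCharsets.UTF_8));
    return file;
  }

  public static void main(String[] args) throws IOException {
    // Assumed location: workspace root /tmp/data plus the csv/ prefix from the query.
    System.out.println(write(Paths.get("/tmp/data/csv")));
  }
}
```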

Then, I created a simple test using the "cluster fixture" framework:

{code}
  @Test
  public void readerTest() throws Exception {
    FixtureBuilder builder = ClusterFixture.builder()
        .maxParallelization(1);

    try (ClusterFixture cluster = builder.build();
         ClientFixture client = cluster.clientFixture()) {
      TextFormatConfig csvFormat = new TextFormatConfig();
      csvFormat.fieldDelimiter = ',';
      csvFormat.skipFirstLine = false;
      csvFormat.extractHeader = true;
      cluster.defineWorkspace("dfs", "data", "/tmp/data", "csv", csvFormat);
      String sql = "SELECT * FROM `dfs.data`.`csv/test4.csv` LIMIT 10";
      client.queryBuilder().sql(sql).printCsv();
    }
  }
{code}

The results show we've got a problem:

{code}
Exception (no rows returned): 
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
IllegalArgumentException: length: -3 (expected: >= 0)
{code}

If the last line were complete, with the trailing comma and an empty second field:

{code}
ghi,
{code}

Then the offset vector for the second column should look like this:

{code}
[0, 3, 3]
{code}

Very likely we have an offset vector that looks like this instead:

{code}
[0, 3, 0]
{code}

When we compute the length of the second column of the second row, we should compute:

{code}
length = offset[2] - offset[1] = 3 - 3 = 0
{code}

Instead we get:

{code}
length = offset[2] - offset[1] = 0 - 3 = -3
{code}
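
The same arithmetic can be checked in a few lines of standalone Java. This sketch uses plain {{int}} arrays in place of Drill's offset vectors, and the class and method names are made up for illustration. The second array is the suspected state, not one observed in a debugger.

```java
// Hypothetical stand-in for an offset vector lookup; not Drill code.
final class OffsetLengths {

  // Length of entry `index` in a variable-width column:
  // the end offset of the entry minus its start offset.
  static int length(int[] offsets, int index) {
    return offsets[index + 1] - offsets[index];
  }

  public static void main(String[] args) {
    int[] expected = {0, 3, 3};   // second row holds an empty value
    int[] suspected = {0, 3, 0};  // second row's end offset never written

    System.out.println(length(expected, 1));   // prints 0
    System.out.println(length(suspected, 1));  // prints -3
  }
}
```

The -3 matches the length reported in the {{IllegalArgumentException}} above.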

The summary is that a premature EOF appears to cause the "missing" columns to 
be skipped; they are not filled with a blank value to "bump" the offset vectors 
to fill in the last row. Instead, they are left at 0, causing havoc downstream 
in the query.
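
To make the "bump" idea concrete, here is a toy model of a variable-width column written in plain Java. It is a sketch under stated assumptions, not Drill's vector code: {{ToyVarCharColumn}}, {{write}} and {{fillEmpties}} are hypothetical names, and string length stands in for byte length (fine for ASCII data such as this example).

```java
import java.util.Arrays;

// Toy model of a variable-width column; hypothetical, not Drill's implementation.
final class ToyVarCharColumn {
  private final int[] offsets;   // offsets[i + 1] is the end of row i's data
  private int lastWritten = -1;  // highest row index that has an end offset
  private int dataEnd = 0;       // total data length written so far

  ToyVarCharColumn(int maxRows) {
    offsets = new int[maxRows + 1];
  }

  // Writes a value for `row`, first back-filling any rows that were skipped.
  void write(int row, String value) {
    fillEmpties(row);
    dataEnd += value.length();
    offsets[row + 1] = dataEnd;
    lastWritten = row;
  }

  // Gives every unwritten row before `row` an empty value by repeating
  // the current end offset, so no end offset is left at 0.
  void fillEmpties(int row) {
    for (int r = lastWritten + 1; r < row; r++) {
      offsets[r + 1] = dataEnd;
    }
    lastWritten = Math.max(lastWritten, row - 1);
  }

  // Returns the offsets covering the first `rowCount` rows.
  int[] offsets(int rowCount) {
    return Arrays.copyOf(offsets, rowCount + 1);
  }

  public static void main(String[] args) {
    ToyVarCharColumn u = new ToyVarCharColumn(2);
    u.write(0, "def");                                  // row 1 of the file: "abc,def"
    System.out.println(Arrays.toString(u.offsets(2)));  // prints [0, 3, 0]
    u.fillEmpties(2);                                   // end of file: pad the truncated row
    System.out.println(Arrays.toString(u.offsets(2)));  // prints [0, 3, 3]
  }
}
```

In this model the fix amounts to calling the fill step once for the final row count when the reader hits EOF, before the batch is handed downstream.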



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
