[jira] [Commented] (DRILL-5486) Compliant text reader tries, but fails, to ignore empty rows

Paul Rogers (JIRA) Mon, 08 May 2017 16:29:57 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001784#comment-16001784
 ]


Paul Rogers commented on DRILL-5486:
------------------------------------

The net result seems to be that the per-batch row count and setting in 
{{BaseRepeatedValueVector}} is buggy, but irrelevant.


> Compliant text reader tries, but fails, to ignore empty rows
> ------------------------------------------------------------
>
>                 Key: DRILL-5486
>                 URL: https://issues.apache.org/jira/browse/DRILL-5486
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> The "compliant" text reader (for CSV files, etc.) has two modes: reading with 
> headers (which creates a scalar vector per file column), or without headers 
> (that creates a single {{columns}} vector with an array of all columns.
> When run in array mode, the code uses a class called 
> {{RepeatedVarCharOutput}}. This class attempts to ignore empty records:
> {code}
>   public void finishRecord() {
>     ...
>     // if there were no defined fields, skip.
>     if (fieldIndex > -1) {
>       batchIndex++;
>       recordCount++;
>     }
>     ...
> {code}
> As it turns out, this works only on the *first* row. On the first row, the 
> {{fieldIndex}} has its initial value of -1. But, for subsequent records, 
> {{fieldIndex}} is never reset and retains its value from the last field set.
> The result is that the code skips the first empty row, but not empty rows 
> elsewhere in the file.
> Further, on the first row, the logic is flawed. The part shown above is only 
> for the row counts. Here is more of the logic:
> {code}
>   @Override
>   public void finishRecord() {
>     recordStart = characterData;
>     ...
>     int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4;
>     PlatformDependent.putInt(repeatedOffset, newOffset);
>     repeatedOffset += 4;
>     // if there were no defined fields, skip.
>     if (fieldIndex > -1) {
>       batchIndex++;
>       recordCount++;
>     }
>   }
> {code}
> Note that the underlying vector *is* bumped for the record, even though the 
> row count is not. Later, this will cause the container to have one less 
> record.
> This would be a problem if the row count ({{batchIndex}}) was used for the 
> batch row count. But, it is not, the next level of code keeps a separate 
> count, so the "skip empty row" logic causes the counts to get out of sync 
> between the {{RepeatedVarCharOutput}} and the next level of reader.
> In sense, there are multiple self-cancelling bugs:
> * The skip-empty-row logic is not synchronized with the upper layer.
> * That bug is partly masked because the logic is not triggered except on the 
> first row.
> * That bug is masked because we always set the row offset, even if we ignore 
> the row in the row count.
> * That bug is masked because the incorrect row count is ignored when setting 
> the container row count.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (DRILL-5486) Compliant text reader tries, but fails, to ignore empty rows

Reply via email to