[ https://issues.apache.org/jira/browse/DRILL-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001784#comment-16001784 ]
Paul Rogers commented on DRILL-5486: ------------------------------------ The net result seems to be that the per-batch row count and setting in {{BaseRepeatedValueVector}} is buggy, but irrelevant. > Compliant text reader tries, but fails, to ignore empty rows > ------------------------------------------------------------ > > Key: DRILL-5486 > URL: https://issues.apache.org/jira/browse/DRILL-5486 > Project: Apache Drill > Issue Type: Bug > Affects Versions: 1.10.0 > Reporter: Paul Rogers > Priority: Minor > > The "compliant" text reader (for CSV files, etc.) has two modes: reading with > headers (which creates a scalar vector per file column), or without headers > (that creates a single {{columns}} vector with an array of all columns. > When run in array mode, the code uses a class called > {{RepeatedVarCharOutput}}. This class attempts to ignore empty records: > {code} > public void finishRecord() { > ... > // if there were no defined fields, skip. > if (fieldIndex > -1) { > batchIndex++; > recordCount++; > } > ... > {code} > As it turns out, this works only on the *first* row. On the first row, the > {{fieldIndex}} has its initial value of -1. But, for subsequent records, > {{fieldIndex}} is never reset and retains its value from the last field set. > The result is that the code skips the first empty row, but not empty rows > elsewhere in the file. > Further, on the first row, the logic is flawed. The part shown above is only > for the row counts. Here is more of the logic: > {code} > @Override > public void finishRecord() { > recordStart = characterData; > ... > int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4; > PlatformDependent.putInt(repeatedOffset, newOffset); > repeatedOffset += 4; > // if there were no defined fields, skip. > if (fieldIndex > -1) { > batchIndex++; > recordCount++; > } > } > {code} > Note that the underlying vector *is* bumped for the record, even though the > row count is not. Later, this will cause the container to have one less > record. > This would be a problem if the row count ({{batchIndex}}) was used for the > batch row count. But, it is not, the next level of code keeps a separate > count, so the "skip empty row" logic causes the counts to get out of sync > between the {{RepeatedVarCharOutput}} and the next level of reader. > In sense, there are multiple self-cancelling bugs: > * The skip-empty-row logic is not synchronized with the upper layer. > * That bug is partly masked because the logic is not triggered except on the > first row. > * That bug is masked because we always set the row offset, even if we ignore > the row in the row count. > * That bug is masked because the incorrect row count is ignored when setting > the container row count. -- This message was sent by Atlassian JIRA (v6.3.15#6346)