[jira] [Created] (HADOOP-18400) Fix file split duplicating records from a succeeding split when reading BZip2 text files

groot (Jira) Wed, 10 Aug 2022 18:25:10 -0700

groot created HADOOP-18400:
------------------------------

             Summary: Fix file split duplicating records from a succeeding split when reading BZip2 text files
                 Key: HADOOP-18400
                 URL: https://issues.apache.org/jira/browse/HADOOP-18400
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 3.3.3, 3.3.4
            Reporter: groot
            Assignee: groot



Fix a data correctness issue with TextInputFormat that can occur when reading 
BZip2 compressed text files. When a file split's range does not include the 
start position of a BZip2 block, the split is expected to contain no records 
(i.e. the split is empty). However, if the (exclusive) end of this split 
happens to coincide with the start of a BZip2 block, LineRecordReader ends up 
returning all the records for that BZip2 block. The job then reads those 
records twice, because the next split also returns all the records for the 
same block (its range includes the start of that block).

This bug is not triggered when the file split's range includes the start of 
at least one block and ends just before the start of another block. The reason 
lies in when BZip2CompressionInputStream updates its position in the BYBLOCK 
READMODE. In this read mode, the stream's position is only updated, while 
reading, when the first byte past an end-of-block marker is read. The bug is 
that if the stream, when initialized, was adjusted to be at the end of one 
block, the position is not updated after the first byte of the next block is 
read. Instead, the position stays equal to the block marker the stream was 
initialized to. If the exclusive end position of the split is equal to the 
stream's position, LineRecordReader continues to read lines until the position 
is updated (and an additional record in the next block is read if needed).
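
The behaviour described above can be illustrated with a small toy model. This 
is only a sketch under stated assumptions, not Hadoop's actual code: the class 
Bzip2SplitModel, the block offsets, the record names, and the "fixed" flag are 
all hypothetical. The model assumes the reader keeps reading while the reported 
position is <= the split end, and that the position only moves when the stream 
crosses into the next block. The "fixed" flag models one possible correction, 
where a stream whose start was adjusted to a block marker reports a position 
just past that marker.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these names exist in Hadoop.
class Bzip2SplitModel {
    // Hypothetical compressed offsets at which each BZip2 block starts.
    static final long[] BLOCK_STARTS = {0L, 100L, 200L};
    // Hypothetical records stored in each block.
    static final String[][] BLOCK_RECORDS = {{"a1", "a2"}, {"b1", "b2"}, {"c1"}};

    // Returns the records a reader would emit for the split [start, end).
    static List<String> readSplit(long start, long end, boolean fixed) {
        List<String> out = new ArrayList<>();

        // On initialization, skip forward to the first block marker at or after start.
        int block = 0;
        while (block < BLOCK_STARTS.length && BLOCK_STARTS[block] < start) {
            block++;
        }
        if (block == BLOCK_STARTS.length) {
            return out; // no block begins at or after the split start
        }

        boolean adjusted = BLOCK_STARTS[block] != start;
        long pos = BLOCK_STARTS[block];
        if (adjusted && fixed) {
            // Assumed correction: report a position just past the marker.
            pos += 1;
        }

        // The position never changes inside a block, so the per-record check
        // "pos <= end" is modeled as one check per block.
        while (pos <= end && block < BLOCK_STARTS.length) {
            for (String record : BLOCK_RECORDS[block]) {
                out.add(record);
            }
            block++;
            if (block < BLOCK_STARTS.length) {
                // Crossing the end-of-block marker is the only time the position moves.
                pos = BLOCK_STARTS[block] + 1;
            }
        }
        return out;
    }
}
```

In this model, readSplit(50, 100, false) and readSplit(100, 200, false) both 
return the records of the block at offset 100, which is the duplication 
described in this issue, while readSplit(50, 100, true) returns no records.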



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org
