Ankit Kamboj created HADOOP-11445:
-------------------------------------

             Summary: Bzip2Codec: Data block is skipped when position of newly 
created stream is equal to start of split
                 Key: HADOOP-11445
                 URL: https://issues.apache.org/jira/browse/HADOOP-11445
             Project: Hadoop Common
          Issue Type: Bug
            Reporter: Ankit Kamboj


bz2 input files are handled by FileInputFormat and LineRecordReader. 
LineRecordReader creates a bz2-specific compressed input stream to iterate 
over records. Each newly created stream is positioned at the beginning of the 
next data block. The logic that finds that block depends on the start of the 
split: the search begins 10 bytes before the split start. If the first search 
yields a stream whose position is before or at the split start, the beginning 
of the following block is sought, on the assumption that the record reader for 
the previous split has already iterated over the data block containing the 
current split start. When the split start falls exactly on the byte where the 
newly created stream is positioned (the start of a data block), the reader 
therefore still seeks the beginning of the next data block. This does not seem 
correct: that whole block is skipped and its records are missed.
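
The behaviour described above can be illustrated with a small, simplified 
model. This is only a sketch and not the actual Bzip2Codec or LineRecordReader 
code: the class Bzip2SplitSketch, its methods, and the block offsets are 
hypothetical, and block starts are modelled as a sorted array of byte offsets 
rather than being detected from block markers in the compressed stream. The 
skipWhenEqual flag switches between the comparison as reported ("before or at" 
the split start) and a strict "before" comparison, which is one possible fix.

```java
public class Bzip2SplitSketch {
    // The report states that the search begins 10 bytes before the split start.
    static final long LOOK_BACK = 10;

    // Hypothetical helper: first block start at or after 'from', or -1 if none.
    // 'blockStarts' must be sorted in ascending order.
    static long nextBlockStart(long[] blockStarts, long from) {
        for (long b : blockStarts) {
            if (b >= from) {
                return b;
            }
        }
        return -1;
    }

    // Returns the offset of the first block the reader for this split would read.
    // skipWhenEqual = true  models the reported behaviour (skip if pos <= splitStart).
    // skipWhenEqual = false models a strict comparison (skip only if pos < splitStart).
    static long firstBlockForSplit(long[] blockStarts, long splitStart,
                                   boolean skipWhenEqual) {
        long pos = nextBlockStart(blockStarts, Math.max(0, splitStart - LOOK_BACK));
        if (pos < 0) {
            return -1; // no block found at or after the search position
        }
        boolean skip = skipWhenEqual ? pos <= splitStart : pos < splitStart;
        if (skip) {
            // Assume the previous split's reader already consumed this block.
            pos = nextBlockStart(blockStarts, pos + 1);
        }
        return pos;
    }

    public static void main(String[] args) {
        long[] blocks = {4, 100, 200}; // made-up block start offsets
        // Split start coincides with the start of the block at offset 100.
        System.out.println(firstBlockForSplit(blocks, 100, true));  // prints 200
        System.out.println(firstBlockForSplit(blocks, 100, false)); // prints 100
    }
}
```

In this model, a split starting at offset 100 begins reading at 200 under the 
"before or at" comparison, so the block at 100 is never read if the previous 
split's reader stopped before it, as the report implies. With the strict 
comparison the reader starts at 100.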



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
