[ https://issues.apache.org/jira/browse/HADOOP-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16221294#comment-16221294 ]

Chris Douglas commented on HADOOP-14919:
----------------------------------------

+1 for removing the seek backward. The text reader also had bugs from that. 
Ironically, they were discovered/fixed as part of adding splittable codecs.

Looking at the code, would this support concatenated bzip2 files? The reader 
handling the previous block will detect the end of its stream, and a split 
following it should find the block delimiter after the header of the next file. 
However, if the text splits land around the concat point, the {{BZh9}} bytes may 
not be accounted for. The codec skips these at the beginning of the file and 
updates {{reportedBytesReadFromCompressedStream}}, but I didn't see handling 
for this within the stream. Similarly, if splits are arranged like this:
{noformat}
file.txt.bz2: [BZh93141592659xxxxxxxx3141592659xxxxxxxx0x177245385090BZh93141592659ooooooo3141592659xxxxxxxx...]
               ^split0                                                             ^split1
{noformat}

Would split0 pick up the {{ooooooo}} bytes?
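
To make the concat case concrete, here is a quick standalone scan of the raw bytes 
(a sketch only, not the codec itself; the class name and default file name are made 
up for illustration). It prints the offset of every {{BZh[1-9]}} header it sees; the 
point is that any header found past offset 0 also has to be counted toward the 
position reported to the record reader, or splits straddling the concat point will 
mis-attribute records:
{code:java}
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BzHeaderScan {
  public static void main(String[] args) throws IOException {
    // Hypothetical input file; pass a real concatenated .bz2 as args[0].
    String path = args.length > 0 ? args[0] : "file.txt.bz2";
    try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
      long pos = 0;                   // offset of the byte currently in b
      int b0 = -1, b1 = -1, b2 = -1;  // previous three bytes
      int b;
      while ((b = in.read()) != -1) {
        if (b0 == 'B' && b1 == 'Z' && b2 == 'h' && b >= '1' && b <= '9') {
          // pos is the offset of the block-size digit; the header began 3 bytes earlier.
          // (Heuristic scan: the same byte pattern can also occur inside compressed data.)
          System.out.println("BZh" + (char) b + " header at offset " + (pos - 3));
        }
        b0 = b1; b1 = b2; b2 = b;
        pos++;
      }
      System.out.println("total compressed bytes: " + pos);
    }
  }
}
{code}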

It doesn't look like the unit tests cover the combination of multi-byte 
delimiters and splittable codecs. I don't know how thoroughly we can test that 
without getting too deep into bzip2...

> BZip2 drops records when reading data in splits
> -----------------------------------------------
>
>                 Key: HADOOP-14919
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14919
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1
>            Reporter: Aki Tanaka
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: 250000.bz2, HADOOP-14919-test.patch, 
> HADOOP-14919.001.patch
>
>
> BZip2 can drop records when reading data in splits. This problem was already 
> discussed in HADOOP-11445 and HADOOP-13270, but a corner case remains that 
> causes blocks of data to be lost.
>  
> I attached a unit test for this issue; running it reproduces the problem.
>  
> First, this issue happens when the position of the newly created stream is 
> equal to the start of the split. Hadoop has some test cases for this (the 
> blockEndingInCR.txt.bz2 file for 
> TestLineRecordReader#testBzip2SplitStartAtBlockMarker, etc.). However, the 
> issue I am reporting does not happen in those tests, because it occurs only 
> when the byte block at the start of the split contains both a block marker 
> and compressed data.
>  
> BZip2 block marker - 0x314159265359 
> (001100010100000101011001001001100101001101011001)
>  
> blockEndingInCR.txt.bz2 (Start of Split - 136504):
> {code:java}
> $ xxd -l 6 -g 1 -b -seek 136498 
> ./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/target/test-classes/blockEndingInCR.txt.bz2
> 0021532: 00110001 01000001 01011001 00100110 01010011 01011001  1AY&SY
> {code}
>  
> Test bz2 File (Start of Split - 203426)
> {code:java}
> $ xxd -l 7 -g 1 -b -seek 203419 250000.bz2
> 0031a9b: 11100110 00101000 00101011 00100100 11001010 01101011  .(+$.k
> 0031aa1: 00101111                                               /
> {code}
>  
> Let's say a job splits this test bz2 file into two splits at this position 
> (203426).
> The former split does not read the records starting at position 203426, 
> because BZip2 reports the position of those records as 203427. The latter 
> split does not read them either, because BZip2CompressionInputStream reads 
> its block starting from position 320955.
> Due to this behavior, records between 203427 and 320955 are lost.
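> 
> In other words (a simplified sketch of the ownership rule, not the actual record 
> reader code; the class name is made up and the numbers are the ones above): a split 
> only claims records whose reported start position is at or before the split end, 
> and the next split only sees data from the point where its decompressor resyncs, 
> so neither split claims the range in between:
> {code:java}
> public class DroppedRange {
>   public static void main(String[] args) {
>     long split0End   = 203426L;  // end of the first split
>     long reportedPos = 203427L;  // position BZip2 reports for the next record
>     long split1Sync  = 320955L;  // where the second split's stream resyncs
> 
>     boolean readBySplit0 = reportedPos <= split0End;   // false: 203427 > 203426
>     boolean readBySplit1 = reportedPos >= split1Sync;  // false: 203427 < 320955
> 
>     if (!readBySplit0 && !readBySplit1) {
>       System.out.println("records in [" + reportedPos + ", " + split1Sync
>           + ") are read by neither split");
>     }
>   }
> }
> {code}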
> Also, if we revert the changes from HADOOP-13270, this issue no longer 
> appears, although the HADOOP-13270 issue comes back.


