[ https://issues.apache.org/jira/browse/HADOOP-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826706#comment-13826706 ]
Jason Lowe commented on HADOOP-9622:
------------------------------------

Turns out there's already a followup for multibyte custom delimiters at HADOOP-9867, so I'll add the testcase and relevant details to that JIRA.

Thanks for the review, Chris. Given your earlier +1 I think this is now ready to go as-is. If there are no objections I'll commit this in the next few days.

> bzip2 codec can drop records when reading data in splits
> --------------------------------------------------------
>
>                 Key: HADOOP-9622
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9622
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 2.0.4-alpha, 0.23.8
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: HADOOP-9622-2.patch, HADOOP-9622-testcase.patch,
> HADOOP-9622.patch, blockEndingInCR.txt.bz2, blockEndingInCRThenLF.txt.bz2
>
>
> Bzip2Codec.BZip2CompressionInputStream can cause records to be dropped when
> reading them in splits based on where record delimiters occur relative to
> compression block boundaries.
> Thanks to [~knoguchi] for discovering this problem while working on PIG-3251.

--
This message was sent by Atlassian JIRA
(v6.1#6144)