[ https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354819#comment-16354819 ]
Aki Tanaka commented on HADOOP-15206:
-------------------------------------

Thank you very much for the comments!

{quote}
This doesn't just drop the first bzip2 block, it drops the entire split. This goes back to my previous comment about the code assuming splits that start between bytes 1-4 are always tiny. Splits do not have to be equally sized, so theoretically there could be just two splits where the first split is a two-byte split starting at offset 0 and the other split is the rest of the file.
{quote}

Thank you for explaining the details. I understand the problem.

{quote}
The logic regarding the header seems backwards. If the header is stripped then that means there was a header present, yet the logic is only adding up bytes for a header length if it was not stripped, which is the case when the header is not there.
{quote}

That's right... Thank you for pointing this out.

After some tests, I noticed the following two points.

1. When reading from positions 1-3 (inside the bzip2 header), isHeaderStripped/isSubHeaderStripped are always false. This is because the current readStreamHeader() works only when the start position is 0.

2. I set the InputStream's start position to one byte past the start of the first bzip2 block (header_len + 1), but the duplicated-records issue still happened. When I set it to header_len + 5 (9), the problem is avoided. As far as I can tell from examining the test bz2 file in a binary editor, the first bz2 block marker starts at position 4 (right after the bz2 header). I am still trying to understand why we need header_len + 5.
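The stream layout observed in the binary editor, and the reviewer's point about splits starting between bytes 1-4, can be checked with a small sketch using Python's stdlib bz2 module. This is a simplified byte-level model only (real block boundaries inside a bzip2 stream are bit-aligned, and Hadoop's handling lives in CBZip2InputStream); the split-ownership loop is a hypothetical illustration, not Hadoop's actual code:

```python
import bz2

# Compress 100 newline-terminated records, like the test in the report.
data = b"".join(b"%d\n" % i for i in range(100))
comp = bz2.compress(data)

# Stream header: the 3-byte magic "BZh" plus a block-size digit '1'..'9',
# so the header is 4 bytes long.
HEADER_LEN = 4
assert comp[:3] == b"BZh" and comp[3:4].isdigit()

# The first compressed block begins immediately after the header with the
# 48-bit block magic 0x314159265359 -- i.e. at offset 4, matching what the
# binary editor showed.
BLOCK_MAGIC = bytes.fromhex("314159265359")
assert comp[HEADER_LEN:HEADER_LEN + 6] == BLOCK_MAGIC

# Model the reviewer's point: with split size 3, the split that starts at
# offset 3 (a start "between bytes 1-4") is the one whose byte range [3, 6)
# covers the first block marker at offset 4. A split starting inside the
# header is not guaranteed to be tiny or empty, so it cannot simply be
# dropped.
split_size = 3
owning = [
    (start, start + split_size)
    for start in range(0, len(comp), split_size)
    if start <= HEADER_LEN < start + split_size
]
print("split covering first block marker:", owning)  # [(3, 6)]
```

Under this model, dropping every split that starts between bytes 1-4 would lose the first block (the dropped-record symptom), while letting two splits both claim it would duplicate records.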
> BZip2 drops and duplicates records when input split size is small
> -----------------------------------------------------------------
>
>                 Key: HADOOP-15206
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15206
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.8.3, 3.0.0
>            Reporter: Aki Tanaka
>            Priority: Major
>         Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, HADOOP-15206.002.patch
>
> BZip2 can drop and duplicate records when the input split size is small. I confirmed that this issue happens when the input split size is between 1 byte and 4 bytes.
> I am seeing the following two problem behaviors.
>
> 1. Dropped record:
> BZip2 skips the first record in the input file when the input split size is small.
>
> Set the split size to 3 and loaded 100 records (0, 1, 2, ..., 99):
> {code:java}
> 2018-02-01 10:52:33,502 INFO [Thread-17] mapred.TestTextInputFormat
> (TestTextInputFormat.java:verifyPartitions(317)) -
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
> count=99{code}
>
> The input format read only 99 records, not 100.
>
> 2. Duplicated record:
> Two input splits contain the same BZip2 records when the input split size is small.
>
> Set the split size to 1 and loaded 100 records (0, 1, 2, ..., 99):
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
> count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4
> at position 8
> {code}
>
> I experienced this error when executing a Spark (SparkSQL) job under the following conditions:
> * The input files are small (around 1 KB)
> * The Hadoop cluster has many slave nodes (able to launch many executor tasks)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)