[ https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357510#comment-16357510 ]
Aki Tanaka commented on HADOOP-15206:
-------------------------------------

I added the updated patch. Please let me know if I still misunderstand something.

I made the following changes to the original code:
* Do not advertise a new byte position when reading from the BZip2 header (position 0)
* Move the reading position to right after the BZip2 header (position 5) when the position is between 1 and 4

This implementation forcibly moves the start position without checking whether the BZ2 file has a header, because I could not determine whether the header exists when the start position is 4. However, I think it is safe to move the position even if the file does not have a bz2 header, because two bz2 blocks cannot fit in the first 4 bytes of the file.

> BZip2 drops and duplicates records when input split size is small
> -----------------------------------------------------------------
>
>                 Key: HADOOP-15206
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15206
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.8.3, 3.0.0
>            Reporter: Aki Tanaka
>            Priority: Major
>         Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch,
> HADOOP-15206.002.patch, HADOOP-15206.003.patch
>
>
> BZip2 can drop and duplicate records when the input split size is small. I
> confirmed that this issue happens when the input split size is between 1 byte
> and 4 bytes.
> I am seeing the following 2 problem behaviors.
>
> 1. Dropped record:
> BZip2 skips the first record in the input file when the input split size is
> small.
>
> I set the split size to 3 and tested loading 100 records (0, 1, 2..99):
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat
> (TestTextInputFormat.java:verifyPartitions(317)) -
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
> count=99{code}
>
> The input format read only 99 records instead of 100.
>
> 2. Duplicated record:
> Two input splits contain the same BZip2 records when the input split size is
> small.
>
> I set the split size to 1 and tested loading 100 records (0, 1, 2..99):
>
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
> count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4
> at position 8
> {code}
>
> I experienced this error when executing a Spark (SparkSQL) job under the
> following conditions:
> * The input files are small (around 1KB)
> * The Hadoop cluster has many slave nodes (able to launch many executor tasks)
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
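
The start-position adjustment described in the comment above can be sketched as follows. This is only an illustration of the stated rule (position 0 is left alone and not advertised, positions 1 to 4 are moved to position 5), not the code from the attached patches. The class name `Bzip2SplitStartSketch`, both method names, and the constants are hypothetical, and the reading of "not advertising" as a simple boolean is an assumption.

```java
// Hypothetical sketch of the rule described in the comment; not the patch itself.
final class Bzip2SplitStartSketch {

    // Assumption taken from the comment: positions 1..4 fall inside the
    // BZip2 header, and reading resumes at position 5.
    static final long LAST_HEADER_POSITION = 4;
    static final long POSITION_AFTER_HEADER = 5;

    private Bzip2SplitStartSketch() {
    }

    // Returns the position reading should start from for a given split start.
    // The move is unconditional: it does not check whether a header exists.
    static long adjustStart(long splitStart) {
        if (splitStart < 0) {
            throw new IllegalArgumentException("negative split start: " + splitStart);
        }
        if (splitStart >= 1 && splitStart <= LAST_HEADER_POSITION) {
            return POSITION_AFTER_HEADER;
        }
        return splitStart;
    }

    // Assumed interpretation: a split that starts at the header (position 0)
    // does not advertise a new byte position; any other split does.
    static boolean advertisesNewPosition(long splitStart) {
        return splitStart != 0;
    }

    public static void main(String[] args) {
        // Print the adjusted start for the split starts around the header.
        for (long start = 0; start <= 6; start++) {
            System.out.println(start + " -> " + adjustStart(start)
                + ", advertises=" + advertisesNewPosition(start));
        }
    }
}
```

With a split size of 1, the splits starting at positions 1, 2, 3 and 4 would all map to the same start position under this rule, which is why the move has to be paired with consistent position reporting so that only one split claims the first block.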