Qiming He created HADOOP-9442:
---------------------------------

             Summary: Splitting issue when using NLineInputFormat with compression
                 Key: HADOOP-9442
                 URL: https://issues.apache.org/jira/browse/HADOOP-9442
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 1.1.2
         Environment: Tried on Apache Hadoop 1.1.1, CDH4, and Amazon EMR. Same result.
            Reporter: Qiming He
            Priority: Minor

$ cat abook.txt | base64 -w 0 > onelinetext.b64
$ hadoop fs -put onelinetext.b64 /input/onelinetext.b64
$ hadoop jar hadoop-streaming.jar \
    -input /input/onelinetext.b64 \
    -output /output \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -mapper wc

Number of map tasks: 1, and the output has one line:
Line 1: 1 2 202699
This makes sense, because one line per mapper is intended.

Then, using compression with NLineInputFormat:

$ bzip2 onelinetext.b64
$ hadoop fs -put onelinetext.b64.bz2 /input/onelinetext.b64.bz2
$ hadoop jar hadoop-streaming.jar \
    -Dmapred.input.compress=true \
    -Dmapred.input.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -input /input/onelinetext.b64.bz2 \
    -output /output \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -mapper wc

I expected the same result as above, because decompression should occur before the one-line text is processed (i.e. by wc). Instead, I get:

Number of map tasks: 397 (or another large number, depending on the environment), and the output has 397 lines:
Lines 1-396: 0 0 0
Line 397: 1 2 202699

Any idea why mapred.map.tasks is so much greater than 1? Is the splitting incorrect? I purposely chose gzip because I believe it is NOT splittable. I got similar results when using the bzip2 and lzop codecs.
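
A possible explanation for the task count, offered as an unverified assumption rather than a confirmed diagnosis: if the split computation counts line terminators in the file's raw bytes before any codec is applied, then terminator bytes that happen to occur inside the compressed stream would each end a "line" and produce an extra split. The Python sketch below only illustrates that effect on synthetic data. It does not use Hadoop, the input is random bytes standing in for abook.txt, and the helper names (count_line_records, make_one_line_text) are hypothetical.

```python
import base64
import bz2
import random


def count_line_records(data: bytes) -> int:
    """Count records the way a byte-oriented line reader might.

    bytes.splitlines() breaks on LF, CR and CRLF; a trailing chunk
    without a terminator still counts as one record.
    """
    return len(data.splitlines())


def make_one_line_text(n_bytes: int = 150_000, seed: int = 0) -> bytes:
    """Stand-in for `base64 -w 0`: base64 text with no line breaks."""
    rng = random.Random(seed)
    raw = bytes(rng.getrandbits(8) for _ in range(n_bytes))
    return base64.b64encode(raw)


one_line = make_one_line_text()
compressed = bz2.compress(one_line)

# Records seen when scanning the uncompressed one-line text.
records_plain = count_line_records(one_line)
# Records seen when scanning the compressed bytes without decompressing.
records_raw_compressed = count_line_records(compressed)
# Records seen when decompressing first, then scanning.
records_after_decompress = count_line_records(bz2.decompress(compressed))

print("plain text records:        ", records_plain)
print("raw compressed records:    ", records_raw_compressed)
print("decompressed-first records:", records_after_decompress)
```

If this assumption holds, the number of map tasks would track the number of line-terminator bytes in the compressed file rather than the number of lines in the original text, which would also explain why the count varies with the codec and environment.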