Qiming He created HADOOP-9442:
---------------------------------

             Summary: Splitting issue when using NLineInputFormat with 
compression
                 Key: HADOOP-9442
                 URL: https://issues.apache.org/jira/browse/HADOOP-9442
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 1.1.2
         Environment: Tried on Apache Hadoop 1.1.1, CDH4, and Amazon EMR, with 
the same result.
            Reporter: Qiming He
            Priority: Minor



$ cat abook.txt | base64 -w 0 >onelinetext.b64
$ hadoop fs -put onelinetext.b64 /input/onelinetext.b64
$ hadoop jar hadoop-streaming.jar  \
    -input /input/onelinetext.b64 \
    -output /output \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -mapper wc
Number of tasks: 1, and the output has one line:
Line 1: 1 2 202699
which makes sense because one line per mapper is intended.

Then, using compression with NLineInputFormat:
$ bzip2 onelinetext.b64
$ hadoop fs -put onelinetext.b64.bz2  /input/onelinetext.b64.bz2
$ hadoop jar hadoop-streaming.jar \
      -Dmapred.input.compress=true \
      -Dmapred.input.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
      -input /input/onelinetext.b64.bz2 \
      -output /output \
      -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
      -mapper wc
I expected the same result as above, because decompression should occur 
before the one-line text is processed (i.e., by wc). However, I am getting:

Number of tasks: 397 (or another large number, depending on the environment), 
and the output has 397 lines:
Lines 1-396: 0 0 0
Line 397: 1 2 202699

Any idea why there are so many map tasks (mapred.map.tasks >> 1)? Is the 
splitting incorrect? I purposely chose gzip because I believe it is NOT 
splittable. I got similar results when using the bzip2 and lzop codecs.
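
One way to narrow this down locally, offered as a sketch rather than a 
diagnosis: count the newline bytes in the compressed file and compare with the 
decompressed stream. If the raw count turns out to be close to the number of 
map tasks, that would suggest the splits are computed on the compressed bytes. 
That is an assumption and is not confirmed here. The random sample data and 
gzip below are stand-ins for abook.txt and whichever codec is in use.

```shell
#!/bin/sh
# Sketch: compare newline counts before and after compression.
# Assumption: random data stands in for abook.txt, gzip for the codec.
workdir=$(mktemp -d)

# Stand-in for onelinetext.b64: base64 text with every newline removed.
head -c 200000 /dev/urandom | base64 | tr -d '\n' > "$workdir/onelinetext.b64"
gzip -c "$workdir/onelinetext.b64" > "$workdir/onelinetext.b64.gz"

# Newline bytes a line reader would see after decompression (none here).
decoded_newlines=$(gzip -dc "$workdir/onelinetext.b64.gz" \
    | LC_ALL=C tr -cd '\n' | wc -c | tr -d ' ')

# Newline bytes that happen to occur inside the compressed stream itself.
raw_newlines=$(LC_ALL=C tr -cd '\n' < "$workdir/onelinetext.b64.gz" \
    | wc -c | tr -d ' ')

echo "newline bytes after decompression: $decoded_newlines"
echo "newline bytes in compressed file:  $raw_newlines"

rm -rf "$workdir"
```

Running the same two counts against the real onelinetext.b64.bz2 (with 
bzip2 -dc in place of gzip -dc) would show whether the raw count is anywhere 
near 397.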

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
