Niels Basjes created MAPREDUCE-5925:
---------------------------------------

             Summary: NLineInputFormat silently produces garbage on gzipped input
                 Key: MAPREDUCE-5925
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5925
             Project: Hadoop Map/Reduce
          Issue Type: Bug
            Reporter: Niels Basjes
            Priority: Critical


[ Found while investigating the impact of MAPREDUCE-2094 ]

The org.apache.hadoop.mapreduce.lib.input.NLineInputFormat (probably the mapred 
version too) only makes sense for splittable files.

This input format inherits isSplitable from its superclass FileInputFormat 
(where it always returns true) and reads its records with the LineRecordReader.

When you provide it a gzipped file (non-splittable compression), it still 
creates multiple splits (isSplitable == true). The LineRecordReader cannot 
handle a gzipped file in multiple splits, because the GzipCodec does not 
support splitting.
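
To illustrate why gzip counts as non-splittable, here is a minimal sketch that 
uses only the JDK's java.util.zip classes. It is not Hadoop's actual code path 
and does not reproduce the garbage output itself; it only shows that a gzip 
stream can be decoded from its first byte but not from an arbitrary offset. 
The class and method names (GzipMidStreamDemo, gzip, canDecodeFrom) are made 
up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; plain JDK only, no Hadoop classes involved.
final class GzipMidStreamDemo {

    /** Compresses the given text into a single gzip member. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    /** Returns true if the bytes from offset to the end decode as a gzip stream. */
    static boolean canDecodeFrom(byte[] gz, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(gz, offset, gz.length - offset))) {
            byte[] chunk = new byte[4096];
            while (in.read(chunk) != -1) {
                // Drain the stream; only success or failure matters here.
            }
            return true;
        } catch (IOException e) {
            // ZipException and EOFException are both IOExceptions and land here.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        StringBuilder lines = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            lines.append("line ").append(i).append('\n');
        }
        byte[] gz = gzip(lines.toString());
        System.out.println("from offset 0:      " + canDecodeFrom(gz, 0));
        System.out.println("from the mid-point: " + canDecodeFrom(gz, gz.length / 2));
    }
}
```

A reader that is handed "the second half" of such a file therefore has no 
valid place to start decompressing, which is why creating several splits for 
one gzipped file cannot work.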

The overall effect is that you silently get incorrect results.

Proposed solution: Add detection for this kind of scenario and let the 
NLineInputFormat fail hard when someone tries this. 
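
A rough sketch of what "fail hard" could look like, written against the plain 
JDK so it runs without Hadoop on the classpath. It detects gzip by its two 
magic bytes (0x1f 0x8b, per RFC 1952) and throws instead of continuing. This 
is an assumption for illustration only: inside Hadoop the check would more 
likely ask CompressionCodecFactory for the file's codec and test whether that 
codec implements SplittableCompressionCodec. The names GzipGuard, 
startsWithGzipMagic and failIfGzipped are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical guard; a stand-in for a codec-based check inside Hadoop.
final class GzipGuard {

    // First two bytes of every gzip member (RFC 1952).
    private static final int GZIP_ID1 = 0x1f;
    private static final int GZIP_ID2 = 0x8b;

    /** Returns true if the stream starts with the gzip magic bytes. Consumes up to two bytes. */
    static boolean startsWithGzipMagic(InputStream in) throws IOException {
        int first = in.read();
        int second = in.read();
        return first == GZIP_ID1 && second == GZIP_ID2;
    }

    /** Throws instead of letting the caller split a gzipped input. */
    static void failIfGzipped(String name, InputStream in) throws IOException {
        if (startsWithGzipMagic(in)) {
            throw new IOException(
                "Refusing to split " + name + ": gzip input is not splittable");
        }
    }
}
```

The check consumes the first bytes of the stream, so a caller would have to 
run it on a separate stream (or reopen the file) before the real read.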

I'm not sure if this should go into the LineRecordReader or only in the 
NLineInputFormat.







