Niels Basjes created MAPREDUCE-5925:
---------------------------------------
Summary: NLineInputFormat silently produces garbage on gzipped input
Key: MAPREDUCE-5925
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5925
Project: Hadoop Map/Reduce
Issue Type: Bug
Reporter: Niels Basjes
Priority: Critical
[ Found while investigating the impact of MAPREDUCE-2094 ]
The org.apache.hadoop.mapreduce.lib.input.NLineInputFormat (probably the mapred
version too) only makes sense for splittable files.
This input format inherits isSplitable from its superclass FileInputFormat
(whose default implementation always returns true) and reads its records with
the LineRecordReader.
When you provide it a gzipped file (non-splittable compression), it creates
multiple splits (isSplitable == true), yet the LineRecordReader cannot read a
gzipped file split by split because the GzipCodec does not support splitting.
The overall effect is that you get incorrect results without any error.
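
To illustrate why a gzipped file cannot be read split by split, here is a
minimal sketch using only java.util.zip (not Hadoop code). The class and method
names (GzipSplitDemo, gzip, canDecodeFrom) are made up for this example. It
compresses some lines and then tries to start decompressing at an offset in the
middle of the compressed bytes, which is roughly what a reader positioned at
the start of a later split would have to do.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; it does not use any Hadoop API.
class GzipSplitDemo {

    // Compress the given text into an in-memory gzip stream.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Return true if the compressed bytes can be fully decompressed when
    // reading starts at the given byte offset, false if decoding fails.
    static boolean canDecodeFrom(byte[] compressed, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(compressed, offset, compressed.length - offset))) {
            byte[] chunk = new byte[4096];
            while (in.read(chunk) != -1) {
                // Drain the stream; we only care whether decoding succeeds.
            }
            return true;
        } catch (IOException e) {
            // No gzip header at this position, or the data is not decodable.
            return false;
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("line ").append(i).append('\n');
        }
        byte[] gz = gzip(sb.toString());
        System.out.println("decode from offset 0: " + canDecodeFrom(gz, 0));
        System.out.println("decode from the middle: " + canDecodeFrom(gz, gz.length / 2));
    }
}
```

Only the read that starts at offset 0 sees a valid gzip header, so a split that
begins anywhere else has nothing it can decode on its own.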
Proposed solution: Detect this scenario and let the NLineInputFormat fail hard
when someone feeds it a non-splittable compressed file.
I'm not sure whether this check belongs in the LineRecordReader or only in the
NLineInputFormat.
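
A rough sketch of what "fail hard" could look like, written as standalone Java
rather than as a patch. Everything here is hypothetical: the class
NonSplittableGuard, the method checkSplittable and the suffix list are
assumptions for illustration only. A real fix would more likely ask Hadoop's
CompressionCodecFactory for the codec and test whether it implements
SplittableCompressionCodec, instead of matching file name suffixes by hand.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical guard; not part of Hadoop.
class NonSplittableGuard {

    // Illustrative list of suffixes treated as non-splittable compression.
    // Assumption: extend or replace this with a real codec lookup.
    static final List<String> NON_SPLITTABLE_SUFFIXES = Arrays.asList(".gz");

    // Throw if a file with a non-splittable suffix would be cut into
    // more than one split; otherwise return normally.
    static void checkSplittable(String fileName, int numSplits) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (String suffix : NON_SPLITTABLE_SUFFIXES) {
            if (numSplits > 1 && lower.endsWith(suffix)) {
                throw new IllegalArgumentException(
                        "Cannot create " + numSplits + " splits for " + fileName
                        + ": compression '" + suffix + "' is not splittable");
            }
        }
    }

    public static void main(String[] args) {
        checkSplittable("input.txt", 4);      // plain text: accepted
        checkSplittable("input.txt.gz", 1);   // single split: accepted
        try {
            checkSplittable("input.txt.gz", 4);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Failing at split-creation time, as sketched here, would surface the problem
before any mapper runs; putting the check in the record reader instead would
also cover other input formats that reuse it.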