We have overridden the base class org.apache.hadoop.mapred.MapReduceBase
so that the configure method logs the split name and split section (or,
in the case of gzip'd files, the file name).
We find this very helpful for tracing job errors back to the section of
the input file causing the problem.
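A minimal sketch of that override, assuming the old mapred API (the class
name and log wording are illustrative; "map.input.file", "map.input.start",
and "map.input.length" are per-task properties the framework sets for each
map task):

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

// Use this in place of MapReduceBase as the parent of your mappers/reducers
// so every task logs which split it is working on before processing begins.
public class LoggingMapReduceBase extends MapReduceBase {
    @Override
    public void configure(JobConf job) {
        super.configure(job);
        // For map tasks, the framework records the input file and the
        // byte range of the split in the task's JobConf.
        String file = job.get("map.input.file");
        long start = job.getLong("map.input.start", -1);
        long length = job.getLong("map.input.length", -1);
        System.err.println("Processing split: " + file
                + " [start=" + start + ", length=" + length + "]");
    }
}
```

With this in place, a failed task's stderr log names the exact file (and,
for splittable inputs, the byte range) that triggered the exception.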
Vadim Zaliva wrote:
I have a bunch of gzip files which I am trying to process with a Hadoop
task. The task fails with this exception:
java.io.EOFException: Unexpected end of ZLIB input stream
        at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:223)
        at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141)
        at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:92)
        at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.read(GzipCodec.java:124)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
        at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:128)
        at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:117)
        at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:39)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:147)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2016)
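For what it's worth, that EOFException is what java.util.zip raises when a
gzip stream is cut off mid-stream, e.g. by a partially written or truncated
file. A standalone sketch, independent of Hadoop, that reproduces it (class
and method names here are illustrative):

```java
import java.io.*;
import java.util.zip.*;

public class TruncatedGzipDemo {
    // Try to read a gzip byte stream to the end; report how the read finishes.
    static String probe(byte[] data) {
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) { /* drain the stream */ }
            return "ok";
        } catch (EOFException e) {
            return "EOFException: " + e.getMessage();
        } catch (IOException e) {
            return "IOException: " + e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a valid gzip stream in memory.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write("some line-oriented input data for the demo\n"
                    .getBytes("UTF-8"));
        }
        byte[] full = bos.toByteArray();
        // Cut it in half, as a partially written file would be.
        byte[] truncated = java.util.Arrays.copyOf(full, full.length / 2);
        System.out.println("intact:    " + probe(full));
        // The truncated copy fails with an EOFException, typically
        // "Unexpected end of ZLIB input stream", as in the trace above.
        System.out.println("truncated: " + probe(truncated));
    }
}
```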
I guess some of the files are invalid. However, I could not find the name
of the offending file anywhere in the logs. Due to the huge size of the
dataset, I would rather not extract the files from DFS and verify them
with gzip one by one. Any suggestions? Thanks!
Sincerely,
Vadim
--
Jason Venner
Attributor - Publish with Confidence <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers, contact if interested