On Jan 29, 2008, at 10:50, Ted Dunning wrote:

I was using the library RegexMapper. I did the following to add
logging, which did the trick:


import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.RegexMapper;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class LoggingRegexMapper extends RegexMapper
{
    public static final Log LOG = LogFactory.getLog("LoggingRegexMapper");

    public void configure(JobConf job)
    {
        super.configure(job);
        LOG.info("Input file=" + job.get("map.input.file"));
    }
}

Vadim


Vadim,

If you drill into the task using the JobTracker's web interface, you can
get to the task's XML configuration. That configuration will have the
input file split specification in it.

You may also be able to see the input file elsewhere, but the XML
configuration is definitive.


On 1/29/08 10:33 AM, "Vadim Zaliva" <[EMAIL PROTECTED]> wrote:

I have a bunch of gzip files which I am trying to process with a Hadoop
task. The task fails with this exception:
java.io.EOFException: Unexpected end of ZLIB input stream
    at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:223)
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141)
    at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:92)
    at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.read(GzipCodec.java:124)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
    at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:128)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:117)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:39)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:147)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2016)
I guess some of the files are invalid. However, I could not find the
name of the offending file anywhere in the logs. Given the huge size
of the dataset, I would rather not extract the files from DFS and
verify them with gzip one by one. Any suggestions? Thanks!
Sincerely,
Vadim
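As an aside, the corruption itself can be detected without extracting anything by hand: the same EOFException that kills the task is raised by java.util.zip.GZIPInputStream when a stream is truncated, so decompressing each file to /dev/null and catching IOException is a cheap validity check. Here is a minimal plain-Java sketch of that idea (the class name GzipCheck and the in-memory test data are illustrative, not from the thread):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipCheck {
    // Returns true if the stream decompresses cleanly to EOF,
    // false if decompression fails (truncated or corrupt gzip data).
    static boolean isValidGzip(InputStream in) {
        try (GZIPInputStream gz = new GZIPInputStream(in)) {
            byte[] buf = new byte[8192];
            while (gz.read(buf) != -1) {
                // Discard the decompressed bytes; we only care that
                // the stream reads through to a clean end.
            }
            return true;
        } catch (IOException e) {
            // EOFException ("Unexpected end of ZLIB input stream"),
            // ZipException, etc. all indicate a bad file.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a valid gzip payload in memory...
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write("hello hadoop".getBytes("UTF-8"));
        }
        byte[] good = bos.toByteArray();

        // ...then chop off the last few bytes to simulate a truncated file.
        byte[] truncated = Arrays.copyOf(good, good.length - 5);

        System.out.println(isValidGzip(new ByteArrayInputStream(good)));      // true
        System.out.println(isValidGzip(new ByteArrayInputStream(truncated))); // false
    }
}
```

The same loop works against files opened from DFS (e.g. via FileSystem.open), so bad inputs can be flagged in a single pass over the dataset instead of failing deep inside a map task.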



