On Jan 29, 2008, at 10:50, Ted Dunning wrote:

I was using the library RegexMapper. I did the following to add
logging, which did the trick:


import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.RegexMapper;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class LoggingRegexMapper extends RegexMapper
{
    public static final Log LOG = LogFactory.getLog("LoggingRegexMapper");

    public void configure(JobConf job)
    {
        super.configure(job);
        LOG.info("Input file=" + job.get("map.input.file"));
    }
}

Vadim


Vadim,

If you drill into the task using the JobTracker's web interface, you can
get to the task's XML configuration. That configuration will have the
input file split specification in it.

You may also be able to see the input file elsewhere, but the XML
configuration is definitive.


On 1/29/08 10:33 AM, "Vadim Zaliva" <[EMAIL PROTECTED]> wrote:

I have a bunch of gzip files which I am trying to process with a Hadoop
task. The task fails with this exception:
java.io.EOFException: Unexpected end of ZLIB input stream
    at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:223)
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141)
    at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:92)
    at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.read(GzipCodec.java:124)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
    at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:128)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:117)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:39)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:147)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2016)
I guess some of the files are invalid. However, I could not find the
name of the offending file anywhere in the logs. Given the huge size
of the dataset, I would rather not extract the files from DFS and
verify them with gzip one by one. Any suggestions? Thanks!
Sincerely,
Vadim
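As an aside, the corruption itself can be detected without extracting anything by hand: the same EOFException that kills the task is raised by java.util.zip.GZIPInputStream when a stream is truncated, so decompressing each file to /dev/null and catching IOException is a cheap validity check. Here is a minimal plain-Java sketch of that idea (the class name GzipCheck and the in-memory test data are illustrative, not from the thread):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipCheck {
    // Returns true if the stream decompresses cleanly to EOF,
    // false if decompression fails (truncated or corrupt gzip data).
    static boolean isValidGzip(InputStream in) {
        try (GZIPInputStream gz = new GZIPInputStream(in)) {
            byte[] buf = new byte[8192];
            while (gz.read(buf) != -1) {
                // Discard the decompressed bytes; we only care that
                // the stream reads through to a clean end.
            }
            return true;
        } catch (IOException e) {
            // EOFException ("Unexpected end of ZLIB input stream"),
            // ZipException, etc. all indicate a bad file.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a valid gzip payload in memory...
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write("hello hadoop".getBytes("UTF-8"));
        }
        byte[] good = bos.toByteArray();

        // ...then chop off the last few bytes to simulate a truncated file.
        byte[] truncated = Arrays.copyOf(good, good.length - 5);

        System.out.println(isValidGzip(new ByteArrayInputStream(good)));      // true
        System.out.println(isValidGzip(new ByteArrayInputStream(truncated))); // false
    }
}
```

The same loop works against files opened from DFS (e.g. via FileSystem.open), so bad inputs can be flagged in a single pass over the dataset instead of failing deep inside a map task.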



