Today I ran into a strange situation: while some MapReduce jobs were running, all 
the TaskTrackers in the cluster simply disappeared without any apparent reason, 
but the JobTracker remained alive as if nothing had happened. Its web interface 
was even still up, showing zero capacity for maps and reduces and all the same 
jobs in the running state (in fact, the TaskTracker$Child processes also remained 
in memory). Examining the TaskTrackers' logs turned up (almost) the same exception 
at the tail of each, like this one:

2008-09-16 06:27:11,244 WARN org.apache.hadoop.mapred.TaskTracker: Error initializing task_200809151253_1938_m_000003_0:
java.lang.InternalError: jzentry == 0,
 jzfile = 46912646564160,
 total = 148,
 name = /data/hadoop/root/mapred/local/taskTracker/jobcache/job_200809151253_1938/jars/job.jar,
 i = 3,
 message = invalid LOC header (bad signature)
        at java.util.zip.ZipFile$3.nextElement(ZipFile.java:429)
        at java.util.zip.ZipFile$3.nextElement(ZipFile.java:415)
        at java.util.jar.JarFile$1.nextElement(JarFile.java:221)
        at java.util.jar.JarFile$1.nextElement(JarFile.java:220)
        at org.apache.hadoop.util.RunJar.unJar(RunJar.java:40)
        at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:708)
        at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1274)
        at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:915)
        at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1310)
        at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2251)

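In case it helps with reproducing this outside of the TaskTracker, here is just a 
quick sketch (the class name and the default path are only for illustration) that 
enumerates the entries of a localized job.jar the same way RunJar.unJar does, to 
check whether the archive itself is corrupted on disk:

import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Enumerates the entries of a localized job.jar, hitting the same
// ZipFile/JarFile code path as RunJar.unJar in the stack trace above.
public class JobJarCheck {
    public static void main(String[] args) throws Exception {
        // Pass the jobcache path from the log on an affected node;
        // the default below is only an example.
        String path = args.length > 0 ? args[0]
            : "/data/hadoop/root/mapred/local/taskTracker/jobcache/job_200809151253_1938/jars/job.jar";
        JarFile jar = new JarFile(path);
        Enumeration<JarEntry> entries = jar.entries();
        int count = 0;
        while (entries.hasMoreElements()) {
            // A corrupted archive should throw the same
            // "invalid LOC header" InternalError right here.
            entries.nextElement();
            count++;
        }
        jar.close();
        System.out.println(path + ": read " + count + " entries without errors");
    }
}
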
The exact exception strings were different, but the stack traces on all nodes 
were close to each other. At first it looked like I could just shrug it off and 
simply start the TaskTrackers once more, but something went wrong. The fsck 
utility found some missing blocks, and the HBase instance running on the same 
cluster simply became unavailable and later failed to start up (these seem to be 
related issues). HBase reported SocketTimeoutExceptions (in fact only on about 
two servers at a time, but after a cluster restart the role of "victim" moved to 
other nodes), while the HDFS logs occasionally contained messages about being 
unable to find some old blocks or to create new ones. I've double-checked the 
usual suspects: DNS problems, network collisions, iptables, possible disk 
corruption, and the like, but even a complete cluster reboot hasn't changed the 
situation a bit.
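
For what it's worth, one way to narrow down which files are actually affected 
would be to walk the HBase root directory and read every file end to end. A rough 
sketch against the FileSystem API (the class name is made up and /hbase is only a 
placeholder for the actual hbase.rootdir):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Recursively reads every file under the given HDFS directory and
// prints the ones that cannot be read (e.g. because of missing blocks).
public class ReadCheck {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        check(fs, new Path(args.length > 0 ? args[0] : "/hbase"));
    }

    private static void check(FileSystem fs, Path path) throws IOException {
        FileStatus[] children = fs.listStatus(path);
        if (children == null) {
            return;
        }
        for (FileStatus status : children) {
            if (status.isDir()) {
                check(fs, status.getPath());
                continue;
            }
            byte[] buf = new byte[64 * 1024];
            try {
                FSDataInputStream in = fs.open(status.getPath());
                while (in.read(buf) != -1) {
                    // just read the file through
                }
                in.close();
            } catch (IOException e) {
                System.out.println("FAILED: " + status.getPath() + " - " + e);
            }
        }
    }
}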

P.S.: Hadoop 0.17.1, HBase 0.2.0, Debian Etch
P.S.: If it matters: the MR jobs running at that moment were manipulating data 
in HBase, and all the blocks that report problems are located in the HBase root 
directory (at least it looks that way).

Thanks,
Ivan Blinkov
