I am running Hadoop streaming. After around 42 jobs on an 18-node cluster, the JobTracker stops responding. This happens with normally working code. Here are the symptoms.

1. A job is running, but it pauses with reduce stuck at XX%
2. "hadoop job -list" hangs or takes a very long time to return
3. In the Ganglia metrics on the JobTracker node (a rough watchdog
   sketch for these metrics follows this list):
     a. jvm.metrics__JobTracker__gcTimeMillis rises above 20 k (20 seconds) before failure
     b. jvm.metrics__JobTracker__memHeapUsedM rises above 600 before failure
     c. jvm.metrics__JobTracker__gcCount rises above 1 k before failure
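
In case it is useful for catching this early, below is a rough, untested
watchdog sketch that polls the JobTracker's gmond and warns when
gcTimeMillis climbs past the ~20 k level I see before the hang. It
assumes gmond answers on its default port 8649 with its usual XML dump;
the host name and the threshold below are placeholders, not values from
my cluster config.

#!/bin/sh
# Sketch only: poll the JobTracker's gmond once a minute and warn when
# the reported GC time passes the level observed shortly before the hang.
HOST=jobtracker.example.com   # placeholder: your JobTracker host
THRESHOLD=20000               # gcTimeMillis level seen before failure

while true; do
  GC=$(nc "$HOST" 8649 \
    | sed -n 's/.*NAME="jvm.metrics__JobTracker__gcTimeMillis" VAL="\([0-9.]*\)".*/\1/p' \
    | head -n 1)
  if [ -n "$GC" ] && [ "${GC%.*}" -gt "$THRESHOLD" ]; then
    echo "$(date) WARNING: JobTracker gcTimeMillis=$GC exceeds $THRESHOLD"
  fi
  sleep 60
done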


The streaming client's progress ticker looks like this.

09/04/06 03:06:28 INFO streaming.StreamJob:  map 24%  reduce 7%
09/04/06 03:13:44 INFO streaming.StreamJob:  map 25%  reduce 7%
After the 03:13:44 line, it hangs for more than 15 minutes.

In the JobTracker log, I see this.

2009-04-04 04:19:13,563 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8143535428142072268_95993 failed because recovery from primary datanode 10.1.0.156:50010 failed 4 times. Will retry...

After restarting both the DFS and MapReduce daemons on all nodes, the problem goes away, and the previously stuck job proceeds without failure.
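
For reference, the restart is the standard stop/start script cycle, run
from the master node. Paths are relative to the Hadoop install directory
and may differ in your layout.

bin/stop-mapred.sh   # stop the JobTracker and all TaskTrackers
bin/stop-dfs.sh      # stop the NameNode and all DataNodes
bin/start-dfs.sh     # bring HDFS back up
bin/start-mapred.sh  # bring MapReduce back up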

Does anyone else see this problem?

David Kellogg
