I am running Hadoop streaming. After around 42 jobs on an 18-node
cluster, the jobtracker stops responding. This happens with code that
normally works. Here are the symptoms.
1. A job is running, but it pauses with reduce stuck at XX%
2. "hadoop job -list" hangs or takes a very long time to return
3. In the Ganglia metrics on the jobtracker node:
   a. jvm.metrics__JobTracker__gcTimeMillis rises above 20,000 ms
      (20 seconds) before failure
   b. jvm.metrics__JobTracker__memHeapUsedM rises above 600 MB
      before failure
   c. jvm.metrics__JobTracker__gcCount rises above 1,000 before
      failure
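To put a number on symptom 2, something like the following could time
the command and give up after a limit. This is only a rough sketch:
the helper name time_command and the 60-second limit are my own
choices for illustration, not anything Hadoop provides.

```python
import subprocess
import time


def time_command(cmd, timeout_s=60.0):
    """Run cmd and return (elapsed_seconds, finished).

    finished is False if the command was still running after timeout_s
    seconds, in which case it is killed.
    """
    start = time.monotonic()
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        finished = True
    except subprocess.TimeoutExpired:
        finished = False
    return time.monotonic() - start, finished


# Example call on a node where the hadoop CLI is on the PATH:
#   elapsed, finished = time_command(["hadoop", "job", "-list"])
#   print("finished" if finished else "timed out", "after %.1f s" % elapsed)
```

Run from cron every few minutes, this would show whether the response
time creeps up gradually or falls off a cliff.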
The ticker looks like this.
09/04/06 03:06:28 INFO streaming.StreamJob: map 24% reduce 7%
09/04/06 03:13:44 INFO streaming.StreamJob: map 25% reduce 7%
After the 03:13:44 line, it hangs for more than 15 minutes.
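For anyone who wants to check their own console output for the same
pattern, here is a minimal sketch that parses the progress lines and
reports long gaps between them. It assumes the timestamp is in
yy/MM/dd HH:mm:ss form; the function names and the 5-minute threshold
are made up for this example.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
#   09/04/06 03:06:28 INFO streaming.StreamJob: map 24% reduce 7%
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO streaming\.StreamJob:"
    r"\s+map (\d+)%\s+reduce (\d+)%"
)


def parse_progress(lines):
    """Return a list of (timestamp, map_pct, reduce_pct) tuples."""
    progress = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            # Assumption: the date is year/month/day with a 2-digit year.
            ts = datetime.strptime(m.group(1), "%y/%m/%d %H:%M:%S")
            progress.append((ts, int(m.group(2)), int(m.group(3))))
    return progress


def find_stalls(progress, threshold=timedelta(minutes=5)):
    """Return (start, end, gap) for consecutive updates further apart than threshold."""
    stalls = []
    for prev, cur in zip(progress, progress[1:]):
        gap = cur[0] - prev[0]
        if gap > threshold:
            stalls.append((prev[0], cur[0], gap))
    return stalls


ticker = [
    "09/04/06 03:06:28 INFO streaming.StreamJob: map 24% reduce 7%",
    "09/04/06 03:13:44 INFO streaming.StreamJob: map 25% reduce 7%",
]
stalls = find_stalls(parse_progress(ticker))
for start, end, gap in stalls:
    print("no progress update between", start, "and", end, "gap:", gap)
```

On the two lines above, the gap between updates is already more than
7 minutes, before the final hang begins.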
In the jobtracker log, I see this.
2009-04-04 04:19:13,563 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_-8143535428142072268_95993 failed because
recovery from primary datanode 10.1.0.156:50010 failed 4 times. Will
retry...
After restarting both dfs and mapreduce on all nodes, the problem
goes away, and the formerly non-working job proceeds without failure.
Does anyone else see this problem?
David Kellogg