Hi!

I was running a Hadoop cluster on Amazon EC2 instances, and after 2 days of
work one of the worker nodes simply died (I cannot connect to the instance
either). That node also appears on the dfshealth page as a dead node.
Up to this point everything is normal.

Unfortunately the job it was running didn't survive. The cluster had 8
worker nodes, each with 4 map slots and 2 reduce slots. The job in question
had ~1200 map tasks and 10 reduce tasks.
After the node died, I see around 31 failed attempts in the jobtracker
log. The log is very similar to the one somebody posted here:
http://pastie.org/pastes/1270614
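
For context, the slot counts come from the tasktracker configuration,
roughly the following mapred-site.xml entries (an illustrative sketch using
the stock Hadoop 0.20/CDH3 property names, not a paste from my actual
config):

<!-- mapred-site.xml (illustrative): per-tasktracker slot counts -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>  <!-- 4 concurrent map tasks per worker node -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>  <!-- 2 concurrent reduce tasks per worker node -->
</property>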

Some of the attempts (but not all!) have been retried, and I saw at least
two of them eventually reach a successful state.
The following two lines appear several times in my jobtracker log:
2011-01-29 15:50:34,956 WARN org.apache.hadoop.mapred.JobInProgress: Running
list for reducers missing!! Job details are missing.
2011-01-29 15:50:34,956 WARN org.apache.hadoop.mapred.JobInProgress: Failed
cache for reducers missing!! Job details are missing.

This pair of log lines could be the signal that the job couldn't be
finished by re-scheduling the failed attempts.
I have seen nothing special in the namenode logs.

Of course I reran the failed job and it finished successfully. But my
problem is that I would like to understand the failover conditions. What
could be lost, and which part of Hadoop is not fault tolerant in the sense
that those warnings appear? Is there a way to control this kind of
situation?
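
The only knobs I am aware of for this are the per-job retry and blacklist
limits, roughly like the sketch below (stock Hadoop 0.20/CDH3 property
names with their default values; I have not verified that tuning them
avoids the warnings above):

<!-- per-job retry/blacklist settings (defaults shown) -->
<property>
  <name>mapred.map.max.attempts</name>
  <value>4</value>  <!-- attempts per map task before the job fails -->
</property>
<property>
  <name>mapred.reduce.max.attempts</name>
  <value>4</value>  <!-- attempts per reduce task before the job fails -->
</property>
<property>
  <name>mapred.max.tracker.failures</name>
  <value>4</value>  <!-- task failures on one tracker before the job blacklists it -->
</property>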

I am using CDH3b3, which is a development version of Hadoop. Does anybody
know about a specific bug or fix that might solve this problem in the near
future?

Regards
Tibor Kiss
