Jothi, thanks for the explanation. One question though: why shouldn't timed-out tasks be retried on a different machine? As you pointed out, the timeout could very well have been due to the machine having problems. To me a timeout is just like any other kind of failure.
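For context, and to make sure I'm reading the knobs right: as far as I can tell these are the 0.19.x settings that govern per-attempt retries, per-tracker blacklisting for a job, and the report-status timeout. This is only a sketch; the class name is made up and the values are illustrative, not a recommendation.

import org.apache.hadoop.mapred.JobConf;

public class RetryKnobs {
  // Sketch only: property names/setters from the 0.19.x mapred API; values illustrative.
  public static JobConf configure(JobConf conf) {
    // Max attempts per task before the whole job is failed
    // (mapred.map.max.attempts / mapred.reduce.max.attempts, default 4).
    conf.setMaxMapAttempts(4);
    conf.setMaxReduceAttempts(4);

    // Task failures on one tasktracker before this job stops scheduling
    // tasks there (mapred.max.tracker.failures, default 4).
    conf.setMaxTaskFailuresPerTracker(3);

    // How long an attempt may go without reporting status before it is
    // killed with "failed to report status" (mapred.task.timeout, in ms).
    conf.setLong("mapred.task.timeout", 30 * 60 * 1000L);
    return conf;
  }
}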
-- Stefan

> From: Jothi Padmanabhan <joth...@yahoo-inc.com>
> Reply-To: <core-user@hadoop.apache.org>
> Date: Mon, 13 Apr 2009 19:00:38 +0530
> To: <core-user@hadoop.apache.org>
> Subject: Re: Reduce task attempt retry strategy
>
> Currently, only failed tasks are attempted on a node other than the one
> where they failed. For killed tasks, there is no such retry policy.
>
> "failed to report status" usually indicates that the task did not report
> sufficient progress. However, it is possible that the task itself was not
> progressing fast enough because the machine where it ran had problems.
>
> On 4/8/09 12:33 AM, "Stefan Will" <stefan.w...@gmx.net> wrote:
>
>> My cluster has 27 nodes with a total reduce task capacity of 54. The job
>> had 31 reducers. I actually had a task today that showed the behavior
>> you're describing: 3 tries on one machine, and then the 4th on a
>> different one.
>>
>> As for the particular job I was talking about before, here are the stats:
>>
>> Kind     Total (successful+failed+killed)  Successful  Failed  Killed  Start Time           Finish Time
>> Setup     1                                 1           0       0      4-Apr-2009 00:30:16  4-Apr-2009 00:30:33 (17sec)
>> Map      64                                49          12       3      4-Apr-2009 00:30:33  4-Apr-2009 01:11:15 (40mins, 41sec)
>> Reduce   34                                30           4       0      4-Apr-2009 00:30:44  4-Apr-2009 04:31:36 (4hrs, 52sec)
>> Cleanup   4                                 0           4       0      4-Apr-2009 04:31:36  4-Apr-2009 06:32:00 (2hrs, 24sec)
>>
>> Not sure what to look for in the jobtracker log. All it shows for that
>> particular failed task is that it assigned it to the same machine 4 times
>> and then eventually failed. Perhaps something to note is that all 4
>> failures were due to timeouts:
>>
>> "Task attempt_200904031942_0002_r_000013_3 failed to report status for
>> 1802 seconds. Killing!"
>>
>> Also, looking at the logs, there was a map task too that was retried on
>> that particular box 4 times without ever going to a different one. Perhaps
>> it had something to do with the way this machine failed: the jobtracker
>> still considered it live, while all actual tasks assigned to it timed out.
>>
>> -- Stefan
>>
>>> From: Amar Kamat <ama...@yahoo-inc.com>
>>> Reply-To: <core-user@hadoop.apache.org>
>>> Date: Tue, 07 Apr 2009 10:05:16 +0530
>>> To: <core-user@hadoop.apache.org>
>>> Subject: Re: Reduce task attempt retry strategy
>>>
>>> Stefan Will wrote:
>>>> Hi,
>>>>
>>>> I had a flaky machine the other day that was still accepting jobs and
>>>> sending heartbeats, but caused all reduce task attempts to fail. This in
>>>> turn caused the whole job to fail because the same reduce task was
>>>> retried 3 times on that particular machine.
>>>>
>>> What is your cluster size? If a task fails on a machine, it is re-tried
>>> on some other machine (depending on the number of good machines left in
>>> the cluster). After a certain number of failures, the machine will be
>>> blacklisted (again depending on the number of machines left in the
>>> cluster). 3 different reducers might be scheduled on that machine, but
>>> that should not lead to job failure. Can you explain in detail what
>>> exactly happened? Find out from the jobtracker's log where the attempts
>>> got scheduled.
>>> Amar
>>>> Perhaps I'm confusing this with the block placement strategy in HDFS,
>>>> but I always thought that the framework would retry a task on a
>>>> different machine if retries on the original machine keep failing. E.g.
>>>> I would have expected it to retry once or twice on the same machine, but
>>>> then switch to a different one to minimize the likelihood of getting
>>>> stuck on a bad machine.
>>>>
>>>> What is the expected behavior in 0.19.1 (which I'm running)? Any plans
>>>> for improving on this in the future?
>>>>
>>>> Thanks,
>>>> Stefan
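
P.S. In case it's useful to anyone else who hits the "failed to report status" timeout: below is a minimal sketch (old org.apache.hadoop.mapred API, as in 0.19.x; the class and the batching threshold are made up for illustration) of a reducer that pings the framework via reporter.progress() while doing long per-key work, so the attempt isn't killed for inactivity. It obviously doesn't help when the whole machine is wedged, which seems to have been my case.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Illustrative reducer: sums values per key and reports progress as it goes.
public class SummingReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  public void reduce(Text key, Iterator<LongWritable> values,
                     OutputCollector<Text, LongWritable> output,
                     Reporter reporter) throws IOException {
    long sum = 0;
    long seen = 0;
    while (values.hasNext()) {
      sum += values.next().get();
      // Heartbeat every so often so the TaskTracker sees the attempt as live
      // and doesn't kill it for failing to report status.
      if (++seen % 10000 == 0) {
        reporter.progress();
        reporter.setStatus("processed " + seen + " values for " + key);
      }
    }
    output.collect(key, new LongWritable(sum));
  }
}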