There are many maps finishing in 4 to 15 minutes, taking less time closer to the end of the job, so no timeout there. The state of the reduce tasks is shuffle; they are grabbing the map outputs as the maps finish. The current job took 50:43:37, and each of the reduce tasks failed twice in that time, once at 24 hours in and a second time at 48 hours in. On the next run in a few days I will test setting mapred.jobtracker.retirejob.interval and mapred.userlog.retain.hours to 72 hours and see if that solves the problem. So not a bad guess, though it seems odd that both times all the tasks failed at the same time, within 5 minutes of the 24-hour mark.
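For reference, something like the below is what I plan to try in hadoop-site.xml (mapred-site.xml on newer releases). This is just a sketch of my plan, assuming mapred.jobtracker.retirejob.interval is specified in milliseconds (72 hours = 72 * 60 * 60 * 1000) while mapred.userlog.retain.hours is in hours, and I believe the jobtracker and tasktrackers need a restart to pick these up:

<property>
  <name>mapred.jobtracker.retirejob.interval</name>
  <!-- assumed to be milliseconds: 72 hours -->
  <value>259200000</value>
</property>
<property>
  <name>mapred.userlog.retain.hours</name>
  <!-- hours to keep user logs after job completion -->
  <value>72</value>
</property>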

Looks like from the tasktracker logs I get the WARN below:

org.apache.hadoop.mapred.TaskRunner: attempt_200903212204_0005_r_000001_1 Child Error


Grepping the tasktracker log for one of the reduces that failed (I do not have debug turned on, so all I have are the INFO logs):

2009-03-25 18:37:45,473 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903212204_0005_r_000001_1 0.3083758% reduce > copy (2360 of 2551 at 0.87 MB/s) >
2009-03-25 18:37:48,476 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903212204_0005_r_000001_1 0.3083758% reduce > copy (2360 of 2551 at 0.87 MB/s) >
2009-03-25 18:37:49,194 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_000001_1/output/file.out in any of the configured local directories
2009-03-25 18:37:49,480 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_000001_1/output/file.out in any of the configured local directories
2009-03-25 18:37:51,481 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903212204_0005_r_000001_1 0.3083758% reduce > copy (2360 of 2551 at 0.87 MB/s) >
2009-03-25 18:37:54,372 WARN org.apache.hadoop.mapred.TaskRunner: attempt_200903212204_0005_r_000001_1 Child Error
2009-03-25 18:37:54,497 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_000001_1/output/file.out in any of the configured local directories
2009-03-25 18:37:57,400 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200903212204_0005_r_000001_1 done; removing files.
2009-03-25 18:42:25,191 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_200903212204_0005_r_000001_1 task's state:FAILED_UNCLEAN
2009-03-25 18:42:25,192 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_200903212204_0005_r_000001_1
2009-03-25 18:42:25,192 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 1 and trying to launch attempt_200903212204_0005_r_000001_1
2009-03-25 18:42:30,134 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_200903212204_0005_r_437314552 given task: attempt_200903212204_0005_r_000001_1
2009-03-25 18:42:30,196 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_000001_1/output/file.out in any of the configured local directories
2009-03-25 18:42:32,530 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903212204_0005_r_000001_1 0.0%
2009-03-25 18:42:32,555 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903212204_0005_r_000001_1 0.0% cleanup
2009-03-25 18:42:32,567 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_200903212204_0005_r_000001_1 is done.
2009-03-25 18:42:32,567 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_200903212204_0005_r_000001_1 was 0
2009-03-25 18:42:32,568 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200903212204_0005_r_000001_1 done; removing files.


Grepping the jobtracker log for the same task:

2009-03-25 18:37:54,500 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_200903212204_0005_r_000001_1: java.io.IOException: Task process exit with nonzero status of 255.
2009-03-25 18:42:25,186 INFO org.apache.hadoop.mapred.JobTracker: Adding task (cleanup)'attempt_200903212204_0005_r_000001_1' to tip task_200903212204_0005_r_000001, for tracker 'tracker_server-1:localhost.localdomain/127.0.0.1:38816'
2009-03-25 18:42:32,589 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200903212204_0005_r_000001_1' from 'tracker_server-1:localhost.localdomain/127.0.0.1:38816'

"Amar Kamat" <ama...@yahoo-inc.com> wrote in message news:49cafd8e.8010...@yahoo-inc.com...
Amareshwari Sriramadasu wrote:
Set mapred.jobtracker.retirejob.interval
This is used to retire completed jobs.
and mapred.userlog.retain.hours to a higher value.
This is used to discard user logs.
By default, their values are 24 hours. These might be the reason for failure, though I'm not sure.

Thanks
Amareshwari

Billy Pearson wrote:
I am seeing, on one of my long-running jobs (about 50-60 hours), that after 24 hours all active reduce tasks fail with the error message:

java.io.IOException: Task process exit with nonzero status of 255.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

Is there something in the config that I can change to stop this?

Every time, within 1 minute of the 24-hour mark, they all fail at the same time.
That wastes a lot of resources downloading the map outputs and merging them again.
What is the state of the reducers (copy or sort)? Check the jobtracker/tasktracker logs to see what state these reducers are in and whether a kill signal was issued. Either the jobtracker/tasktracker is issuing a kill signal or the reducers are committing suicide. Were there any failures on the reducer side while pulling the map output? Also, what is the nature of the job? How fast do the maps finish?
Amar

Billy
