There are many maps finishing in 4 to 15 minutes, taking less time closer to the end of the job, so no timeout there. The state of the reduce tasks is shuffle; they are grabbing the map outputs as the maps finish. The current job took 50:43:37, and each of the reduce tasks failed twice in that time, once at 24 hours in and a second time at 48 hours in. On the next run in a few days I will test setting mapred.jobtracker.retirejob.interval and mapred.userlog.retain.hours to 72 hours and see if that solves the problem. So not a bad guess, though it seems odd that both times all the tasks failed at the same time, within 5 minutes of the 24-hour mark.
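For reference, something like the below is what I plan to try in hadoop-site.xml (mapred-site.xml on newer releases). This is just a sketch of my plan, assuming mapred.jobtracker.retirejob.interval is specified in milliseconds (72 hours = 72 * 60 * 60 * 1000) while mapred.userlog.retain.hours is in hours, and I believe the jobtracker and tasktrackers need a restart to pick these up:

<property>
  <name>mapred.jobtracker.retirejob.interval</name>
  <!-- assumed to be milliseconds: 72 hours -->
  <value>259200000</value>
</property>
<property>
  <name>mapred.userlog.retain.hours</name>
  <!-- hours to keep user logs after job completion -->
  <value>72</value>
</property>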

Looks like from the tasktracker logs I get the WARN below:

org.apache.hadoop.mapred.TaskRunner: attempt_200903212204_0005_r_000001_1 Child Error


Grepping the tasktracker log for one of the reduces that failed (I do not have debug turned on, so all I have are the INFO logs):

2009-03-25 18:37:45,473 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903212204_0005_r_000001_1 0.3083758% reduce > copy (2360 of 2551 at 0.87 MB/s) >
2009-03-25 18:37:48,476 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903212204_0005_r_000001_1 0.3083758% reduce > copy (2360 of 2551 at 0.87 MB/s) >
2009-03-25 18:37:49,194 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_000001_1/output/file.out in any of the configured local directories
2009-03-25 18:37:49,480 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_000001_1/output/file.out in any of the configured local directories
2009-03-25 18:37:51,481 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903212204_0005_r_000001_1 0.3083758% reduce > copy (2360 of 2551 at 0.87 MB/s) >
2009-03-25 18:37:54,372 WARN org.apache.hadoop.mapred.TaskRunner: attempt_200903212204_0005_r_000001_1 Child Error
2009-03-25 18:37:54,497 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_000001_1/output/file.out in any of the configured local directories
2009-03-25 18:37:57,400 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200903212204_0005_r_000001_1 done; removing files.
2009-03-25 18:42:25,191 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_200903212204_0005_r_000001_1 task's state:FAILED_UNCLEAN
2009-03-25 18:42:25,192 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_200903212204_0005_r_000001_1
2009-03-25 18:42:25,192 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 1 and trying to launch attempt_200903212204_0005_r_000001_1
2009-03-25 18:42:30,134 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_200903212204_0005_r_437314552 given task: attempt_200903212204_0005_r_000001_1
2009-03-25 18:42:30,196 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_000001_1/output/file.out in any of the configured local directories
2009-03-25 18:42:32,530 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903212204_0005_r_000001_1 0.0%
2009-03-25 18:42:32,555 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903212204_0005_r_000001_1 0.0% cleanup
2009-03-25 18:42:32,567 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_200903212204_0005_r_000001_1 is done.
2009-03-25 18:42:32,567 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_200903212204_0005_r_000001_1 was 0
2009-03-25 18:42:32,568 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200903212204_0005_r_000001_1 done; removing files.


Grepping the jobtracker log for the same task:

2009-03-25 18:37:54,500 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_200903212204_0005_r_000001_1: java.io.IOException: Task process exit with nonzero status of 255.
2009-03-25 18:42:25,186 INFO org.apache.hadoop.mapred.JobTracker: Adding task (cleanup)'attempt_200903212204_0005_r_000001_1' to tip task_200903212204_0005_r_000001, for tracker 'tracker_server-1:localhost.localdomain/127.0.0.1:38816'
2009-03-25 18:42:32,589 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200903212204_0005_r_000001_1' from 'tracker_server-1:localhost.localdomain/127.0.0.1:38816'

"Amar Kamat" <ama...@yahoo-inc.com> wrote in message news:49cafd8e.8010...@yahoo-inc.com...
Amareshwari Sriramadasu wrote:
Set mapred.jobtracker.retirejob.interval
This is used to retire completed jobs.
and mapred.userlog.retain.hours to a higher value.
This is used to discard user logs.
By default, their values are 24 hours. These might be the reason for failure, though I'm not sure.

Thanks
Amareshwari

Billy Pearson wrote:
I am seeing, on one of my long-running jobs (about 50-60 hours), that after 24 hours all active reduce tasks fail with the error message:

java.io.IOException: Task process exit with nonzero status of 255.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

Is there something in the config that I can change to stop this?

Every time, within 1 minute of the 24-hour mark, they all fail at the same time.
That wastes a lot of resources downloading the map outputs and merging them again.
What is the state of the reducers (copy or sort)? Check the jobtracker/tasktracker logs to see what state these reducers are in and whether a kill signal was issued. Either the jobtracker/tasktracker is issuing a kill signal or the reducers are committing suicide. Were there any failures on the reducer side while pulling the map output? Also, what is the nature of the job? How fast do the maps finish?
Amar

Billy
