There are many maps finishing, taking from 4 minutes to 15 minutes, with less
time closer to the end of the job, so no timeout there. The state of the
reduce tasks is shuffle; they are grabbing the map outputs as the maps finish.
The current job took 50:43:37, and each of the reduce tasks failed twice in
that time: once at 24 hours in and again at 48 hours in. On the next run in a
few days I will test setting mapred.jobtracker.retirejob.interval and
mapred.userlog.retain.hours to 72 hours and see if that solves the problem.
So not a bad guess, though it seems odd that both times all the tasks failed
within 5 minutes of the 24-hour mark, at the same time.
From the tasktracker logs it looks like I get the WARN below:
org.apache.hadoop.mapred.TaskRunner: attempt_200903212204_0005_r_000001_1
Child Error
Grepping the tasktracker log for one of the reduce tasks that failed; I do
not have debug turned on, so all I have are the INFO logs:
2009-03-25 18:37:45,473 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_200903212204_0005_r_000001_1 0.3083758% reduce > copy (2360 of 2551
at 0.87 MB/s) >
2009-03-25 18:37:48,476 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_200903212204_0005_r_000001_1 0.3083758% reduce > copy (2360 of 2551
at 0.87 MB/s) >
2009-03-25 18:37:49,194 INFO org.apache.hadoop.mapred.TaskTracker:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_000001_1/output/file.out
in any of the configured local directories
2009-03-25 18:37:49,480 INFO org.apache.hadoop.mapred.TaskTracker:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_000001_1/output/file.out
in any of the configured local directories
2009-03-25 18:37:51,481 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_200903212204_0005_r_000001_1 0.3083758% reduce > copy (2360 of 2551
at 0.87 MB/s) >
2009-03-25 18:37:54,372 WARN org.apache.hadoop.mapred.TaskRunner:
attempt_200903212204_0005_r_000001_1 Child Error
2009-03-25 18:37:54,497 INFO org.apache.hadoop.mapred.TaskTracker:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_000001_1/output/file.out
in any of the configured local directories
2009-03-25 18:37:57,400 INFO org.apache.hadoop.mapred.TaskRunner:
attempt_200903212204_0005_r_000001_1 done; removing files.
2009-03-25 18:42:25,191 INFO org.apache.hadoop.mapred.TaskTracker:
LaunchTaskAction (registerTask): attempt_200903212204_0005_r_000001_1 task's
state:FAILED_UNCLEAN
2009-03-25 18:42:25,192 INFO org.apache.hadoop.mapred.TaskTracker: Trying to
launch : attempt_200903212204_0005_r_000001_1
2009-03-25 18:42:25,192 INFO org.apache.hadoop.mapred.TaskTracker: In
TaskLauncher, current free slots : 1 and trying to launch
attempt_200903212204_0005_r_000001_1
2009-03-25 18:42:30,134 INFO org.apache.hadoop.mapred.TaskTracker: JVM with
ID: jvm_200903212204_0005_r_437314552 given task:
attempt_200903212204_0005_r_000001_1
2009-03-25 18:42:30,196 INFO org.apache.hadoop.mapred.TaskTracker:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_000001_1/output/file.out
in any of the configured local directories
2009-03-25 18:42:32,530 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_200903212204_0005_r_000001_1 0.0%
2009-03-25 18:42:32,555 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_200903212204_0005_r_000001_1 0.0% cleanup
2009-03-25 18:42:32,567 INFO org.apache.hadoop.mapred.TaskTracker: Task
attempt_200903212204_0005_r_000001_1 is done.
2009-03-25 18:42:32,567 INFO org.apache.hadoop.mapred.TaskTracker: reported
output size for attempt_200903212204_0005_r_000001_1 was 0
2009-03-25 18:42:32,568 INFO org.apache.hadoop.mapred.TaskRunner:
attempt_200903212204_0005_r_000001_1 done; removing files.
Grepping the jobtracker log for the same task:
2009-03-25 18:37:54,500 INFO org.apache.hadoop.mapred.TaskInProgress: Error
from attempt_200903212204_0005_r_000001_1: java.io.IOException: Task process
exit with nonzero status of 255.
2009-03-25 18:42:25,186 INFO org.apache.hadoop.mapred.JobTracker: Adding
task (cleanup)'attempt_200903212204_0005_r_000001_1' to tip
task_200903212204_0005_r_000001, for tracker
'tracker_server-1:localhost.localdomain/127.0.0.1:38816'
2009-03-25 18:42:32,589 INFO org.apache.hadoop.mapred.JobTracker: Removed
completed task 'attempt_200903212204_0005_r_000001_1' from
'tracker_server-1:localhost.localdomain/127.0.0.1:38816'
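For reference, raising both settings to 72 hours would look roughly like the
sketch below in mapred-site.xml. I am assuming
mapred.jobtracker.retirejob.interval is specified in milliseconds (its
default of 86400000 ms is 24 hours) while mapred.userlog.retain.hours is in
hours; worth double-checking against mapred-default.xml.

```xml
<!-- Sketch only: raise both cleanup windows from the 24-hour default to 72 hours.
     Assumes retirejob.interval is in milliseconds and retain.hours is in hours. -->
<property>
  <name>mapred.jobtracker.retirejob.interval</name>
  <value>259200000</value> <!-- 72 h * 60 * 60 * 1000 ms -->
</property>
<property>
  <name>mapred.userlog.retain.hours</name>
  <value>72</value>
</property>
```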
"Amar Kamat" <ama...@yahoo-inc.com> wrote in
message news:49cafd8e.8010...@yahoo-inc.com...
Amareshwari Sriramadasu wrote:
Set mapred.jobtracker.retirejob.interval
This is used to retire completed jobs.
and mapred.userlog.retain.hours to a higher value.
This is used to discard user logs.
By default, their values are 24 hours. These might be the reason for
failure, though I'm not sure.
Thanks
Amareshwari
Billy Pearson wrote:
I am seeing on one of my long-running jobs (about 50-60 hours) that after
24 hours all active reduce tasks fail with the error message
java.io.IOException: Task process exit with nonzero status of 255.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
Is there something in the config that I can change to stop this?
Every time, within 1 min of the 24-hour mark, they all fail at the same time.
That wastes a lot of resources downloading the map outputs and merging them
again.
What is the state of the reducers (copy or sort)? Check the
jobtracker/tasktracker logs to see what state these reducers are in and
whether a kill signal was issued. Either the jobtracker/tasktracker is
issuing a kill signal or the reducers are committing suicide. Were there
any failures on the reducer side while pulling the map output? Also, what
is the nature of the job? How fast do the maps finish?
Amar
Billy