mapred.jobtracker.retirejob.interval killing long running reduce task
---------------------------------------------------------------------
Key: HADOOP-5591
URL: https://issues.apache.org/jira/browse/HADOOP-5591
Project: Hadoop Core
Issue Type: Bug
Components: mapred
Affects Versions: 0.19.2
Environment: 0.19.2-dev, r753365
Reporter: Billy Pearson
I have long-running jobs, 30-50 hours each, that I run from time to time. I
noticed the reduce tasks getting a WARN Child Error and failing every 24 hours
while in the Shuffle stage.
Following a suggestion on the user list, I changed
mapred.jobtracker.retirejob.interval from 24 hours to 72 hours, and the
problem went away on the next 30-hour job.
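For reference, the change described above can be sketched as follows. This is
only an illustration: it assumes the property value is given in milliseconds,
and the class and method names (RetireJobInterval, hoursToMillis, propertyXml)
are made up for this example and are not part of Hadoop. It converts an
interval in hours and prints a <property> entry in the usual site-config
layout.

```java
import java.util.concurrent.TimeUnit;

public class RetireJobInterval {
    // Property name taken from this report.
    static final String KEY = "mapred.jobtracker.retirejob.interval";

    // Assumption: the JobTracker interprets the value as milliseconds.
    static long hoursToMillis(long hours) {
        return TimeUnit.HOURS.toMillis(hours);
    }

    // Renders a <property> element in the usual Hadoop site-config layout.
    static String propertyXml(long hours) {
        return "<property>\n"
             + "  <name>" + KEY + "</name>\n"
             + "  <value>" + hoursToMillis(hours) + "</value>\n"
             + "</property>";
    }

    public static void main(String[] args) {
        // 72 hours, the value that made the problem go away for me.
        System.out.println(propertyXml(72));
    }
}
```

The printed entry would go into the site configuration file read by the
JobTracker, which would most likely need a restart to pick it up.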
I have seen a reduce task run for longer than 24 hours, but only if it does
not stay in the Shuffle stage or the Sort stage for longer than 24 hours.
I have seen the same error from failed tasks that remain in the Shuffle or
Sort stage for longer than 24 hours.
The error I get from the JobTracker GUI is this:
java.io.IOException: Task process exit with nonzero status of 255.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
The error I get in the TaskTracker logs is this:
2009-03-25 18:37:54,372 WARN org.apache.hadoop.mapred.TaskRunner:
attempt_200903212204_0005_r_000001_1 Child Error
Then cleanup happens and the reduce task is launched again for another attempt.
I am not 100% sure what the setting mapred.jobtracker.retirejob.interval does,
but I would not think any setting should kill an active, non-idle task that is
in the Sort or Shuffle stage.
Someone on the list also asked whether my maps are long running. They are not:
a map takes about 4 minutes on average to complete.
Also, mapred.jobtracker.retirejob.interval is not in the default config, but
the code reads it from the configuration when setting its value.