[ https://issues.apache.org/jira/browse/MAPREDUCE-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494674#comment-13494674 ]
Hudson commented on MAPREDUCE-4774:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk #1253 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1253/])
    MAPREDUCE-4774. JobImpl does not handle asynchronous task events in FAILED state (jlowe via bobby) (Revision 1407679)

     Result = FAILURE
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407679
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestJobImpl.java

> JobImpl does not handle asynchronous task events in FAILED state
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-4774
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4774
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Ivan A. Veselovsky
>            Assignee: Jason Lowe
>             Fix For: 3.0.0, 2.0.3-alpha, 0.23.5
>
>         Attachments: MAPREDUCE-4774.patch
>
>
> The test org.apache.hadoop.mapred.TestClusterMRNotification.testMR frequently fails in the mapred build (e.g. see
> https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2988/testReport/junit/org.apache.hadoop.mapred/TestClusterMRNotification/testMR/
> , or
> https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2982//testReport/org.apache.hadoop.mapred/TestClusterMRNotification/testMR/).
> The test aims to check Job status notifications received through an HTTP Servlet. It runs 3 jobs: successful, killed, and failed.
> The test expects the servlet to receive some expected notifications in some expected order.
> It also tries to test the retry-on-failure notification functionality, so on the 1st attempt of each notification the servlet answers "400 forcing error", and on the 2nd attempt it answers "ok".
> In general, the test fails because the actual number and/or type of the notifications differs from what is expected.
> Investigation shows that the actual root cause of the problem is an incorrect job state transition: the mapred task of the 3rd job fails (by an intentionally thrown RuntimeException, see UtilsForTests#runJobFail()), and the state of the task changes from RUNNING to FAILED.
> At this point a JobEventType.JOB_TASK_ATTEMPT_COMPLETED event is submitted (in method org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl.handleTaskAttemptCompletion(TaskAttemptId, TaskAttemptCompletionEventStatus)), and this event gets processed in AsyncDispatcher, but this transition is impossible according to the event transition map (see org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl#stateMachineFactory).
> This causes the following exception to be thrown when the event is processed:
> 2012-11-06 12:22:02,335 ERROR [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: JOB_TASK_ATTEMPT_COMPLETED at FAILED
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:309)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$3(StateMachineFactory.java:290)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:454)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:716)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:1)
>     at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:917)
>     at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:79)
>     at java.lang.Thread.run(Thread.java:662)
> So the job gets into the "INTERNAL_ERROR" state, and a job end notification like this is sent:
> http://localhost:48656/notification/mapred?jobId=job_1352199715842_0002&jobStatus=ERROR
> (here we can see the "ERROR" status instead of "FAILED")
> After that the notification servlet receives either only an "ERROR" notification, or one more "ERROR" notification after "FAILED", which finally causes the test to fail. (The test behavior varies because of race conditions among the many asynchronous processing steps, so the test is in fact flaky.)
> In any case, it looks like the root cause of the problem is that the forbidden transition "Invalid event: JOB_TASK_ATTEMPT_COMPLETED at FAILED" can occur.
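
The failure mode described in the report can be illustrated in miniature. The following is only a toy sketch, not Hadoop code: `ToyJobStateMachine`, its `State` and `Event` enums, and the use of `IllegalStateException` are all hypothetical stand-ins for `JobImpl`, its transition map, and `InvalidStateTransitonException`. It assumes, as the log above suggests, that the handler catches the invalid-transition exception and moves the job to an error state.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy model (hypothetical, not the real JobImpl): a table-driven state
// machine in which an event with no registered transition is an error.
public class ToyJobStateMachine {
    enum State { RUNNING, FAILED, ERROR }
    enum Event { TASK_ATTEMPT_COMPLETED, TASK_FAILED }

    // Transition table: current state -> (event -> next state).
    private final Map<State, Map<Event, State>> table =
            new EnumMap<State, Map<Event, State>>(State.class);
    private State current = State.RUNNING;

    ToyJobStateMachine() {
        addTransition(State.RUNNING, Event.TASK_ATTEMPT_COMPLETED, State.RUNNING);
        addTransition(State.RUNNING, Event.TASK_FAILED, State.FAILED);
        // Note: nothing is registered for (FAILED, TASK_ATTEMPT_COMPLETED).
    }

    private void addTransition(State from, Event on, State to) {
        Map<Event, State> row = table.get(from);
        if (row == null) {
            row = new EnumMap<Event, State>(Event.class);
            table.put(from, row);
        }
        row.put(on, to);
    }

    // Looks up the next state; throws if the pair is not in the table.
    private State doTransition(State from, Event event) {
        Map<Event, State> row = table.get(from);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event: " + event + " at " + from);
        }
        return next;
    }

    // Assumed handler behavior: an invalid transition pushes the job to ERROR.
    void handle(Event event) {
        try {
            current = doTransition(current, event);
        } catch (IllegalStateException e) {
            System.err.println("Can't handle this event at current state: " + e.getMessage());
            current = State.ERROR;
        }
    }

    State getState() {
        return current;
    }

    public static void main(String[] args) {
        ToyJobStateMachine job = new ToyJobStateMachine();
        job.handle(Event.TASK_FAILED);            // the task fails, the job fails
        System.out.println(job.getState());       // FAILED
        job.handle(Event.TASK_ATTEMPT_COMPLETED); // late asynchronous event
        System.out.println(job.getState());       // ERROR
    }
}
```

In this toy version the late event overwrites FAILED with ERROR, which corresponds to the "ERROR" status seen in the notification URL.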
> Expert advice is needed on how this should be fixed.
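
One possible direction, sketched here under assumptions and not taken from the attached patch, is to register explicit self-loop transitions so that late task events arriving in FAILED are ignored instead of being treated as invalid. Everything in this sketch is hypothetical: `TolerantJobStateMachine`, its enums, and the choice of which events to ignore are illustrative only, and the real `JobImpl` transition map may need a different set of events.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

// Toy model (hypothetical, not the real JobImpl): late task events that
// arrive after the job has FAILED are registered as no-op self-loops.
public class TolerantJobStateMachine {
    enum State { RUNNING, FAILED, ERROR }
    enum Event { TASK_ATTEMPT_COMPLETED, TASK_COMPLETED, TASK_FAILED }

    // Transition table: current state -> (event -> next state).
    private final Map<State, Map<Event, State>> table =
            new EnumMap<State, Map<Event, State>>(State.class);
    private State current = State.RUNNING;

    TolerantJobStateMachine() {
        addTransition(State.RUNNING, Event.TASK_ATTEMPT_COMPLETED, State.RUNNING);
        addTransition(State.RUNNING, Event.TASK_COMPLETED, State.RUNNING);
        addTransition(State.RUNNING, Event.TASK_FAILED, State.FAILED);
        // Assumed set of events that may still arrive asynchronously after
        // the job has failed; each one keeps the job in FAILED.
        for (Event late : EnumSet.of(Event.TASK_ATTEMPT_COMPLETED, Event.TASK_COMPLETED)) {
            addTransition(State.FAILED, late, State.FAILED);
        }
    }

    private void addTransition(State from, Event on, State to) {
        Map<Event, State> row = table.get(from);
        if (row == null) {
            row = new EnumMap<Event, State>(Event.class);
            table.put(from, row);
        }
        row.put(on, to);
    }

    // Unregistered (state, event) pairs still move the job to ERROR.
    void handle(Event event) {
        Map<Event, State> row = table.get(current);
        State next = (row == null) ? null : row.get(event);
        current = (next == null) ? State.ERROR : next;
    }

    State getState() {
        return current;
    }

    public static void main(String[] args) {
        TolerantJobStateMachine job = new TolerantJobStateMachine();
        job.handle(Event.TASK_FAILED);
        job.handle(Event.TASK_ATTEMPT_COMPLETED); // late event is ignored
        System.out.println(job.getState());       // FAILED
    }
}
```

The self-loops are registered per event rather than as a catch-all, so an event that is genuinely unexpected in FAILED (here `TASK_FAILED`) would still surface as ERROR instead of being silently swallowed.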