[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494377#comment-13494377
 ] 

Robert Joseph Evans commented on MAPREDUCE-4774:
------------------------------------------------

The change looks simple enough and does fix the failing test.  I am +1 pending 
Jenkins approval.
                
> JobImpl does not handle asynchronous task events in FAILED state
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-4774
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4774
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Ivan A. Veselovsky
>            Assignee: Jason Lowe
>         Attachments: MAPREDUCE-4774.patch
>
>
> The test org.apache.hadoop.mapred.TestClusterMRNotification.testMR frequently 
>  fails in the mapred build (e.g. see 
> https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2988/testReport/junit/org.apache.hadoop.mapred/TestClusterMRNotification/testMR/
>  , or 
> https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2982//testReport/org.apache.hadoop.mapred/TestClusterMRNotification/testMR/).
> The test aims to check Job status notifications received through HTTP 
> Servlet. It runs 3 jobs: successful, killed, and failed. 
> The test expects the servlet to receive some expected notifications in some 
> expected order. It also tries to test the retry-on-failure notification 
> functionality, so on each 1st notification the servlet answers "400 forcing 
> error", and on each 2nd notification attempt it answers "ok". 
> In general, the test fails because the actual number and/or type of the 
> notifications differ from what is expected.
> Investigation shows that the actual root cause of the problem is an incorrect 
> job state transition: the 3rd job's mapred task fails (due to an intentionally 
> thrown RuntimeException, see UtilsForTests#runJobFail()), and the state of the 
> task changes from RUNNING to FAILED.
> At this point JobEventType.JOB_TASK_ATTEMPT_COMPLETED event is submitted (in  
> method 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl.handleTaskAttemptCompletion(TaskAttemptId,
>  TaskAttemptCompletionEventStatus)), and this event gets processed in 
> AsyncDispatcher, but this transition is impossible according to the event 
> transition map (see 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl#stateMachineFactory). 
> This causes the following exception to be thrown when the event is processed:
> 2012-11-06 12:22:02,335 ERROR [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: JOB_TASK_ATTEMPT_COMPLETED at FAILED
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:309)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$3(StateMachineFactory.java:290)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:454)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:716)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:1)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:917)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:79)
>         at java.lang.Thread.run(Thread.java:662) 
> So the job enters the "INTERNAL_ERROR" state, and a job end notification like 
> this is sent:
> http://localhost:48656/notification/mapred?jobId=job_1352199715842_0002&jobStatus=ERROR
>  
> (here we can see "ERROR" status instead of "FAILED")
> After that the notification servlet receives either only an "ERROR" 
> notification, or one more "ERROR" notification after "FAILED", which finally 
> causes the test to fail. (The test behavior varies because of race 
> conditions among the many asynchronous processing steps, so the test is in 
> fact flaky.)
> In any case, it looks like the root cause of the problem is that the 
> forbidden transition "Invalid event: JOB_TASK_ATTEMPT_COMPLETED at 
> FAILED" can occur. 
> Expert advice is needed on how this should be fixed.
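
The failure mode described above can be illustrated with a small, self-contained
sketch. This is not the attached patch and it does not use the real
StateMachineFactory API; the class JobStateSketch, its enums, and the
ignoreLateEventsInFailed flag are all hypothetical names chosen for
illustration. It assumes that one plausible remedy is to register a
self-transition so that a late task-attempt event arriving in FAILED is ignored
instead of pushing the job into an error state.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical, simplified model of a job state machine (not Hadoop code).
class JobStateSketch {
    public enum State { RUNNING, FAILED, ERROR }
    public enum Event { TASK_ATTEMPT_COMPLETED, TASK_FAILED }

    // Transition table: current state -> (event -> next state).
    private final Map<State, Map<Event, State>> table =
            new EnumMap<State, Map<Event, State>>(State.class);
    private State current = State.RUNNING;

    public JobStateSketch(boolean ignoreLateEventsInFailed) {
        for (State s : State.values()) {
            table.put(s, new EnumMap<Event, State>(Event.class));
        }
        table.get(State.RUNNING).put(Event.TASK_ATTEMPT_COMPLETED, State.RUNNING);
        table.get(State.RUNNING).put(Event.TASK_FAILED, State.FAILED);
        if (ignoreLateEventsInFailed) {
            // Assumed remedy: a late asynchronous event in FAILED is a no-op.
            table.get(State.FAILED).put(Event.TASK_ATTEMPT_COMPLETED, State.FAILED);
        }
    }

    // An event with no registered transition moves the job to ERROR,
    // mirroring the "Invalid event ... at FAILED" behavior in the report.
    public State handle(Event e) {
        State next = table.get(current).get(e);
        current = (next != null) ? next : State.ERROR;
        return current;
    }

    public static void main(String[] args) {
        JobStateSketch without = new JobStateSketch(false);
        without.handle(Event.TASK_FAILED);
        System.out.println(without.handle(Event.TASK_ATTEMPT_COMPLETED)); // ERROR

        JobStateSketch with = new JobStateSketch(true);
        with.handle(Event.TASK_FAILED);
        System.out.println(with.handle(Event.TASK_ATTEMPT_COMPLETED)); // FAILED
    }
}
```

With the self-transition in place, the final state stays FAILED, so only one
kind of terminal notification would be produced in this simplified model.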

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
