Ivan A. Veselovsky created MAPREDUCE-4774:
---------------------------------------------
Summary: repair test
org.apache.hadoop.mapred.TestClusterMRNotification.testMR
Key: MAPREDUCE-4774
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4774
Project: Hadoop Map/Reduce
Issue Type: Bug
Reporter: Ivan A. Veselovsky
The test org.apache.hadoop.mapred.TestClusterMRNotification.testMR frequently
fails in the MapReduce pre-commit build. See, for example:
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2988/testReport/junit/org.apache.hadoop.mapred/TestClusterMRNotification/testMR/
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2982//testReport/org.apache.hadoop.mapred/TestClusterMRNotification/testMR/
The test checks the job status notifications received through an HTTP servlet.
It runs 3 jobs: one successful, one killed, and one failed.
The test expects the servlet to receive a specific set of notifications in a
specific order. It also exercises the retry-on-failure notification
functionality: the servlet answers the 1st attempt of each notification with
"400 forcing error" and the 2nd attempt with "ok".
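The retry behavior described above can be illustrated with a small in-memory model. This is only a sketch, not the actual test servlet or Hadoop's notifier: the class names FlakyNotificationEndpoint and RetryingNotifier, the odd/even attempt counting, and the status strings are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the test servlet: rejects every 1st attempt, accepts every 2nd. */
class FlakyNotificationEndpoint {
    private int attempts = 0;                        // total requests seen so far
    final List<String> accepted = new ArrayList<>(); // statuses that were answered "ok"

    /** Returns an HTTP-like status code: 400 on odd-numbered attempts, 200 on even-numbered ones. */
    int receive(String jobStatus) {
        attempts++;
        if (attempts % 2 == 1) {
            return 400; // models the "400 forcing error" answer
        }
        accepted.add(jobStatus);
        return 200;     // models the "ok" answer
    }

    int attempts() {
        return attempts;
    }
}

/** Hypothetical notifier that resends a notification until it is accepted. */
class RetryingNotifier {
    /** Sends one notification, trying at most maxAttempts times; returns true if it was accepted. */
    static boolean notifyWithRetry(FlakyNotificationEndpoint endpoint, String jobStatus, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            if (endpoint.receive(jobStatus) == 200) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        FlakyNotificationEndpoint endpoint = new FlakyNotificationEndpoint();
        // Illustrative statuses for the successful, killed, and failed jobs.
        for (String status : new String[] {"SUCCEEDED", "KILLED", "FAILED"}) {
            notifyWithRetry(endpoint, status, 2);
        }
        // prints: [SUCCEEDED, KILLED, FAILED] after 6 attempts
        System.out.println(endpoint.accepted + " after " + endpoint.attempts() + " attempts");
    }
}
```

In this model, any extra or differently typed notification shifts the sequence that the endpoint records, which is the kind of mismatch the test reports.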
In general, the test fails because the actual number and/or type of the
notifications differ from what is expected.
Investigation shows that the actual root cause of the problem is an incorrect
job state transition: a task of the 3rd job fails (because of an intentionally
thrown RuntimeException, see UtilsForTests#runJobFail()), and the state of the
task changes from RUNNING to FAILED.
At this point a JobEventType.JOB_TASK_ATTEMPT_COMPLETED event is submitted (in
method
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl.handleTaskAttemptCompletion(TaskAttemptId,
TaskAttemptCompletionEventStatus)) and is processed by the AsyncDispatcher.
However, the job's transition map (see
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl#stateMachineFactory) does
not allow this event in the job's current state, so the following exception is
thrown when the event is processed:
2012-11-06 12:22:02,335 ERROR [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: JOB_TASK_ATTEMPT_COMPLETED at FAILED
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:309)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$3(StateMachineFactory.java:290)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:454)
    at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:716)
    at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:1)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:917)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:79)
    at java.lang.Thread.run(Thread.java:662)
So the job gets into the "INTERNAL_ERROR" state, and a job end notification
like the following is sent:
http://localhost:48656/notification/mapred?jobId=job_1352199715842_0002&jobStatus=ERROR
(note the "ERROR" status instead of "FAILED").
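The failure mode can be reproduced in miniature with a table-driven state machine. The following is a toy sketch under stated assumptions, not Hadoop's StateMachineFactory or JobImpl: the class JobStateMachineSketch, the event TASK_FAILURE_SEEN, and the collapsed ERROR state are hypothetical names chosen for illustration. Only JOB_TASK_ATTEMPT_COMPLETED and the FAILED state come from the log above.

```java
import java.util.EnumMap;
import java.util.Map;

/** Toy job state machine: any (state, event) pair missing from the table is treated as invalid. */
class JobStateMachineSketch {
    enum State { RUNNING, FAILED, ERROR }

    // TASK_FAILURE_SEEN is a hypothetical event used only to move the toy job to FAILED.
    enum Event { JOB_TASK_ATTEMPT_COMPLETED, TASK_FAILURE_SEEN }

    // Transition table: current state -> (event -> next state).
    private static final Map<State, Map<Event, State>> TRANSITIONS = new EnumMap<>(State.class);

    static {
        Map<Event, State> fromRunning = new EnumMap<>(Event.class);
        fromRunning.put(Event.JOB_TASK_ATTEMPT_COMPLETED, State.RUNNING);
        fromRunning.put(Event.TASK_FAILURE_SEEN, State.FAILED);
        TRANSITIONS.put(State.RUNNING, fromRunning);
        // Deliberately no entries for FAILED: a late event arriving there is invalid.
    }

    private State state = State.RUNNING;

    /** Applies the event; an unregistered pair moves the job to ERROR, mirroring the report. */
    State handle(Event event) {
        Map<Event, State> row = TRANSITIONS.get(state);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            System.err.println("Invalid event: " + event + " at " + state);
            state = State.ERROR;
        } else {
            state = next;
        }
        return state;
    }

    public static void main(String[] args) {
        JobStateMachineSketch job = new JobStateMachineSketch();
        System.out.println(job.handle(Event.JOB_TASK_ATTEMPT_COMPLETED)); // RUNNING
        System.out.println(job.handle(Event.TASK_FAILURE_SEEN));          // FAILED
        // A completion event that arrives after the job has already failed:
        System.out.println(job.handle(Event.JOB_TASK_ATTEMPT_COMPLETED)); // ERROR
    }
}
```

In this toy model the final status depends on whether the late event is dispatched after the job reaches FAILED, which is consistent with an "ERROR" status being reported where "FAILED" was expected.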
After that the notification servlet receives either only an "ERROR"
notification, or an additional "ERROR" notification after "FAILED"; in both
cases the test fails. (The behavior varies because of race conditions among the
many asynchronous operations involved, which is why the test is flaky.)
In any case, the root cause of the problem appears to be that the forbidden
transition "Invalid event: JOB_TASK_ATTEMPT_COMPLETED at FAILED" can occur.
Expert advice is needed on how this should be fixed.