[ https://issues.apache.org/jira/browse/YARN-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15119268#comment-15119268 ]
Sunil G commented on YARN-4615:
-------------------------------

{noformat}
// AM crashes, and a new app-attempt gets created
node.nodeHeartbeat(applicationAttemptOneID, 1, ContainerState.COMPLETE);
rm.waitForState(node, am1ContainerID, RMContainerState.COMPLETED);
RMAppAttempt rmAppAttempt2 = MockRM.waitForAttemptScheduled(rmApp, rm);
{noformat}

The code snippet above is from the test case named in the JIRA title, and {{MockRM.waitForAttemptScheduled}} is the call that reported the wrong-state problem.

In the {{rm.waitForState}} line, the test verifies that the AM container state is COMPLETED. {{waitForAttemptScheduled}} then waits until the next attempt is SCHEDULED. However, the attempt goes to ALLOCATED instead (an extra node heartbeat might have arrived and pushed the container to be allocated).

Looking at {{rm.waitForState}}, it sends a nodeHeartbeat while waiting if the state is not yet correct. That is not needed here, since we have already sent a heartbeat with the container-completed details. I suspect that {{RMContainerState.COMPLETED}} had not yet been reached for the AM container when the state was checked in {{rm.waitForState}}, so one extra heartbeat was sent from this method.

I will upload a patch with a new {{rm.waitForState}} that doesn't send a nodeHeartbeat; instead it will only wait until the timeout happens.

[~rohithsharma] please share your thoughts.

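For illustration only, a wait that polls without driving the system could look like the sketch below. It is not the patch for this issue and not the actual {{MockRM}} API: the class {{PassiveStateWaiter}}, its method signature, and the poll interval are hypothetical, and the state is read through a plain {{Supplier}} so the example stays self-contained.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper, not part of Hadoop: waits for a state by polling only.
final class PassiveStateWaiter {

    private PassiveStateWaiter() {
    }

    /**
     * Polls currentState until it equals expected or the timeout elapses.
     * Unlike a wait loop that sends a node heartbeat on every retry, this
     * method has no side effects on the system being observed.
     *
     * @return true if the expected state was seen, false on timeout or interrupt
     */
    static <S> boolean waitForState(Supplier<S> currentState, S expected,
                                    long timeoutMs, long pollIntervalMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            // Read-only check: nothing here can advance the scheduler.
            if (expected.equals(currentState.get())) {
                return true;
            }
            // Overflow-safe comparison of nanoTime values.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

Because such a helper only reads the state, it cannot itself trigger an extra scheduling cycle between the COMPLETED check and the SCHEDULED check. The trade-off is that it depends on the heartbeat already sent by the test being enough for the container to reach the expected state.
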
> TestAbstractYarnScheduler#testResourceRequestRecoveryToTheRightAppAttempt fails occasionally
> --------------------------------------------------------------------------------------------
>
>                 Key: YARN-4615
>                 URL: https://issues.apache.org/jira/browse/YARN-4615
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: test
>            Reporter: Jason Lowe
>
> Sometimes TestAbstractYarnScheduler#testResourceRequestRecoveryToTheRightAppAttempt will fail like this:
> {noformat}
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.TestAbstractYarnScheduler
> testResourceRequestRecoveryToTheRightAppAttempt[1](org.apache.hadoop.yarn.server.resourcemanager.scheduler.TestAbstractYarnScheduler)  Time elapsed: 77.427 sec  <<< FAILURE!
> java.lang.AssertionError: Attempt state is not correct (timedout): expected: SCHEDULED actual: ALLOCATED for the application attempt appattempt_1453254869107_0001_000002
> 	at org.junit.Assert.fail(Assert.java:88)
> 	at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:197)
> 	at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:172)
> 	at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForAttemptScheduled(MockRM.java:831)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.TestAbstractYarnScheduler.testResourceRequestRecoveryToTheRightAppAttempt(TestAbstractYarnScheduler.java:572)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)