[ https://issues.apache.org/jira/browse/YARN-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15191211#comment-15191211 ]
Eric Badger commented on YARN-4686:
-----------------------------------

Thanks for finding this failure, [~eepayne]. I narrowed the issue down to a race condition between the MiniYARNCluster finishing startup and the test placing the reservation.

When the CapacityScheduler starts up, it creates a PlanFollower (via startPlanFollower()). The thread created by startPlanFollower() executes the synchronizePlan() function in a loop. The main test code in TestYarnClient#testReservationAPIs runs in a different thread and calls submitReservation (TestYarnClient.java:1213) once the cluster is up and running. The race is between the synchronizePlan thread calling plan.setTotalCapacity (indirectly through CapacityScheduler.java:137) and the submitReservation thread calling plan.getTotalCapacity (indirectly through ReservationInputValidator.java:148).

The patch that I submitted before makes sure that MiniYARNCluster.start() won't return until the CapacityScheduler has registered all of the nodes, but it doesn't wait for totalCapacity to be set to the correct value. Is there a good way to make sure that the cluster doesn't report itself as started until the scheduler has totalCapacity set to the correct value?

> MiniYARNCluster.start() returns before cluster is completely started
> --------------------------------------------------------------------
>
>                 Key: YARN-4686
>                 URL: https://issues.apache.org/jira/browse/YARN-4686
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: test
>            Reporter: Rohith Sharma K S
>            Assignee: Eric Badger
>         Attachments: MAPREDUCE-6507.001.patch, YARN-4686.001.patch,
> YARN-4686.002.patch, YARN-4686.003.patch
>
>
> TestRMNMInfo fails intermittently. Below is the trace for the failure
> {noformat}
> testRMNMInfo(org.apache.hadoop.mapreduce.v2.TestRMNMInfo) Time elapsed: 0.28 sec <<< FAILURE!
> java.lang.AssertionError: Unexpected number of live nodes: expected:<4> but was:<3>
> 	at org.junit.Assert.fail(Assert.java:88)
> 	at org.junit.Assert.failNotEquals(Assert.java:743)
> 	at org.junit.Assert.assertEquals(Assert.java:118)
> 	at org.junit.Assert.assertEquals(Assert.java:555)
> 	at org.apache.hadoop.mapreduce.v2.TestRMNMInfo.testRMNMInfo(TestRMNMInfo.java:111)
> {noformat}
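
One possible way to close the race described in the comment is to poll until the value the test depends on is actually visible, rather than returning as soon as the nodes have registered. The following is only a minimal sketch under stated assumptions: `CapacityWait`, `waitFor`, and the `AtomicLong` standing in for the plan's total capacity are hypothetical names for illustration, not MiniYARNCluster or CapacityScheduler API, and the capacity figure is made up.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

class CapacityWait {

    // Hypothetical helper: re-checks the condition every intervalMs until it
    // holds, and throws TimeoutException if timeoutMs elapses first.
    static void waitFor(BooleanSupplier condition, long intervalMs, long timeoutMs)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() >= deadline) {
                throw new TimeoutException("condition not met within " + timeoutMs + " ms");
            }
            Thread.sleep(intervalMs);
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the plan's total capacity (assumption: starts at 0
        // until the follower thread has synchronized the plan).
        AtomicLong totalCapacity = new AtomicLong(0);
        long expectedCapacity = 4 * 8192L; // illustrative value only

        // Stand-in for the plan follower thread, which sets the capacity
        // some time after the "cluster" is nominally up.
        Thread follower = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            totalCapacity.set(expectedCapacity);
        });
        follower.start();

        // Stand-in for the test thread: block until the capacity is visible
        // before doing anything that reads it (e.g. submitting a reservation).
        waitFor(() -> totalCapacity.get() == expectedCapacity, 10, 5000);
        System.out.println("total capacity visible: " + totalCapacity.get());
        follower.join();
    }
}
```

Hadoop's test utilities already include a polling helper of this shape (GenericTestUtils.waitFor), so a real fix would more likely reuse that than add a new one. Whether the wait belongs inside MiniYARNCluster.start() or in the test itself is the open question raised in the comment.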