[ 
https://issues.apache.org/jira/browse/YARN-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15191211#comment-15191211
 ] 

Eric Badger commented on YARN-4686:
-----------------------------------

Thanks for finding this failure, [~eepayne]. 

I narrowed the issue down to a race condition between the MiniYARNCluster 
finishing its startup and the test placing its reservation. When 
the CapacityScheduler starts up, it creates a PlanFollower (via 
startPlanFollower()). The thread created by startPlanFollower() executes the 
synchronizePlan() function in a loop. The main test code in 
TestYarnClient#testReservationAPIs is running in a different thread and calls 
submitReservation (TestYarnClient.java:1213) once the cluster is up and 
running. The race is between the synchronizePlan thread calling 
plan.setTotalCapacity (indirectly through CapacityScheduler.java:137) and the 
submitReservation thread calling plan.getTotalCapacity (indirectly through 
ReservationInputValidator.java:148). 

The patch that I submitted before makes sure that the MiniYARNCluster won't 
return until the CapacityScheduler has registered all of the nodes, but it 
doesn't wait for the totalCapacity to be set to the correct value. Is there a 
good way to make sure that start() won't return until the scheduler has 
totalCapacity set to the correct value?
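
One option might be to poll until the plan reports the expected capacity, 
in the style of a wait-for-condition test helper. The sketch below is only an 
illustration of that idea and does not use the real YARN classes: FakePlan, 
waitFor, and the 4 x 4096 MB capacity are hypothetical stand-ins, and the 
background thread merely imitates the synchronizePlan loop setting the 
capacity some time after startup.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

public class CapacityWaitSketch {

    /** Hypothetical stand-in for the plan; NOT the real YARN Plan class. */
    static final class FakePlan {
        private final AtomicLong totalCapacityMb = new AtomicLong(0);

        void setTotalCapacity(long mb) {
            totalCapacityMb.set(mb);
        }

        long getTotalCapacity() {
            return totalCapacityMb.get();
        }
    }

    /**
     * Polls the condition every intervalMs until it holds, or throws
     * TimeoutException once timeoutMs has elapsed. Hypothetical helper.
     */
    static void waitFor(BooleanSupplier condition, long intervalMs, long timeoutMs)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException(
                    "condition not met within " + timeoutMs + " ms");
            }
            Thread.sleep(intervalMs);
        }
    }

    public static void main(String[] args) throws Exception {
        FakePlan plan = new FakePlan();
        long expectedMb = 4 * 4096L; // assumed: 4 nodes with 4096 MB each

        // Imitates the plan follower thread: capacity is set "late".
        Thread follower = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            plan.setTotalCapacity(expectedMb);
        });
        follower.setDaemon(true);
        follower.start();

        // Without this wait, a reader could still see a capacity of 0.
        waitFor(() -> plan.getTotalCapacity() == expectedMb, 20, 5_000);
        System.out.println("total capacity ready: " + plan.getTotalCapacity());
    }
}
```

In the real cluster the condition would have to read the capacity that the 
scheduler's plan actually reports, and where that wait belongs (inside 
start() or in the test itself) is the open question.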

> MiniYARNCluster.start() returns before cluster is completely started
> --------------------------------------------------------------------
>
>                 Key: YARN-4686
>                 URL: https://issues.apache.org/jira/browse/YARN-4686
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: test
>            Reporter: Rohith Sharma K S
>            Assignee: Eric Badger
>         Attachments: MAPREDUCE-6507.001.patch, YARN-4686.001.patch, 
> YARN-4686.002.patch, YARN-4686.003.patch
>
>
> TestRMNMInfo fails intermittently. Below is trace for the failure
> {noformat}
> testRMNMInfo(org.apache.hadoop.mapreduce.v2.TestRMNMInfo)  Time elapsed: 0.28 
> sec  <<< FAILURE!
> java.lang.AssertionError: Unexpected number of live nodes: expected:<4> but 
> was:<3>
>       at org.junit.Assert.fail(Assert.java:88)
>       at org.junit.Assert.failNotEquals(Assert.java:743)
>       at org.junit.Assert.assertEquals(Assert.java:118)
>       at org.junit.Assert.assertEquals(Assert.java:555)
>       at 
> org.apache.hadoop.mapreduce.v2.TestRMNMInfo.testRMNMInfo(TestRMNMInfo.java:111)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
