Peter Bacsko created YARN-9436:
----------------------------------

             Summary: Flaky test testApplicationLifetimeMonitor
                 Key: YARN-9436
                 URL: https://issues.apache.org/jira/browse/YARN-9436
             Project: Hadoop YARN
          Issue Type: Bug
          Components: scheduler, test
            Reporter: Peter Bacsko
            Assignee: Peter Bacsko

In our test environment, we occasionally encounter this failure:

{noformat}
2019-04-03 12:49:32 [INFO] Running org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor
2019-04-03 12:53:08 [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 215.535 s <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor
2019-04-03 12:53:08 [ERROR] testApplicationLifetimeMonitor[0](org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor) Time elapsed: 34.244 s <<< FAILURE!
2019-04-03 12:53:08 java.lang.AssertionError: Application killed before lifetime value
2019-04-03 12:53:08 	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor.testApplicationLifetimeMonitor(TestApplicationLifetimeMonitor.java:218)
{noformat}

The root cause is the condition here:

{noformat}
Assert.assertTrue("Application killed before lifetime value",
    totalTimeRun > maxLifetime);
{noformat}

There are two problems with this condition:

1. Logically it's not correct. Since the app should be killed after 30 seconds, one would expect to see {{totalTimeRun = maxLifetime}}. Due to some asynchronicity and rounding, {{totalTimeRun}} ends up being 31 most of the time, which is why the test usually passes.
2. Sometimes the application is killed fast enough and {{totalTimeRun}} is 30. This is a correct outcome, because in {{setUpCSQueue}} we set the queue lifetime:

{noformat}
csConf.setMaximumLifetimePerQueue(
    CapacitySchedulerConfiguration.ROOT + ".default", maxLifetime);
csConf.setDefaultLifetimePerQueue(
    CapacitySchedulerConfiguration.ROOT + ".default", defaultLifetime);
{noformat}

A more proper condition is:

{noformat}
Assert.assertTrue("Application killed before lifetime value",
    totalTimeRun >= maxLifetime);
{noformat}

The assertion message in the next line is also misleading:

{noformat}
Assert.assertTrue(
    "Application killed before lifetime value " + totalTimeRun,
    totalTimeRun < maxLifetime + 10L);
{noformat}

If this condition is false, the application ran for 40 seconds or more, so it outlived the queue's lifetime (30s) and reached the app's own lifetime (40s). It was killed too _late_, not too early. A more accurate message is:

{noformat}
Assert.assertTrue(
    "Application killed after queue/app lifetime value: " + totalTimeRun,
    totalTimeRun < maxLifetime + 10L);
{noformat}

We can even be stricter, since we expect a kill almost immediately after 30 seconds:

{noformat}
Assert.assertTrue(
    "Application killed too late: " + totalTimeRun,
    totalTimeRun < maxLifetime + 2L);
{noformat}

where we allow a 2 second tolerance.
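
To make the boundary behaviour concrete, here is a minimal standalone sketch that compares the original lower-bound check with the proposed window for run times around the 30-second queue lifetime. It is an illustration only, not code from the test: the class name {{LifetimeCheckSketch}}, the helper method names and the constant {{TOLERANCE_SECONDS}} are hypothetical, and the values 30 and 2 are taken from the description above.

```java
// Hypothetical illustration; not part of TestApplicationLifetimeMonitor.
final class LifetimeCheckSketch {

    // Assumed tolerance, matching the "maxLifetime + 2L" proposal above.
    static final long TOLERANCE_SECONDS = 2L;

    // The original lower bound: rejects a kill that lands exactly on the lifetime.
    static boolean passesOriginalLowerBound(long totalTimeRun, long maxLifetime) {
        return totalTimeRun > maxLifetime;
    }

    // The proposed window: the kill must happen no earlier than the lifetime
    // and strictly less than TOLERANCE_SECONDS after it.
    static boolean passesProposedBounds(long totalTimeRun, long maxLifetime) {
        return totalTimeRun >= maxLifetime
            && totalTimeRun < maxLifetime + TOLERANCE_SECONDS;
    }

    public static void main(String[] args) {
        long maxLifetime = 30L; // queue lifetime in seconds, as in the description
        for (long totalTimeRun = 29L; totalTimeRun <= 32L; totalTimeRun++) {
            System.out.printf("totalTimeRun=%d original=%b proposed=%b%n",
                totalTimeRun,
                passesOriginalLowerBound(totalTimeRun, maxLifetime),
                passesProposedBounds(totalTimeRun, maxLifetime));
        }
    }
}
```

With these assumptions, a run time of exactly 30 fails the original check but passes the proposed one, which is the flaky case reported here, while a run time of 32 is only rejected by the stricter upper bound.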