[ https://issues.apache.org/jira/browse/HDDS-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HDDS-1709: --------------------------------- Labels: pull-request-available (was: ) > TestScmSafeNode is flaky > ------------------------ > > Key: HDDS-1709 > URL: https://issues.apache.org/jira/browse/HDDS-1709 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: SCM, test > Reporter: Elek, Marton > Assignee: Elek, Marton > Priority: Critical > Labels: pull-request-available > > org.apache.hadoop.ozone.om.TestScmSafeMode.testSCMSafeMode is failed at last > night with the following error: > {code:java} > java.lang.AssertionError at org.junit.Assert.fail(Assert.java:86) at > org.junit.Assert.assertTrue(Assert.java:41) at > org.junit.Assert.assertTrue(Assert.java:52) at > org.apache.hadoop.ozone.om.TestScmSafeMode.testSCMSafeMode(TestScmSafeMode.java:285) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74){code} > Locally it can be tested but it's very easy to reproduce by adding an > additional sleep DataNodeSafeModeRule: > {code:java} > +++ > b/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/DataNodeSafeModeRule.java > @@ -63,7 +63,11 @@ protected boolean validate() { > > @Override > protected void process(NodeRegistrationContainerReport reportsProto) { > - > + try { > + Thread.sleep(3000); > + } catch (InterruptedException e) { > + e.printStackTrace(); > + }{code} > This is a clear race condition: > DatanodeSafeModeRule and ContainerSafeModeRule are processing the same events > but it can be possible (in case of an accidental sleep) that the container > safe mode rule is done, but DatanodeSafeModeRule didn't process the new event > (yet). > As a result the test execution will continue: > {code:java} > GenericTestUtils > .waitFor(() -> scm.getCurrentContainerThreshold() == 1.0, 100, 20000); > {code} > (This line is waiting ONLY for the ContainerSafeModeRule). > The fix is easy, let's wait for the processing of all the async events: > {code:java} > EventQueue eventQueue = > (EventQueue) cluster.getStorageContainerManager().getEventQueue(); > eventQueue.processAll(5000L);{code} > As we are sure that the events are already sent to the EventQueue (because we > have the previous waitFor), it should be enough. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org