Elek, Marton created HDDS-1709:
----------------------------------

             Summary: TestScmSafeNode is flaky
                 Key: HDDS-1709
                 URL: https://issues.apache.org/jira/browse/HDDS-1709
             Project: Hadoop Distributed Data Store
          Issue Type: Bug
          Components: SCM, test
            Reporter: Elek, Marton
            Assignee: Elek, Marton


org.apache.hadoop.ozone.om.TestScmSafeMode.testSCMSafeMode is failed at last 
night with the following error:
{code:java}
java.lang.AssertionError at org.junit.Assert.fail(Assert.java:86) at 
org.junit.Assert.assertTrue(Assert.java:41) at 
org.junit.Assert.assertTrue(Assert.java:52) at 
org.apache.hadoop.ozone.om.TestScmSafeMode.testSCMSafeMode(TestScmSafeMode.java:285)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
 at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
 at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
 at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
 at 
org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74){code}
Locally it can be tested but it's very easy to reproduce by adding an 
additional sleep DataNodeSafeModeRule:
{code:java}
+++ 
b/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/DataNodeSafeModeRule.java
@@ -63,7 +63,11 @@ protected boolean validate() {
 
   @Override
   protected void process(NodeRegistrationContainerReport reportsProto) {
-
+    try {
+      Thread.sleep(3000);
+    } catch (InterruptedException e) {
+      e.printStackTrace();
+    }{code}
This is a clear race condition:

DatanodeSafeModeRule and ContainerSafeModeRule are processing the same events 
but it can be possible (in case of an accidental sleep) that the container safe 
mode rule is done, but DatanodeSafeModeRule didn't process the new event (yet).

As a result the test execution will continue:
{code:java}
GenericTestUtils
    .waitFor(() -> scm.getCurrentContainerThreshold() == 1.0, 100, 20000);
{code}
(This line is waiting ONLY for the ContainerSafeModeRule).

The fix is easy, let's wait for the processing of all the async events:
{code:java}
EventQueue eventQueue =
    (EventQueue) cluster.getStorageContainerManager().getEventQueue();
eventQueue.processAll(5000L);{code}
As we are sure that the events are already sent to the EventQueue (because we 
have the previous waitFor), it should be enough.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org

Reply via email to