[ 
https://issues.apache.org/jira/browse/HDDS-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anu Engineer updated HDDS-1709:
-------------------------------
       Resolution: Fixed
    Fix Version/s: 0.4.1
           Status: Resolved  (was: Patch Available)

[~elek] Thanks for the contribution. I have committed this patch to trunk.
[~bharatviswa] Thanks for the review.

> TestScmSafeNode is flaky
> ------------------------
>
>                 Key: HDDS-1709
>                 URL: https://issues.apache.org/jira/browse/HDDS-1709
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: SCM, test
>            Reporter: Elek, Marton
>            Assignee: Elek, Marton
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.4.1
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> org.apache.hadoop.ozone.om.TestScmSafeMode.testSCMSafeMode failed last 
> night with the following error:
> {code:java}
> java.lang.AssertionError
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at org.apache.hadoop.ozone.om.TestScmSafeMode.testSCMSafeMode(TestScmSafeMode.java:285)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {code}
> It can be tested locally: the failure is very easy to reproduce by adding an 
> additional sleep to DataNodeSafeModeRule:
> {code:java}
> +++ b/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/DataNodeSafeModeRule.java
> @@ -63,7 +63,11 @@ protected boolean validate() {
>  
>    @Override
>    protected void process(NodeRegistrationContainerReport reportsProto) {
> -
> +    try {
> +      Thread.sleep(3000);
> +    } catch (InterruptedException e) {
> +      e.printStackTrace();
> +    }{code}
> This is a clear race condition:
> DataNodeSafeModeRule and ContainerSafeModeRule process the same events, 
> but it is possible (for example, after an accidental delay) that 
> ContainerSafeModeRule has finished while DataNodeSafeModeRule has not yet 
> processed the new event.
> As a result, the test execution continues as soon as this wait returns:
> {code:java}
> GenericTestUtils
>     .waitFor(() -> scm.getCurrentContainerThreshold() == 1.0, 100, 20000);
> {code}
> (This line is waiting ONLY for the ContainerSafeModeRule).
> The fix is easy. Let's wait for the processing of all the async events:
> {code:java}
> EventQueue eventQueue =
>     (EventQueue) cluster.getStorageContainerManager().getEventQueue();
> eventQueue.processAll(5000L);{code}
> As we are sure that the events are already sent to the EventQueue (because we 
> have the previous waitFor), it should be enough.
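
The race and the fix described above can be modelled outside Ozone with a small sketch. It is only an illustration under stated assumptions: the class SafeModeRaceSketch and its run method are hypothetical, two plain java.util.concurrent executors stand in for the SCM EventQueue, and awaitTermination stands in for eventQueue.processAll(5000L). It is not the committed patch.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Toy model: two rules handle the same event on separate executors. */
public class SafeModeRaceSketch {

  /**
   * Returns {containerRuleDone, dataNodeRuleDone} as seen by the "test"
   * thread at the point where it would run its assertions.
   */
  static boolean[] run(boolean drainBeforeAssert) throws InterruptedException {
    AtomicBoolean containerRuleDone = new AtomicBoolean(false);
    AtomicBoolean dataNodeRuleDone = new AtomicBoolean(false);

    // Assumption: one single-threaded executor per handler, so the two
    // handlers can finish in any order.
    ExecutorService containerExecutor = Executors.newSingleThreadExecutor();
    ExecutorService dataNodeExecutor = Executors.newSingleThreadExecutor();

    // The same event is dispatched to both handlers.
    containerExecutor.execute(() -> containerRuleDone.set(true));
    dataNodeExecutor.execute(() -> {
      try {
        Thread.sleep(300); // the "accidental sleep" from the description
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      dataNodeRuleDone.set(true);
    });

    // Equivalent of the waitFor in the test: it only watches the fast rule.
    while (!containerRuleDone.get()) {
      Thread.sleep(10);
    }

    containerExecutor.shutdown();
    dataNodeExecutor.shutdown();
    if (drainBeforeAssert) {
      // The fix: wait until every queued handler has finished.
      dataNodeExecutor.awaitTermination(5, TimeUnit.SECONDS);
    }
    boolean[] observed = {containerRuleDone.get(), dataNodeRuleDone.get()};

    // Cleanup so no worker thread outlives the call.
    containerExecutor.awaitTermination(5, TimeUnit.SECONDS);
    dataNodeExecutor.awaitTermination(5, TimeUnit.SECONDS);
    return observed;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("without drain: dataNodeRuleDone=" + run(false)[1]);
    System.out.println("with drain:    dataNodeRuleDone=" + run(true)[1]);
  }
}
```

Without the drain step the second flag is usually still false when it is read, which corresponds to the failing assertion. With the drain step both flags are set before the read.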


