[jira] [Work logged] (HDDS-1709) TestScmSafeNode is flaky

ASF GitHub Bot (JIRA) Mon, 24 Jun 2019 16:53:10 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-1709?focusedWorklogId=266220&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-266220
 ]


ASF GitHub Bot logged work on HDDS-1709:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 24/Jun/19 23:51
            Start Date: 24/Jun/19 23:51
    Worklog Time Spent: 10m 
      Work Description: bharatviswa504 commented on issue #993: HDDS-1709. 
TestScmSafeNode is flaky
URL: https://github.com/apache/hadoop/pull/993#issuecomment-505223323
 
 
   Thank you @elek for reporting and fixing the issue.
   +1 LGTM. Can you take care of the checkstyle issue during the commit?
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 266220)
    Time Spent: 0.5h  (was: 20m)

> TestScmSafeNode is flaky
> ------------------------
>
>                 Key: HDDS-1709
>                 URL: https://issues.apache.org/jira/browse/HDDS-1709
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: SCM, test
>            Reporter: Elek, Marton
>            Assignee: Elek, Marton
>            Priority: Critical
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> org.apache.hadoop.ozone.om.TestScmSafeMode.testSCMSafeMode failed last night 
> with the following error:
> {code:java}
> java.lang.AssertionError
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at org.apache.hadoop.ozone.om.TestScmSafeMode.testSCMSafeMode(TestScmSafeMode.java:285)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74){code}
> It can be tested locally, and it's very easy to reproduce by adding an 
> additional sleep to DataNodeSafeModeRule:
> {code:java}
> +++ b/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/DataNodeSafeModeRule.java
> @@ -63,7 +63,11 @@ protected boolean validate() {
>  
>    @Override
>    protected void process(NodeRegistrationContainerReport reportsProto) {
> -
> +    try {
> +      Thread.sleep(3000);
> +    } catch (InterruptedException e) {
> +      e.printStackTrace();
> +    }{code}
> This is a clear race condition:
> DataNodeSafeModeRule and ContainerSafeModeRule process the same events, but 
> it is possible (for example, with an accidental sleep) that the 
> ContainerSafeModeRule is done while DataNodeSafeModeRule has not processed 
> the new event yet.
> As a result, the test execution continues past this wait:
> {code:java}
> GenericTestUtils
>     .waitFor(() -> scm.getCurrentContainerThreshold() == 1.0, 100, 20000);
> {code}
> (This line is waiting ONLY for the ContainerSafeModeRule).
> The fix is easy; let's wait for the processing of all the async events:
> {code:java}
> EventQueue eventQueue =
>     (EventQueue) cluster.getStorageContainerManager().getEventQueue();
> eventQueue.processAll(5000L);{code}
> As we are sure that the events are already sent to the EventQueue (because we 
> have the previous waitFor), it should be enough.
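
The race described above can be reproduced outside of Ozone with a small, self-contained sketch. It is only an illustration under stated assumptions: the `Rule` class, its fields, and the 300 ms delay are hypothetical stand-ins for the two safe mode rules, and plain `java.util.concurrent` executors replace the SCM EventQueue. Waiting on the fast handler alone says nothing about the slow one; draining both handlers (the role `processAll` plays in the fix) is what makes the final state reliable.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class SafeModeRaceSketch {

  /** Hypothetical stand-in for a safe mode rule that handles events asynchronously. */
  static final class Rule {
    final AtomicInteger processed = new AtomicInteger();
    final ExecutorService executor = Executors.newSingleThreadExecutor();
    final long delayMillis;

    Rule(long delayMillis) {
      this.delayMillis = delayMillis;
    }

    /** Queues one event; the handler sleeps first to mimic slow processing. */
    void onEvent() {
      executor.submit(() -> {
        Thread.sleep(delayMillis);
        return processed.incrementAndGet();
      });
    }

    /** Blocks until every queued event has been handled or the timeout expires. */
    boolean drain(long timeoutMillis) throws InterruptedException {
      executor.shutdown();
      return executor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
    }
  }

  /** Returns {fast count, slow count before draining, slow count after draining}. */
  static int[] run() {
    Rule fast = new Rule(0);     // plays the role of ContainerSafeModeRule
    Rule slow = new Rule(300);   // plays the role of the delayed DataNodeSafeModeRule
    try {
      // The same event is delivered to both handlers.
      fast.onEvent();
      slow.onEvent();

      // Waiting ONLY for the fast handler, like the waitFor in the test.
      while (fast.processed.get() < 1) {
        Thread.sleep(10);
      }
      // Timing dependent: usually still 0 here, which is the flaky window.
      int slowBefore = slow.processed.get();

      // The fix: wait until all queued events are processed by every handler.
      fast.drain(5000L);
      slow.drain(5000L);
      return new int[] {fast.processed.get(), slowBefore, slow.processed.get()};
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    int[] r = run();
    System.out.println("fast=" + r[0] + " slowBeforeDrain=" + r[1] + " slowAfterDrain=" + r[2]);
  }
}
```

Only the values read after draining are deterministic; the count read before draining depends on scheduling, which is exactly why an assertion placed at that point is flaky.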



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
