[jira] [Work logged] (HDDS-1454) GC other system pause events can trigger pipeline destroy for all the nodes in the cluster

ASF GitHub Bot (JIRA) Tue, 18 Jun 2019 10:05:20 -0700


     [ https://issues.apache.org/jira/browse/HDDS-1454?focusedWorklogId=262478&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-262478 ]


ASF GitHub Bot logged work on HDDS-1454:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 18/Jun/19 17:04
            Start Date: 18/Jun/19 17:04
    Worklog Time Spent: 10m 
      Work Description: supratimdeka commented on pull request #852: HDDS-1454. 
GC other system pause events can trigger pipeline destroy for all the nodes in 
the cluster. Contributed by Supratim Deka
URL: https://github.com/apache/hadoop/pull/852#discussion_r294930181
 
 

 ##########
 File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeStateManager.java
 ##########
 @@ -558,41 +619,40 @@ public void run() {
           heartbeatCheckerIntervalMs);
     }
 
-    // we purposefully make this non-deterministic. Instead of using a
-    // scheduleAtFixedFrequency  we will just go to sleep
-    // and wake up at the next rendezvous point, which is currentTime +
-    // heartbeatCheckerIntervalMs. This leads to the issue that we are now
-    // heart beating not at a fixed cadence, but clock tick + time taken to
-    // work.
-    //
-    // This time taken to work can skew the heartbeat processor thread.
-    // The reason why we don't care is because of the following reasons.
-    //
-    // 1. checkerInterval is general many magnitudes faster than datanode HB
-    // frequency.
-    //
-    // 2. if we have too much nodes, the SCM would be doing only HB
-    // processing, this could lead to SCM's CPU starvation. With this
-    // approach we always guarantee that  HB thread sleeps for a little while.
-    //
-    // 3. It is possible that we will never finish processing the HB's in the
-    // thread. But that means we have a mis-configured system. We will warn
-    // the users by logging that information.
-    //
-    // 4. And the most important reason, heartbeats are not blocked even if
-    // this thread does not run, they will go into the processing queue.
+  }
+
+  private void scheduleNextHealthCheck() {
 
     if (!Thread.currentThread().isInterrupted() &&
         !executorService.isShutdown()) {
       //BUGBUG: The return future needs to checked here to make sure the
       // exceptions are handled correctly.
-      executorService.schedule(this, heartbeatCheckerIntervalMs,
-          TimeUnit.MILLISECONDS);
+      healthCheckFuture = executorService.schedule(this,
+          heartbeatCheckerIntervalMs, TimeUnit.MILLISECONDS);
     } else {
-      LOG.info("Current Thread is interrupted, shutting down HB processing " +
+      LOG.warn("Current Thread is interrupted, shutting down HB processing " +
           "thread for Node Manager.");
     }
 
+    lastHealthCheck = Time.monotonicNow();
+  }
+
+  /**
+   * if the time since last check exceeds the stale|dead node interval, skip.
+   * such long delays might be caused by a JVM pause. SCM cannot make reliable
+   * conclusions about datanode health in such situations.
+   * @return : true indicates skip HB checks
+   */
+  private boolean shouldSkipCheck() {
+
+    long currentTime = Time.monotonicNow();
+    long minInterval = Math.min(staleNodeIntervalMs, deadNodeIntervalMs);
+
+    if ((currentTime - lastHealthCheck) >= minInterval) {
+      return true;
+    }
+
+    return false;
 
 Review comment:
   done
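
The guard in the hunk above can be read in isolation as the following minimal sketch. It is not the NodeStateManager code from the pull request: the class name PauseAwareChecker, the method runOnce, and the injectable clock are hypothetical and exist only to make the logic testable. The sketch also assumes that the caller skips node evaluation when shouldSkipCheck() returns true and still records the time of the round, which the visible part of the diff suggests but does not show in full.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical sketch of the pause guard; not the code from the pull request. */
class PauseAwareChecker {
  private final LongSupplier clockMs;      // monotonic clock, in milliseconds
  private final long staleNodeIntervalMs;
  private final long deadNodeIntervalMs;
  private long lastHealthCheck;

  PauseAwareChecker(LongSupplier clockMs, long staleMs, long deadMs) {
    this.clockMs = clockMs;
    this.staleNodeIntervalMs = staleMs;
    this.deadNodeIntervalMs = deadMs;
    this.lastHealthCheck = clockMs.getAsLong();
  }

  /** True when the gap since the last round is at least the smaller interval. */
  boolean shouldSkipCheck() {
    long minInterval = Math.min(staleNodeIntervalMs, deadNodeIntervalMs);
    return (clockMs.getAsLong() - lastHealthCheck) >= minInterval;
  }

  /** Runs one round; returns true if node states were actually evaluated. */
  boolean runOnce(Runnable evaluateNodes) {
    boolean skip = shouldSkipCheck();
    if (!skip) {
      evaluateNodes.run();
    }
    // Record the time of this round whether or not it was skipped, so that
    // a single long pause suppresses only one round of checks.
    lastHealthCheck = clockMs.getAsLong();
    return !skip;
  }

  public static void main(String[] args) {
    AtomicLong fakeClock = new AtomicLong(0);
    PauseAwareChecker checker =
        new PauseAwareChecker(fakeClock::get, 5_000, 10_000);

    fakeClock.addAndGet(1_000);   // normal 1 s gap between rounds
    System.out.println("evaluated: " + checker.runOnce(() -> { }));

    fakeClock.addAndGet(6_000);   // simulated 6 s pause, above the 5 s interval
    System.out.println("evaluated: " + checker.runOnce(() -> { }));
  }
}
```

With a real clock, the supplier could be something like `() -> System.nanoTime() / 1_000_000`; the pull request itself uses Hadoop's Time.monotonicNow(). A monotonic source matters here because a wall-clock adjustment would otherwise look like a pause.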
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 262478)
    Time Spent: 1h 10m  (was: 1h)

> GC other system pause events can trigger pipeline destroy for all the nodes 
> in the cluster
> ------------------------------------------------------------------------------------------
>
>                 Key: HDDS-1454
>                 URL: https://issues.apache.org/jira/browse/HDDS-1454
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: SCM
>            Reporter: Mukul Kumar Singh
>            Assignee: Supratim Deka
>            Priority: Major
>              Labels: MiniOzoneChaosCluster, pull-request-available
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In a MiniOzoneChaosCluster run, it was observed that GC pauses or any other 
> pauses in SCM can cause SCM to mark all the datanodes as stale. This triggers 
> the destruction of multiple pipelines and renders the system unusable. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
