Re: [PR] HDDS-13590. Refactor HealthyPipelineSafeModeRule to not use PipelineReportFromDatanode [ozone]

via GitHub Wed, 28 Jan 2026 10:33:57 -0800


sumitagrawl commented on code in PR #9651:
URL: https://github.com/apache/ozone/pull/9651#discussion_r2737967493



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/HealthyPipelineSafeModeRule.java:
##########
@@ -123,81 +117,62 @@ protected synchronized boolean validate() {
       LOG.info("All SCM pipelines are closed due to ongoing upgrade " +
           "finalization. Bypassing healthy pipeline safemode rule.");
       return true;
+    }
+    // Query PipelineManager directly for healthy pipeline count

Review Comment:
   In current implementation, if Rule is satisfied once, its marked exit. Never 
enter again. But with having dynamic rule check, one time may be success and 
other time may be failure.
   This may not satisfy the purpose.



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/HealthyPipelineSafeModeRule.java:
##########
@@ -123,81 +117,62 @@ protected synchronized boolean validate() {
       LOG.info("All SCM pipelines are closed due to ongoing upgrade " +
           "finalization. Bypassing healthy pipeline safemode rule.");
       return true;
+    }
+    // Query PipelineManager directly for healthy pipeline count
+    List<Pipeline> openPipelines = pipelineManager.getPipelines(
+        RatisReplicationConfig.getInstance(HddsProtos.ReplicationFactor.THREE),
+        Pipeline.PipelineState.OPEN);
+    
+    LOG.info("Found {} open RATIS/THREE pipelines", openPipelines.size());
+    currentHealthyPipelineCount = (int) openPipelines.stream()
+        .filter(this::isPipelineHealthy)

Review Comment:
   Earlier, for every pipeline event, pipeline info is logged once. But now it 
will be logged for all pipeline for every event, can result in a lot of INFO 
logs for pipeline only.
   Also will be triggered for EC pipeline type to be logged which is not logged 
earlier.



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/HealthyPipelineSafeModeRule.java:
##########
@@ -123,81 +117,62 @@ protected synchronized boolean validate() {
       LOG.info("All SCM pipelines are closed due to ongoing upgrade " +
           "finalization. Bypassing healthy pipeline safemode rule.");
       return true;
+    }
+    // Query PipelineManager directly for healthy pipeline count
+    List<Pipeline> openPipelines = pipelineManager.getPipelines(
+        RatisReplicationConfig.getInstance(HddsProtos.ReplicationFactor.THREE),
+        Pipeline.PipelineState.OPEN);
+    
+    LOG.info("Found {} open RATIS/THREE pipelines", openPipelines.size());
+    currentHealthyPipelineCount = (int) openPipelines.stream()
+        .filter(this::isPipelineHealthy)
+        .count();
+    
getSafeModeMetrics().setNumCurrentHealthyPipelines(currentHealthyPipelineCount);
+    boolean isValid = currentHealthyPipelineCount >= 
healthyPipelineThresholdCount;
+    if (scmInSafeMode()) {
+      LOG.info("SCM in safe mode. Healthy pipelines: {}, threshold: {}, rule 
satisfied: {}",
+          currentHealthyPipelineCount, healthyPipelineThresholdCount, isValid);
     } else {
-      return currentHealthyPipelineCount >= healthyPipelineThresholdCount;
+      LOG.debug("SCM not in safe mode. Healthy pipelines: {}, threshold: {}",
+          currentHealthyPipelineCount, healthyPipelineThresholdCount);
     }
+    return isValid;
   }
 
-  @Override
-  protected synchronized void process(Pipeline pipeline) {
-    Objects.requireNonNull(pipeline, "pipeline == null");
-
-    // When SCM is in safe mode for long time, already registered
-    // datanode can send pipeline report again, or SCMPipelineManager will
-    // create new pipelines.
-
-    // Only handle RATIS + 3-replica pipelines.
-    if (pipeline.getType() != HddsProtos.ReplicationType.RATIS ||
-        ((RatisReplicationConfig) 
pipeline.getReplicationConfig()).getReplicationFactor() !=
-            HddsProtos.ReplicationFactor.THREE) {
-      Logger safeModeManagerLog = SCMSafeModeManager.getLogger();
-      if (safeModeManagerLog.isDebugEnabled()) {
-        safeModeManagerLog.debug("Skipping pipeline safemode report processing 
as Replication type isn't RATIS " +
-            "or replication factor isn't 3.");
-      }
-      return;
-    }
-
-    // Skip already processed ones.
-    if (processedPipelineIDs.contains(pipeline.getId())) {
-      LOG.info("Skipping pipeline safemode report processing check as 
pipeline: {} is already recorded.",
-          pipeline.getId());
-      return;
+  private boolean isPipelineHealthy(Pipeline pipeline) {
+    // Verify pipeline has all 3 nodes
+    List<DatanodeDetails> nodes = pipeline.getNodes();
+    if (nodes.size() != 3) {
+      LOG.info("Pipeline {} is not healthy: has {} nodes instead of 3",

Review Comment:
   this logging will happen now **3 times** for every pipeline event, 2 times 
in validate() as called pre-process() and post-process(), and one time in 
getStatus(). And this will be done for every pipeline in the SCM for each event.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HDDS-13590. Refactor HealthyPipelineSafeModeRule to not use PipelineReportFromDatanode [ozone]

Reply via email to