Re: [PR] HDDS-13535. Flag to indicate under and over replications in container-state check [ozone]

via GitHub Sun, 30 Nov 2025 02:52:27 -0800


adoroszlai commented on code in PR #9135:
URL: https://github.com/apache/ozone/pull/9135#discussion_r2573541401



##########
hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/debug/replicas/ContainerStateVerifier.java:
##########
@@ -155,18 +166,57 @@ private ContainerInfoToken getContainerInfoToken(long 
containerId)
     // Cache miss - fetch and store
     ContainerInfo info = containerOperationClient.getContainer(containerId);
     String encodeToken = 
containerOperationClient.getEncodedContainerToken(containerId);
-    cachedData = new ContainerInfoToken(info.getState(), encodeToken);
+    cachedData = new ContainerInfoToken(info.getState(), encodeToken, info);
     encodedTokenCache.put(containerId, cachedData);
     return cachedData;
   }
 
+  private String checkReplicationStatus(long containerId) {
+    try {
+      ContainerInfoToken token = getContainerInfoToken(containerId);
+      ContainerInfo containerInfo = token.getContainerInfo();
+      
+      // Get container replicas from SCM
+      List<ContainerReplicaInfo> replicaInfos = 
+          containerOperationClient.getContainerReplicas(containerId);
+      
+      if (replicaInfos.isEmpty()) {
+        return ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED + ": 
no replicas found";
+      }
+      
+      int replicationFactor = containerInfo.getReplicationFactor().getNumber();
+      int healthyReplicas = 0;
+      
+      for (ContainerReplicaInfo replicaInfo : replicaInfos) {
+        if (!replicaInfo.getState().equals("UNHEALTHY")) {
+          healthyReplicas++;
+        }
+      }
+      
+      if (healthyReplicas < replicationFactor) {
+        return String.format("%s: %d/%d healthy replicas", 
+            ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED, 
healthyReplicas, replicationFactor);
+      } else if (healthyReplicas > replicationFactor) {
+        return String.format("%s: %d/%d healthy replicas", 
+            ContainerHealthResult.ReplicationStatus.OVER_REPLICATED, 
healthyReplicas, replicationFactor);
+      } else {
+        return 
ContainerHealthResult.ReplicationStatus.HEALTHY_REPLICATION.toString();
+      }
+      
+    } catch (Exception e) {
+      return "REPLICATION_CHECK_FAILED: " + e.getMessage();
+    }
+  }
+
   private static class ContainerInfoToken {
     private HddsProtos.LifeCycleState state;
     private final String encodedToken;
+    private final ContainerInfo containerInfo;

Review Comment:
   `ContainerInfoToken` is stored in `encodedTokenCache`, with default cache 
size of 1 million.  `ContainerInfo` is a much larger object, so the process may 
run out of memory.  The new logic only requires the number of nodes in addition 
to existing items, so storing the complete `ContainerInfo` object is 
unnecessary.  But I don't think even "number of required nodes" is the right 
information to cache (see my comment on `getContainerReplicas`).
   
   Also, expected memory requirement of the cache is mentioned in CLI help.  It 
should be updated to reflect the increased size:
   
   
https://github.com/apache/ozone/blob/8737837d5c8493bf39ba36e0e01ad70aab531339/hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/debug/replicas/ReplicasVerify.java#L86



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ContainerHealthResult.java:
##########
@@ -402,4 +402,21 @@ public enum HealthState {
     OVER_REPLICATED,
     MIS_REPLICATED
   }
+
+  /**
+   * Enum representing replication status for containers.
+   * Used to indicate whether a container is under-replicated, over-replicated,
+   * or has healthy replication. This is specifically used for toString() 
output
+   * and display purposes.
+   */
+  public enum ReplicationStatus {
+    UNDER_REPLICATED,
+    OVER_REPLICATED,
+    HEALTHY_REPLICATION;

Review Comment:
   I'm not sure we need this new enum, it seems to be a simplified version of 
`HealthState`.  Since this is just for output, I think using a subset of those 
states is fine.



##########
hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/debug/replicas/ContainerStateVerifier.java:
##########
@@ -155,18 +166,57 @@ private ContainerInfoToken getContainerInfoToken(long 
containerId)
     // Cache miss - fetch and store
     ContainerInfo info = containerOperationClient.getContainer(containerId);
     String encodeToken = 
containerOperationClient.getEncodedContainerToken(containerId);
-    cachedData = new ContainerInfoToken(info.getState(), encodeToken);
+    cachedData = new ContainerInfoToken(info.getState(), encodeToken, info);
     encodedTokenCache.put(containerId, cachedData);
     return cachedData;
   }
 
+  private String checkReplicationStatus(long containerId) {
+    try {
+      ContainerInfoToken token = getContainerInfoToken(containerId);
+      ContainerInfo containerInfo = token.getContainerInfo();
+      
+      // Get container replicas from SCM
+      List<ContainerReplicaInfo> replicaInfos = 
+          containerOperationClient.getContainerReplicas(containerId);
+      
+      if (replicaInfos.isEmpty()) {
+        return ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED + ": 
no replicas found";
+      }
+      
+      int replicationFactor = containerInfo.getReplicationFactor().getNumber();

Review Comment:
   `ReplicationFactor` does not support EC.  Please use 
`getReplicationConfig().getRequiredNodes()`.
   
   ```
   REPLICATION_CHECK_FAILED: Replication configuration of type EC does not have 
a replication factor property.
   ```



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/RatisContainerReplicaCount.java:
##########
@@ -257,6 +257,17 @@ public String toString() {
     if (considerUnhealthy) {
       result += " +considerUnhealthy";
     }
+
+    if (!isSufficientlyReplicated()) {
+      result += " " + ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED 
+ ": need "
+          + additionalReplicaNeeded() + " more";
+    } else if (isOverReplicated()) {
+      result += " " + ContainerHealthResult.ReplicationStatus.OVER_REPLICATED 
+ ": excess "
+          + getExcessRedundancy(true) + " replica(s)";
+    } else {
+      result += " " + 
ContainerHealthResult.ReplicationStatus.HEALTHY_REPLICATION;
+    }
+    

Review Comment:
   Similarly, please delete or move.



##########
hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/debug/replicas/ContainerStateVerifier.java:
##########
@@ -155,18 +166,57 @@ private ContainerInfoToken getContainerInfoToken(long 
containerId)
     // Cache miss - fetch and store
     ContainerInfo info = containerOperationClient.getContainer(containerId);
     String encodeToken = 
containerOperationClient.getEncodedContainerToken(containerId);
-    cachedData = new ContainerInfoToken(info.getState(), encodeToken);
+    cachedData = new ContainerInfoToken(info.getState(), encodeToken, info);
     encodedTokenCache.put(containerId, cachedData);
     return cachedData;
   }
 
+  private String checkReplicationStatus(long containerId) {
+    try {
+      ContainerInfoToken token = getContainerInfoToken(containerId);
+      ContainerInfo containerInfo = token.getContainerInfo();
+      
+      // Get container replicas from SCM
+      List<ContainerReplicaInfo> replicaInfos = 
+          containerOperationClient.getContainerReplicas(containerId);

Review Comment:
   `verifyBlock` is called for a specific `datanode`, but 
`checkReplicationStatus` fetches info for all replicas.  So we are checking all 
replicas for each replica.  This should be done once and stored in the cache, 
instead of the `ContainerInfo` or "number of required nodes".



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ContainerHealthResult.java:
##########
@@ -402,4 +402,21 @@ public enum HealthState {
     OVER_REPLICATED,
     MIS_REPLICATED
   }
+
+  /**
+   * Enum representing replication status for containers.
+   * Used to indicate whether a container is under-replicated, over-replicated,
+   * or has healthy replication. This is specifically used for toString() 
output
+   * and display purposes.
+   */
+  public enum ReplicationStatus {
+    UNDER_REPLICATED,
+    OVER_REPLICATED,
+    HEALTHY_REPLICATION;
+
+    @Override
+    public String toString() {
+      return name();
+    }

Review Comment:
   nit: not needed, since enums provide this by default.
   
   ```suggestion
   ```



##########
hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/debug/replicas/ContainerStateVerifier.java:
##########
@@ -155,18 +166,57 @@ private ContainerInfoToken getContainerInfoToken(long 
containerId)
     // Cache miss - fetch and store
     ContainerInfo info = containerOperationClient.getContainer(containerId);
     String encodeToken = 
containerOperationClient.getEncodedContainerToken(containerId);
-    cachedData = new ContainerInfoToken(info.getState(), encodeToken);
+    cachedData = new ContainerInfoToken(info.getState(), encodeToken, info);
     encodedTokenCache.put(containerId, cachedData);
     return cachedData;
   }
 
+  private String checkReplicationStatus(long containerId) {
+    try {
+      ContainerInfoToken token = getContainerInfoToken(containerId);
+      ContainerInfo containerInfo = token.getContainerInfo();
+      
+      // Get container replicas from SCM
+      List<ContainerReplicaInfo> replicaInfos = 
+          containerOperationClient.getContainerReplicas(containerId);
+      
+      if (replicaInfos.isEmpty()) {
+        return ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED + ": 
no replicas found";
+      }
+      
+      int replicationFactor = containerInfo.getReplicationFactor().getNumber();
+      int healthyReplicas = 0;
+      
+      for (ContainerReplicaInfo replicaInfo : replicaInfos) {
+        if (!replicaInfo.getState().equals("UNHEALTHY")) {
+          healthyReplicas++;
+        }
+      }
+      
+      if (healthyReplicas < replicationFactor) {
+        return String.format("%s: %d/%d healthy replicas", 
+            ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED, 
healthyReplicas, replicationFactor);
+      } else if (healthyReplicas > replicationFactor) {
+        return String.format("%s: %d/%d healthy replicas", 
+            ContainerHealthResult.ReplicationStatus.OVER_REPLICATED, 
healthyReplicas, replicationFactor);
+      } else {
+        return 
ContainerHealthResult.ReplicationStatus.HEALTHY_REPLICATION.toString();
+      }

Review Comment:
   nit: avoid duplication.
   
   ```suggestion
         if (healthyReplicas == replicationFactor) {
           return 
ContainerHealthResult.ReplicationStatus.HEALTHY_REPLICATION.toString();
         }
   
         ContainerHealthResult.ReplicationStatus status = healthyReplicas < 
replicationFactor
             ? ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED
             : ContainerHealthResult.ReplicationStatus.OVER_REPLICATED;
         return String.format("%s: %d/%d healthy replicas", status, 
healthyReplicas, replicationFactor);
   ```



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ECContainerReplicaCount.java:
##########
@@ -573,6 +573,19 @@ public String toString() {
         .append(", ReplicationConfig: ").append(repConfig)
         .append(", RemainingMaintenanceRedundancy: ")
         .append(remainingMaintenanceRedundancy);
+    
+    if (!isSufficientlyReplicated()) {
+      List<Integer> missingIndexes = unavailableIndexes(true);
+      sb.append(' 
').append(ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED)
+          .append(": missing indexes ").append(missingIndexes);
+    } else if (isOverReplicated()) {
+      List<Integer> excessIndexes = overReplicatedIndexes(false);
+      sb.append(' 
').append(ContainerHealthResult.ReplicationStatus.OVER_REPLICATED)
+          .append(": excess indexes ").append(excessIndexes);
+    } else {
+      sb.append(' 
').append(ContainerHealthResult.ReplicationStatus.HEALTHY_REPLICATION);
+    }
+      

Review Comment:
   I don't see where this output is used for the new functionality.  Also, I 
don't think `toString()` should be performing such logic.  So if I'm missing 
something and it is indeed required, please move it to a separate function, 
otherwise please remove it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HDDS-13535. Flag to indicate under and over replications in container-state check [ozone]

Reply via email to