adoroszlai commented on code in PR #9135:
URL: https://github.com/apache/ozone/pull/9135#discussion_r2573541401
##########
hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/debug/replicas/ContainerStateVerifier.java:
##########
@@ -155,18 +166,57 @@ private ContainerInfoToken getContainerInfoToken(long
containerId)
// Cache miss - fetch and store
ContainerInfo info = containerOperationClient.getContainer(containerId);
String encodeToken =
containerOperationClient.getEncodedContainerToken(containerId);
- cachedData = new ContainerInfoToken(info.getState(), encodeToken);
+ cachedData = new ContainerInfoToken(info.getState(), encodeToken, info);
encodedTokenCache.put(containerId, cachedData);
return cachedData;
}
+ private String checkReplicationStatus(long containerId) {
+ try {
+ ContainerInfoToken token = getContainerInfoToken(containerId);
+ ContainerInfo containerInfo = token.getContainerInfo();
+
+ // Get container replicas from SCM
+ List<ContainerReplicaInfo> replicaInfos =
+ containerOperationClient.getContainerReplicas(containerId);
+
+ if (replicaInfos.isEmpty()) {
+ return ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED + ":
no replicas found";
+ }
+
+ int replicationFactor = containerInfo.getReplicationFactor().getNumber();
+ int healthyReplicas = 0;
+
+ for (ContainerReplicaInfo replicaInfo : replicaInfos) {
+ if (!replicaInfo.getState().equals("UNHEALTHY")) {
+ healthyReplicas++;
+ }
+ }
+
+ if (healthyReplicas < replicationFactor) {
+ return String.format("%s: %d/%d healthy replicas",
+ ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED,
healthyReplicas, replicationFactor);
+ } else if (healthyReplicas > replicationFactor) {
+ return String.format("%s: %d/%d healthy replicas",
+ ContainerHealthResult.ReplicationStatus.OVER_REPLICATED,
healthyReplicas, replicationFactor);
+ } else {
+ return
ContainerHealthResult.ReplicationStatus.HEALTHY_REPLICATION.toString();
+ }
+
+ } catch (Exception e) {
+ return "REPLICATION_CHECK_FAILED: " + e.getMessage();
+ }
+ }
+
private static class ContainerInfoToken {
private HddsProtos.LifeCycleState state;
private final String encodedToken;
+ private final ContainerInfo containerInfo;
Review Comment:
`ContainerInfoToken` is stored in `encodedTokenCache`, with default cache
size of 1 million. `ContainerInfo` is a much larger object, so the process may
run out of memory. The new logic only requires the number of nodes in addition
to existing items, so storing the complete `ContainerInfo` object is
unnecessary. But I don't think even "number of required nodes" is the right
information to cache (see my comment on `getContainerReplicas`).
Also, expected memory requirement of the cache is mentioned in CLI help. It
should be updated to reflect the increased size:
https://github.com/apache/ozone/blob/8737837d5c8493bf39ba36e0e01ad70aab531339/hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/debug/replicas/ReplicasVerify.java#L86
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ContainerHealthResult.java:
##########
@@ -402,4 +402,21 @@ public enum HealthState {
OVER_REPLICATED,
MIS_REPLICATED
}
+
+ /**
+ * Enum representing replication status for containers.
+ * Used to indicate whether a container is under-replicated, over-replicated,
+ * or has healthy replication. This is specifically used for toString()
output
+ * and display purposes.
+ */
+ public enum ReplicationStatus {
+ UNDER_REPLICATED,
+ OVER_REPLICATED,
+ HEALTHY_REPLICATION;
Review Comment:
I'm not sure we need this new enum, it seems to be a simplified version of
`HealthState`. Since this is just for output, I think using a subset of those
states is fine.
##########
hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/debug/replicas/ContainerStateVerifier.java:
##########
@@ -155,18 +166,57 @@ private ContainerInfoToken getContainerInfoToken(long
containerId)
// Cache miss - fetch and store
ContainerInfo info = containerOperationClient.getContainer(containerId);
String encodeToken =
containerOperationClient.getEncodedContainerToken(containerId);
- cachedData = new ContainerInfoToken(info.getState(), encodeToken);
+ cachedData = new ContainerInfoToken(info.getState(), encodeToken, info);
encodedTokenCache.put(containerId, cachedData);
return cachedData;
}
+ private String checkReplicationStatus(long containerId) {
+ try {
+ ContainerInfoToken token = getContainerInfoToken(containerId);
+ ContainerInfo containerInfo = token.getContainerInfo();
+
+ // Get container replicas from SCM
+ List<ContainerReplicaInfo> replicaInfos =
+ containerOperationClient.getContainerReplicas(containerId);
+
+ if (replicaInfos.isEmpty()) {
+ return ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED + ":
no replicas found";
+ }
+
+ int replicationFactor = containerInfo.getReplicationFactor().getNumber();
Review Comment:
`ReplicationFactor` does not support EC. Please use
`getReplicationConfig().getRequiredNodes()`.
```
REPLICATION_CHECK_FAILED: Replication configuration of type EC does not have
a replication factor property.
```
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/RatisContainerReplicaCount.java:
##########
@@ -257,6 +257,17 @@ public String toString() {
if (considerUnhealthy) {
result += " +considerUnhealthy";
}
+
+ if (!isSufficientlyReplicated()) {
+ result += " " + ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED
+ ": need "
+ + additionalReplicaNeeded() + " more";
+ } else if (isOverReplicated()) {
+ result += " " + ContainerHealthResult.ReplicationStatus.OVER_REPLICATED
+ ": excess "
+ + getExcessRedundancy(true) + " replica(s)";
+ } else {
+ result += " " +
ContainerHealthResult.ReplicationStatus.HEALTHY_REPLICATION;
+ }
+
Review Comment:
Similarly, please delete or move.
##########
hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/debug/replicas/ContainerStateVerifier.java:
##########
@@ -155,18 +166,57 @@ private ContainerInfoToken getContainerInfoToken(long
containerId)
// Cache miss - fetch and store
ContainerInfo info = containerOperationClient.getContainer(containerId);
String encodeToken =
containerOperationClient.getEncodedContainerToken(containerId);
- cachedData = new ContainerInfoToken(info.getState(), encodeToken);
+ cachedData = new ContainerInfoToken(info.getState(), encodeToken, info);
encodedTokenCache.put(containerId, cachedData);
return cachedData;
}
+ private String checkReplicationStatus(long containerId) {
+ try {
+ ContainerInfoToken token = getContainerInfoToken(containerId);
+ ContainerInfo containerInfo = token.getContainerInfo();
+
+ // Get container replicas from SCM
+ List<ContainerReplicaInfo> replicaInfos =
+ containerOperationClient.getContainerReplicas(containerId);
Review Comment:
`verifyBlock` is called for a specific `datanode`, but
`checkReplicationStatus` fetches info for all replicas. So we are checking all
replicas for each replica. This should be done once and stored in the cache,
instead of the `ContainerInfo` or "number of required nodes".
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ContainerHealthResult.java:
##########
@@ -402,4 +402,21 @@ public enum HealthState {
OVER_REPLICATED,
MIS_REPLICATED
}
+
+ /**
+ * Enum representing replication status for containers.
+ * Used to indicate whether a container is under-replicated, over-replicated,
+ * or has healthy replication. This is specifically used for toString()
output
+ * and display purposes.
+ */
+ public enum ReplicationStatus {
+ UNDER_REPLICATED,
+ OVER_REPLICATED,
+ HEALTHY_REPLICATION;
+
+ @Override
+ public String toString() {
+ return name();
+ }
Review Comment:
nit: not needed, since enums provide this by default.
```suggestion
```
##########
hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/debug/replicas/ContainerStateVerifier.java:
##########
@@ -155,18 +166,57 @@ private ContainerInfoToken getContainerInfoToken(long
containerId)
// Cache miss - fetch and store
ContainerInfo info = containerOperationClient.getContainer(containerId);
String encodeToken =
containerOperationClient.getEncodedContainerToken(containerId);
- cachedData = new ContainerInfoToken(info.getState(), encodeToken);
+ cachedData = new ContainerInfoToken(info.getState(), encodeToken, info);
encodedTokenCache.put(containerId, cachedData);
return cachedData;
}
+ private String checkReplicationStatus(long containerId) {
+ try {
+ ContainerInfoToken token = getContainerInfoToken(containerId);
+ ContainerInfo containerInfo = token.getContainerInfo();
+
+ // Get container replicas from SCM
+ List<ContainerReplicaInfo> replicaInfos =
+ containerOperationClient.getContainerReplicas(containerId);
+
+ if (replicaInfos.isEmpty()) {
+ return ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED + ":
no replicas found";
+ }
+
+ int replicationFactor = containerInfo.getReplicationFactor().getNumber();
+ int healthyReplicas = 0;
+
+ for (ContainerReplicaInfo replicaInfo : replicaInfos) {
+ if (!replicaInfo.getState().equals("UNHEALTHY")) {
+ healthyReplicas++;
+ }
+ }
+
+ if (healthyReplicas < replicationFactor) {
+ return String.format("%s: %d/%d healthy replicas",
+ ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED,
healthyReplicas, replicationFactor);
+ } else if (healthyReplicas > replicationFactor) {
+ return String.format("%s: %d/%d healthy replicas",
+ ContainerHealthResult.ReplicationStatus.OVER_REPLICATED,
healthyReplicas, replicationFactor);
+ } else {
+ return
ContainerHealthResult.ReplicationStatus.HEALTHY_REPLICATION.toString();
+ }
Review Comment:
nit: avoid duplication.
```suggestion
if (healthyReplicas == replicationFactor) {
return
ContainerHealthResult.ReplicationStatus.HEALTHY_REPLICATION.toString();
}
ContainerHealthResult.ReplicationStatus status = healthyReplicas <
replicationFactor
? ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED
: ContainerHealthResult.ReplicationStatus.OVER_REPLICATED;
return String.format("%s: %d/%d healthy replicas", status,
healthyReplicas, replicationFactor);
```
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ECContainerReplicaCount.java:
##########
@@ -573,6 +573,19 @@ public String toString() {
.append(", ReplicationConfig: ").append(repConfig)
.append(", RemainingMaintenanceRedundancy: ")
.append(remainingMaintenanceRedundancy);
+
+ if (!isSufficientlyReplicated()) {
+ List<Integer> missingIndexes = unavailableIndexes(true);
+ sb.append('
').append(ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED)
+ .append(": missing indexes ").append(missingIndexes);
+ } else if (isOverReplicated()) {
+ List<Integer> excessIndexes = overReplicatedIndexes(false);
+ sb.append('
').append(ContainerHealthResult.ReplicationStatus.OVER_REPLICATED)
+ .append(": excess indexes ").append(excessIndexes);
+ } else {
+ sb.append('
').append(ContainerHealthResult.ReplicationStatus.HEALTHY_REPLICATION);
+ }
+
Review Comment:
I don't see where this output is used for the new functionality. Also, I
don't think `toString()` should be performing such logic. So if I'm missing
something and it is indeed required, please move it to a separate function,
otherwise please remove it.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]