[ https://issues.apache.org/jira/browse/HDDS-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974366#comment-16974366 ]
Stephen O'Donnell edited comment on HDDS-2459 at 11/14/19 5:25 PM:
-------------------------------------------------------------------

In the decommission design doc, we had an algorithm to determine the number of replicas that need to be created or destroyed so a container can be perfectly replicated. The algorithm was:

{code}
/**
 * Calculate the number of the missing replicas.
 *
 * @return the number of the missing replicas. If it's less than zero,
 *         the container is over replicated.
 */
int getReplicationCount(int expectedCount, int healthy, int maintenance, int inFlight) {
  // For over replication, count only the healthy replicas
  if (expectedCount < healthy) {
    return expectedCount - healthy;
  }
  int replicaCount = expectedCount - (healthy + maintenance + inFlight);
  if (replicaCount == 0 && healthy < 1) {
    replicaCount++;
  }
  // Over replication is already handled
  return Math.max(0, replicaCount);
}
{code}

The code from the design doc needs a minor correction to handle inflight deletes on over replication, so it would look like this:

{code}
public int additionalReplicaNeeded2() {
  if (repFactor < healthyCount) {
    return repFactor - healthyCount + inFlightDel;
  }
  int delta = repFactor - (healthyCount + maintenanceCount + inFlightAdd - inFlightDel);
  if (delta == 0 && healthyCount < minHealthyForMaintenance) {
    delta += minHealthyForMaintenance - healthyCount;
  }
  return Math.max(0, delta);
}
{code}

I also came up with the logic below, which is very similar although a little more verbose. The only difference between the above and the below is that in the case of 3 in_service replicas and one or more inflight deletes, the above will return 1 new replica needed, but the below will return zero. The reasoning is that the delete may fail, so we should wait to see whether it completes and then deal with the over or under replication once the inflight operations have cleared.
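To make that one difference concrete, here is a minimal, self-contained sketch that runs both variants on the case of 3 healthy replicas with 1 inflight delete. The class name `ReplicaDeltaSketch`, the method names, and passing the counts as parameters are assumptions made for illustration only; in the comment these values are fields on the surrounding class. The second method mirrors the more verbose logic that is given in full, with its Javadoc, in the block that follows.

```java
// Hypothetical sketch for illustration; not the ReplicationManager code itself.
final class ReplicaDeltaSketch {

  // Design-doc algorithm with the inflight-delete correction applied.
  static int correctedDesignDocDelta(int repFactor, int healthy, int maintenance,
      int inFlightAdd, int inFlightDel, int minHealthyForMaintenance) {
    if (repFactor < healthy) {
      return repFactor - healthy + inFlightDel;
    }
    int delta = repFactor - (healthy + maintenance + inFlightAdd - inFlightDel);
    if (delta == 0 && healthy < minHealthyForMaintenance) {
      delta += minHealthyForMaintenance - healthy;
    }
    return Math.max(0, delta);
  }

  // More verbose variant: ignores inflight operations when the healthy
  // count already equals the replication factor.
  static int verboseDelta(int repFactor, int healthy, int maintenance,
      int inFlightAdd, int inFlightDel, int minHealthyForMaintenance) {
    int delta = repFactor - healthy;
    if (delta < 0) {
      // Over replicated: pending deletes reduce the excess.
      return delta + inFlightDel;
    } else if (delta > 0) {
      // Possibly under replicated: maintenance copies offset the deficit,
      // but enough healthy replicas must still exist.
      delta = Math.max(0, delta - maintenance);
      int neededHealthy = Math.max(0, minHealthyForMaintenance - healthy);
      delta = Math.max(neededHealthy, delta);
      return delta - inFlightAdd + inFlightDel;
    }
    // Exactly repFactor healthy replicas: wait for inflight work to settle.
    return 0;
  }

  public static void main(String[] args) {
    // repFactor 3, 3 healthy, 0 maintenance, 0 inflight adds, 1 inflight delete,
    // minHealthyForMaintenance 2.
    System.out.println(correctedDesignDocDelta(3, 3, 0, 0, 1, 2)); // prints 1
    System.out.println(verboseDelta(3, 3, 0, 0, 1, 2));            // prints 0
  }
}
```

With no inflight operations the two methods agree, for example both return 1 for 2 healthy replicas and both return -1 for 4 healthy replicas at replication factor 3.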
{code}
/**
 * Calculates the delta of replicas which need to be created or removed
 * to ensure the container is correctly replicated.
 *
 * Decisions around over-replication are made only on healthy replicas,
 * ignoring any in maintenance and also any inflight adds. InFlight adds are
 * ignored, as they may not complete, so if we have:
 *
 *   H, H, H, IN_FLIGHT_ADD
 *
 * And then schedule a delete, we could end up under-replicated (add fails,
 * delete completes). It is better to let the inflight operations complete
 * and then deal with any further over or under replication.
 *
 * For maintenance replicas, assuming replication factor 3, and minHealthy
 * 2, it is possible for all 3 hosts to be put into maintenance, leaving the
 * following (H = healthy, M = maintenance):
 *
 *   H, H, M, M, M
 *
 * Even though we are tracking 5 replicas, this is not over replicated as we
 * ignore the maintenance copies. Later, the replicas could look like:
 *
 *   H, H, H, H, M
 *
 * At this stage, the container is over replicated by 1, so one replica can be
 * removed.
 *
 * For containers which have exactly replicationFactor healthy replicas, we
 * ignore any inflight adds or deletes, as they may fail. Instead, wait for
 * them to complete and then deal with any excess or deficit.
 *
 * For under replicated containers we do consider inflight adds and deletes to
 * avoid scheduling more adds than needed. There is additional logic around
 * containers with maintenance replicas to ensure minHealthyForMaintenance
 * replicas are maintained.
 *
 * @return Delta of replicas needed. Negative indicates over replication and
 *         replicas should be removed. Positive indicates under replication
 *         and zero indicates the container has replicationFactor healthy
 *         replicas.
 */
public int additionalReplicaNeeded() {
  int delta = repFactor - healthyCount;
  if (delta < 0) {
    // Over replicated, so may need to remove a replica. Do not consider
    // inFlightAdds, as they may fail, but do consider inFlightDel which
    // will reduce the over-replication if it completes.
    return delta + inFlightDel;
  } else if (delta > 0) {
    // May be under-replicated, depending on maintenance. When a container is
    // under-replicated, we must consider inflight add and delete when
    // calculating the new replicas needed.
    delta = Math.max(0, delta - maintenanceCount);
    // Check we have enough healthy replicas
    int neededHealthy = Math.max(0, minHealthyForMaintenance - healthyCount);
    delta = Math.max(neededHealthy, delta);
    return delta - inFlightAdd + inFlightDel;
  } else { // delta == 0
    // We have exactly the number of healthy replicas needed, but there may
    // be inflight add or delete. Ignore them until they complete or fail
    // and then deal with the excess or deficit.
    return delta;
  }
}
{code}

The following logic also describes the conditions the replicas for a container must meet to be considered sufficiently replicated - note that inflight adds are ignored and inflight deletes are considered until they complete:

{code}
/**
 * Return true if the container is sufficiently replicated. Decommissioning
 * and Decommissioned containers are ignored in this check, assuming they will
 * eventually be removed from the cluster.
 * This check ignores inflight additions, as those replicas have not yet been
 * created and the create could fail for some reason.
 * The check does consider inflight deletes as there may be 3 healthy replicas
 * now, but once the delete completes it will reduce to 2.
 * We also assume a replica in Maintenance state cannot be removed, so the
 * pending delete would affect only the healthy replica count.
 *
 * @return True if the container is sufficiently replicated and False
 *         otherwise.
 */
public boolean isSufficientlyReplicated() {
  return (healthyCount + maintenanceCount - inFlightDel) >= repFactor
      && healthyCount - inFlightDel >= minHealthyForMaintenance;
}
{code}

was (Author: sodonnell):

In the decommission design doc, we had an algorithm to determine the number of replicas that need to be created or destroyed so a container can be perfectly replicated. The algorithm was:

{code}
/**
 * Calculate the number of the missing replicas.
 *
 * @return the number of the missing replicas. If it's less than zero,
 *         the container is over replicated.
 */
int getReplicationCount(int expectedCount, int healthy, int maintenance, int inFlight) {
  // For over replication, count only the healthy replicas
  if (expectedCount < healthy) {
    return expectedCount - healthy;
  }
  int replicaCount = expectedCount - (healthy + maintenance + inFlight);
  if (replicaCount == 0 && healthy < 1) {
    replicaCount++;
  }
  // Over replication is already handled
  return Math.max(0, replicaCount);
}
{code}

Reflecting on this for some time, I think it is a little too simplistic and would propose the following instead. One key difference in the logic below is that maintenance replicas are not considered when calculating over replication. This is because a maintenance copy cannot be removed (the node is offline) and there is a not insignificant chance the node will fail to come back online, resulting in all its replicas getting lost.

{code}
/**
 * Calculates the delta of replicas which need to be created or removed
 * to ensure the container is correctly replicated.
 *
 * Decisions around over-replication are made only on healthy replicas,
 * ignoring any in maintenance and also any inflight adds. InFlight adds are
 * ignored, as they may not complete, so if we have:
 *
 *   H, H, H, IN_FLIGHT_ADD
 *
 * And then schedule a delete, we could end up under-replicated (add fails,
 * delete completes). It is better to let the inflight operations complete
 * and then deal with any further over or under replication.
 *
 * For maintenance replicas, assuming replication factor 3, and minHealthy
 * 2, it is possible for all 3 hosts to be put into maintenance, leaving the
 * following (H = healthy, M = maintenance):
 *
 *   H, H, M, M, M
 *
 * Even though we are tracking 5 replicas, this is not over replicated as we
 * ignore the maintenance copies. Later, the replicas could look like:
 *
 *   H, H, H, H, M
 *
 * At this stage, the container is over replicated by 1, so one replica can be
 * removed.
 *
 * For containers which have exactly replicationFactor healthy replicas, we
 * ignore any inflight adds or deletes, as they may fail. Instead, wait for
 * them to complete and then deal with any excess or deficit.
 *
 * For under replicated containers we do consider inflight adds and deletes to
 * avoid scheduling more adds than needed. There is additional logic around
 * containers with maintenance replicas to ensure minHealthyForMaintenance
 * replicas are maintained.
 *
 * @return Delta of replicas needed. Negative indicates over replication and
 *         replicas should be removed. Positive indicates under replication
 *         and zero indicates the container has replicationFactor healthy
 *         replicas.
 */
public int additionalReplicaNeeded() {
  int blockDelta = 0;
  int delta = repFactor - healthyCount;
  if (delta < 0) {
    // Over replicated, so may need to remove a replica. Do not consider
    // inFlightAdds, as they may fail, but do consider inFlightDel which
    // will reduce the over-replication if it completes.
    blockDelta = delta + inFlightDel;
  } else if (delta > 0) {
    // May be under-replicated, depending on maintenance. When a container is
    // under-replicated, we must consider inflight add and delete when
    // calculating the new replicas needed.
    if (maintenanceCount != 0) {
      // Remove maintenance copies from delta to see if it is really
      // under-replicated.
      delta = Math.max(0, delta - maintenanceCount);
      // Check we have enough healthy replicas
      int neededHealthy = Math.max(0, minHealthyForMaintenance - healthyCount);
      delta = Math.max(neededHealthy, delta);
    }
    blockDelta = delta - inFlightAdd + inFlightDel;
  } else { // delta == 0
    // We have exactly the number of healthy replicas needed, but there may
    // be inflight add or delete. Ignore them until they complete or fail
    // and then deal with the excess or deficit.
    blockDelta = delta;
  }
  return blockDelta;
}
{code}

The following logic also describes the conditions the replicas for a container must meet to be considered sufficiently replicated - note that inflight adds are ignored and inflight deletes are considered until they complete:

{code}
/**
 * Return true if the container is sufficiently replicated. Decommissioning
 * and Decommissioned containers are ignored in this check, assuming they will
 * eventually be removed from the cluster.
 * This check ignores inflight additions, as those replicas have not yet been
 * created and the create could fail for some reason.
 * The check does consider inflight deletes as there may be 3 healthy replicas
 * now, but once the delete completes it will reduce to 2.
 * We also assume a replica in Maintenance state cannot be removed, so the
 * pending delete would affect only the healthy replica count.
 *
 * @return True if the container is sufficiently replicated and False
 *         otherwise.
 */
public boolean isSufficientlyReplicated() {
  return (healthyCount + maintenanceCount - inFlightDel) >= repFactor
      && healthyCount - inFlightDel >= minHealthyForMaintenance;
}
{code}

> Refactor ReplicationManager to consider maintenance states
> ----------------------------------------------------------
>
>                 Key: HDDS-2459
>                 URL: https://issues.apache.org/jira/browse/HDDS-2459
>             Project: Hadoop Distributed Data Store
>          Issue Type: Sub-task
>          Components: SCM
>    Affects Versions: 0.5.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>
> In its current form the replication manager does not consider decommission or
> maintenance states when checking if replicas are sufficiently replicated.
> With the introduction of maintenance states, it needs to consider
> decommission and maintenance states when deciding if blocks are over or under
> replicated.
> It also needs to provide an API to allow the decommission manager to check if
> blocks are over or under replicated, so the decommission manager can decide
> if a node has completed decommission and maintenance or not.