[
https://issues.apache.org/jira/browse/HDDS-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HDDS-8335:
---------------------------------
Labels: pull-request-available (was: )
> ReplicationManager: EC Mis and Under replication handler should handle
> overloaded exceptions
> --------------------------------------------------------------------------------------------
>
> Key: HDDS-8335
> URL: https://issues.apache.org/jira/browse/HDDS-8335
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: SCM
> Reporter: Stephen O'Donnell
> Assignee: Stephen O'Donnell
> Priority: Major
> Labels: pull-request-available
>
> In RatisOverReplicationHandler and ECOverReplicationHandler, a container can
> be over replicated by several replicas, and the deletes are done in two
> stages:
> 1. First unhealthy replicas are removed.
> 2. Then healthy replicas are removed.
> While removing any replica, the handler could get a
> CommandTargetOverloadedException, but rather than throwing that exception
> immediately, it continues trying other replicas. At the end, if it has not
> deleted enough replicas, it re-throws the first
> CommandTargetOverloadedException so the over replication is re-queued on the
> over replication queue.
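
The deferred re-throw described above can be sketched roughly as follows.
This is only an illustration, not the Ozone implementation:
OverloadedException, DeleteSender and removeExcess are hypothetical
stand-ins (OverloadedException plays the role of
CommandTargetOverloadedException), and replicas are plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for CommandTargetOverloadedException.
class OverloadedException extends Exception {
  OverloadedException(String message) {
    super(message);
  }
}

// Hypothetical sender: queues a delete for a replica, or throws if the
// node holding that replica is overloaded.
interface DeleteSender {
  void sendDelete(String replica) throws OverloadedException;
}

final class DeferredOverloadSketch {
  private DeferredOverloadSketch() { }

  /**
   * Sends up to `excess` deletes, unhealthy replicas first, then healthy.
   * An overloaded target is skipped rather than aborting the whole pass.
   * Every replica that received a delete is appended to `sent`. If fewer
   * than `excess` deletes went out, the first overload exception seen is
   * re-thrown so the caller can re-queue the container.
   */
  static void removeExcess(List<String> unhealthy, List<String> healthy,
      int excess, DeleteSender sender, List<String> sent)
      throws OverloadedException {
    OverloadedException firstFailure = null;

    // Stage 1 candidates come before stage 2 candidates.
    List<String> candidates = new ArrayList<>(unhealthy);
    candidates.addAll(healthy);

    for (String replica : candidates) {
      if (sent.size() >= excess) {
        break; // enough deletes have been sent
      }
      try {
        sender.sendDelete(replica);
        sent.add(replica);
      } catch (OverloadedException e) {
        // Remember only the first failure and keep trying other replicas.
        if (firstFailure == null) {
          firstFailure = e;
        }
      }
    }

    if (sent.size() < excess && firstFailure != null) {
      throw firstFailure;
    }
  }
}
```

Passing `sent` in from the caller means the partial progress is still
visible when the exception is thrown, which is what lets the caller
re-queue only the remaining work.
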
> Other handlers also have multiple stages, but in the event of an error like
> CommandTargetOverloadedException, they give up immediately.
> RatisOverReplicationHandler works as expected. So does
> ECOverReplicationHandler.
> For RatisUnderReplicationHandler, the command target is the source, and
> RM.sendThrottledReplicationCommand() handles picking the lowest loaded
> source. It is possible to send one command and then fail to send the
> second, but there is no point in retrying, as that means all the sources
> are overloaded. As things stand, it sends what it can and then throws an
> exception, so that is fine.
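
To illustrate why retrying is pointless in that case, here is a minimal
sketch of "pick the lowest loaded source, fail when none is below the
limit". It is not the ReplicationManager code: the class, the method
names, the per-source queue counts and the limit are all assumptions
made for the example.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for CommandTargetOverloadedException.
class SourcesOverloadedException extends Exception {
  SourcesOverloadedException(String message) {
    super(message);
  }
}

final class ThrottledSourceSketch {
  private ThrottledSourceSketch() { }

  /**
   * Returns the source with the fewest queued commands among those that
   * are still below `limit`, or throws if every source is at the limit.
   */
  static String pickSource(Map<String, Integer> queuedBySource, int limit)
      throws SourcesOverloadedException {
    return queuedBySource.entrySet().stream()
        .filter(e -> e.getValue() < limit)
        .min(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey)
        .orElseThrow(() ->
            new SourcesOverloadedException("all sources are overloaded"));
  }

  /**
   * Sends `copies` replication commands one at a time and records the
   * chosen source of each in `sent`. The first failure ends the loop:
   * because pickSource already considered every source, a failure means
   * no other source could accept the command either.
   */
  static void sendCopies(Map<String, Integer> queuedBySource, int limit,
      int copies, List<String> sent) throws SourcesOverloadedException {
    for (int i = 0; i < copies; i++) {
      String source = pickSource(queuedBySource, limit);
      queuedBySource.merge(source, 1, Integer::sum); // one more queued
      sent.add(source);
    }
  }
}
```

With two sources holding 1 and 2 queued commands and a limit of 2, the
first copy goes to the less loaded source and the second attempt throws,
leaving one command sent.
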
> For MisReplicationHandler, which is currently shared between EC and Ratis
> (HDDS-8109 may change this), I believe it could run into this problem with
> EC: it may need to make a new copy of 2 EC indexes while 1 of the source
> nodes is overloaded and the other is not. It would be better not to fail
> completely if the first is overloaded.
> For Ratis mis-replication, once HDDS-8109 allows any replica to be copied,
> it should behave like the RatisUnderReplicationHandler.
> For ECUnderReplicationHandler, there are multiple stages for processing and
> potential for partial success.
> We should review both ECUnderReplicationHandler and EC MisReplication
> handling (after HDDS-8109) to handle overloaded exceptions and throw
> exceptions on partial success.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)