[ 
https://issues.apache.org/jira/browse/HDDS-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Doroszlai updated HDDS-8335:
-----------------------------------
    Fix Version/s: 1.4.0
       Resolution: Implemented
           Status: Resolved  (was: Patch Available)

> ReplicationManager: EC Mis and Under replication handler should handle 
> overloaded exceptions
> --------------------------------------------------------------------------------------------
>
>                 Key: HDDS-8335
>                 URL: https://issues.apache.org/jira/browse/HDDS-8335
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: SCM
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.4.0
>
>
> In RatisOverReplicationHandler and ECOverReplicationHandler, a container can 
> be over replicated by several replicas, and the deletes are done in two 
> stages:
> 1. First, unhealthy replicas are removed.
> 2. Then, healthy replicas are removed.
> While removing any replica, the handler could get a 
> CommandTargetOverloadedException, but rather than throwing that exception 
> immediately, it continues trying other replicas. At the end, if it has not 
> deleted enough replicas, it re-throws the first 
> CommandTargetOverloadedException so the over replication is re-queued on the 
> over replication queue.
> Other handlers also have multiple stages, but in the event of an error like 
> CommandTargetOverloadedException, they give up immediately.
> RatisOverReplicationHandler works as expected. So does 
> ECOverReplicationHandler.
> For RatisUnderReplicationHandler, the command target is the source, and 
> RM.sendThrottledReplicationCommand() picks the least loaded source. It is 
> possible to send one command and then fail to send the second, but there is 
> no point in retrying, because that failure means all the sources are 
> overloaded. As things stand, it sends what it can and then throws an 
> exception, so that is fine.
> For MisReplicationHandler, which is currently shared by EC and Ratis 
> (HDDS-8109 may change this), I believe it could run into this problem with 
> EC: it may need to make new copies of 2 EC indexes while 1 of the nodes is 
> overloaded and the other is not. It would be better not to fail completely 
> if the first is overloaded.
> For Ratis mis-replication, as we can copy any replica after HDDS-8109, it 
> should behave like RatisUnderReplicationHandler.
> For ECUnderReplicationHandler, there are multiple stages for processing and 
> potential for partial success.
> We should review both ECUnderReplicationHandler and EC MisReplication 
> handling (after HDDS-8109) to handle overloaded exceptions and throw 
> exceptions on partial success.
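
The behaviour described above for the over-replication handlers, which this issue proposes to review for the EC under- and mis-replication paths, can be sketched roughly as follows. This is a minimal illustration, not Ozone code: OverloadedException stands in for CommandTargetOverloadedException, and CommandSender, sendToAll and the dn1/dn2 node names are hypothetical names invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these types are the real Ozone classes.
class DeferredOverloadSketch {

  /** Stand-in for Ozone's CommandTargetOverloadedException. */
  static class OverloadedException extends Exception {
    OverloadedException(String message) {
      super(message);
    }
  }

  /** Hypothetical sender; the real handlers go through ReplicationManager. */
  interface CommandSender {
    void send(String target) throws OverloadedException;
  }

  /**
   * Tries each target in turn and remembers the first overload failure
   * instead of giving up immediately. The failure is rethrown only at the
   * end, and only if fewer than `required` commands were sent, so the
   * caller can re-queue the container after a partial success.
   */
  static List<String> sendToAll(List<String> targets, int required,
      CommandSender sender) throws OverloadedException {
    List<String> sent = new ArrayList<>();
    OverloadedException firstFailure = null;
    for (String target : targets) {
      if (sent.size() >= required) {
        break; // enough commands sent, nothing more to do
      }
      try {
        sender.send(target);
        sent.add(target);
      } catch (OverloadedException e) {
        if (firstFailure == null) {
          firstFailure = e; // keep the first failure, try the next target
        }
      }
    }
    if (sent.size() < required && firstFailure != null) {
      throw firstFailure;
    }
    return sent;
  }

  public static void main(String[] args) {
    List<String> delivered = new ArrayList<>();
    CommandSender sender = target -> {
      if (target.equals("dn1")) {
        throw new OverloadedException(target + " is overloaded");
      }
      delivered.add(target);
    };
    try {
      sendToAll(List.of("dn1", "dn2"), 2, sender);
    } catch (OverloadedException e) {
      // dn2 was still sent its command even though dn1 was overloaded.
      System.out.println("Partial success, re-queue: sent " + delivered
          + ", first failure: " + e.getMessage());
    }
  }
}
```

In this sketch an overloaded first node does not prevent the command for the second node from being sent, and the exception still reaches the caller so the remaining work can be retried later.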



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
