[ 
https://issues.apache.org/jira/browse/HDDS-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985106#comment-16985106
 ] 

Stephen O'Donnell commented on HDDS-2607:
-----------------------------------------

The NodeStateManager is responsible for firing a "dead node" event, but it 
currently only does this if the node is "IN_SERVICE". It does not fire the 
event if the node is DECOMMISSIONING, DECOMMISSIONED, ENTERING_MAINTENANCE or 
IN_MAINTENANCE.

As part of this Jira we need to fix this, as the only time a dead node should 
not have the dead node event fired is when it is IN_MAINTENANCE. At other 
times, a "dead node event" should clear the node's container replicas as usual. 
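
As a minimal sketch of the rule described above (not the actual 
NodeStateManager code): the enum below only mirrors the state names used in 
this comment, and the class and method names are hypothetical.

```java
// Hypothetical sketch: decides whether a dead node event should be fired
// for a node, based only on its operational state.
class DeadNodeEventPolicy {

  // Assumed mirror of the operational states named in this comment.
  enum OperationalState {
    IN_SERVICE,
    DECOMMISSIONING,
    DECOMMISSIONED,
    ENTERING_MAINTENANCE,
    IN_MAINTENANCE
  }

  // Proposed behaviour: fire for every state except IN_MAINTENANCE.
  // (The current behaviour described above fires only for IN_SERVICE.)
  static boolean shouldFireDeadNodeEvent(OperationalState state) {
    return state != OperationalState.IN_MAINTENANCE;
  }

  public static void main(String[] args) {
    for (OperationalState s : OperationalState.values()) {
      System.out.println(s + " -> fire dead node event: "
          + shouldFireDeadNodeEvent(s));
    }
  }
}
```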

It is also important that the DatanodeAdminMonitor aborts its workflow for any 
node which goes dead while decommission or maintenance is in progress (unless 
it has already reached IN_MAINTENANCE), for several reasons:

1. The dead node event will delete all the container replicas for the node, so 
it's impossible to track them for replication correctly.
2. This could result in a node which has not completed decom / maintenance 
getting marked as completed.
3. If the node returns to service, the state of the cluster may have changed 
and new pipelines may need to be created, etc., meaning the admin workflow 
needs to restart.

In this Jira, we should therefore consider:

1. Resetting the node's OperationalState to "IN_SERVICE" as part of the dead 
node handling.
2. Ensuring the dead node event gets triggered for all operational states 
except IN_MAINTENANCE.
3. Aborting the maintenance workflow if the health of any node becomes "DEAD".
4. How to trigger a dead node event for a node which is dead and was 
IN_MAINTENANCE once maintenance has ended, either automatically or manually.
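
One way the four points above could fit together is sketched below. This is 
only an illustration under stated assumptions, not the SCM implementation: the 
Node class, its fields, and both handler methods are hypothetical, and 
resetting to IN_SERVICE is just the option floated in point 1.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the dead node handling discussed in this comment.
class DeadNodeHandlingSketch {

  enum OpState {
    IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED,
    ENTERING_MAINTENANCE, IN_MAINTENANCE
  }

  enum Health { HEALTHY, DEAD }

  // Assumed minimal view of a datanode as tracked by SCM.
  static class Node {
    OpState opState = OpState.IN_SERVICE;
    Health health = Health.HEALTHY;
    boolean adminWorkflowActive = false;
    final Set<String> replicas = new HashSet<>();
  }

  // Called when a node is detected as dead.
  static void onNodeDead(Node node) {
    node.health = Health.DEAD;
    if (node.opState == OpState.IN_MAINTENANCE) {
      // Keep the replicas so containers are not marked under-replicated.
      return;
    }
    node.adminWorkflowActive = false;    // point 3: abort the workflow
    node.opState = OpState.IN_SERVICE;   // point 1: reset operational state
    node.replicas.clear();               // point 2: usual dead node cleanup
  }

  // Called when maintenance ends, either automatically or manually.
  static void onMaintenanceEnd(Node node) {
    if (node.opState != OpState.IN_MAINTENANCE) {
      return;
    }
    node.opState = OpState.IN_SERVICE;
    if (node.health == Health.DEAD) {
      // Point 4: run the cleanup that was skipped while in maintenance.
      node.replicas.clear();
    }
  }

  public static void main(String[] args) {
    Node node = new Node();
    node.opState = OpState.IN_MAINTENANCE;
    node.replicas.add("container-1");

    onNodeDead(node);
    System.out.println("replicas while in maintenance: " + node.replicas.size());

    onMaintenanceEnd(node);
    System.out.println("replicas after maintenance ends: " + node.replicas.size());
  }
}
```

In this sketch the cleanup for a dead IN_MAINTENANCE node is simply deferred 
until maintenance ends; whether that should happen through a re-fired event or 
a direct call is left open, as in point 4.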

> DeadNodeHandler should not remove replica for a dead maintenance node
> ---------------------------------------------------------------------
>
>                 Key: HDDS-2607
>                 URL: https://issues.apache.org/jira/browse/HDDS-2607
>             Project: Hadoop Distributed Data Store
>          Issue Type: Sub-task
>          Components: SCM
>    Affects Versions: 0.5.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>
> Normally, when a node goes dead, the DeadNodeHandler removes all the 
> containers and replica associated with the node from the ContainerManager.
> If a node is IN_MAINTENANCE and goes dead, then we do not want to remove its 
> replica. They should remain present in the system to prevent the container 
> being marked as under-replicated.
> We also need to consider the case where the node is dead, and then 
> maintenance expires automatically. In that case, the replica associated with 
> the node must be removed and the affected containers will become 
> under-replicated.



