[jira] [Updated] (HDDS-5090) make Decommission work under SCM HA.

Glen Geng (Jira) Sun, 11 Apr 2021 23:46:05 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Glen Geng updated HDDS-5090:
----------------------------
    Description: 
*The problem*

SCM keeps the decommission/maintenance info only in memory. If SCM is 
restarted, it relearns this info when the Datanodes re-register.

Only the leader SCM handles decommissionNodes(), recommissionNodes() and 
startMaintenanceNodes() requests, and it does not replicate this info to the 
follower SCMs. When a failover happens, the new leader SCM therefore loses 
this info, since it was held only in the memory of the previous leader SCM.

*Current status*
If an SCM is restarted, a datanode may already be in the DECOMMISSIONING, 
ENTERING_MAINTENANCE or IN_MAINTENANCE state when it re-registers. In that 
case, it needs to be added back into the monitor to track its progress.

For a registered node, the information stored in SCM is the source of truth. If 
SCM finds that the opState or opStateExpiryEpoch reported by the Datanode 
differs from what it holds in memory, it sends a 
SetNodeOperationalStateCommand to update the Datanode.
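
The reconciliation rule above can be sketched roughly as follows. This is only 
an illustration, not the SCM code: OpStateReconcileSketch, NodeStatus, 
SetOpStateCommand and reconcile() are hypothetical stand-ins, and the enum is a 
local copy of the state names used in this issue.

```java
import java.util.Optional;

// Hypothetical sketch: these types are local stand-ins, not Ozone's classes.
final class OpStateReconcileSketch {

    enum OpState {
        IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED,
        ENTERING_MAINTENANCE, IN_MAINTENANCE
    }

    // Operational state of one datanode, as held by SCM or as reported by the DN.
    static final class NodeStatus {
        final OpState opState;
        final long opStateExpiryEpoch;

        NodeStatus(OpState opState, long opStateExpiryEpoch) {
            this.opState = opState;
            this.opStateExpiryEpoch = opStateExpiryEpoch;
        }

        boolean sameAs(NodeStatus other) {
            return opState == other.opState
                && opStateExpiryEpoch == other.opStateExpiryEpoch;
        }
    }

    // Stand-in for SetNodeOperationalStateCommand: carries SCM's view to the DN.
    static final class SetOpStateCommand {
        final NodeStatus target;

        SetOpStateCommand(NodeStatus target) {
            this.target = target;
        }
    }

    // SCM is the source of truth: on any mismatch, tell the DN to adopt SCM's view.
    static Optional<SetOpStateCommand> reconcile(NodeStatus scmView, NodeStatus reported) {
        if (scmView.sameAs(reported)) {
            return Optional.empty();
        }
        return Optional.of(new SetOpStateCommand(scmView));
    }

    public static void main(String[] args) {
        NodeStatus scmView = new NodeStatus(OpState.DECOMMISSIONING, 0L);
        NodeStatus reported = new NodeStatus(OpState.IN_SERVICE, 0L);
        System.out.println(reconcile(scmView, reported).isPresent()); // true
        System.out.println(reconcile(scmView, scmView).isPresent());  // false
    }
}
```

The command always carries SCM's values, never the reported ones, which is what 
"source of truth" means here.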

*The solution*

leader SCM --hb--> DN --hb--> follower SCM

1. The leader SCM updates the PersistedOpState of the Datanode via heartbeat. 
The Datanode then updates the OpState in the follower SCMs via heartbeat.

2. When a follower SCM becomes leader, it calls continueAdminForNode for every 
datanode, so that datanodes in the DECOMMISSIONING, ENTERING_MAINTENANCE or 
IN_MAINTENANCE state are added back to the monitor.
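
The two steps above can be modelled with a small in-memory sketch. It assumes a 
simplified world: the Scm and Datanode classes, their fields, and the method 
signatures (including this continueAdminForNode) are hypothetical, and the 
heartbeat exchange is collapsed into direct method calls.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the proposed flow; not the real SCM or Datanode classes.
final class ScmHaDecommissionSketch {

    enum OpState {
        IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED,
        ENTERING_MAINTENANCE, IN_MAINTENANCE
    }

    static final class Datanode {
        final String id;
        OpState persistedOpState = OpState.IN_SERVICE;

        Datanode(String id) {
            this.id = id;
        }
    }

    static final class Scm {
        // In-memory view of each datanode's operational state.
        final Map<String, OpState> nodeOpStates = new HashMap<>();
        // Nodes currently tracked by the decommission/maintenance monitor.
        final Set<String> monitored = new TreeSet<>();

        // Leader only: record the admin request, then push it to the DN.
        // The direct field write stands in for the command sent on heartbeat.
        void decommission(Datanode dn) {
            nodeOpStates.put(dn.id, OpState.DECOMMISSIONING);
            monitored.add(dn.id);
            dn.persistedOpState = OpState.DECOMMISSIONING;
        }

        // Any SCM, leader or follower: learn the op state the DN reports.
        void onHeartbeat(Datanode dn) {
            nodeOpStates.put(dn.id, dn.persistedOpState);
        }

        // New leader: walk all known nodes and resume in-progress admin work.
        void onBecomeLeader() {
            for (Map.Entry<String, OpState> e : nodeOpStates.entrySet()) {
                continueAdminForNode(e.getKey(), e.getValue());
            }
        }

        void continueAdminForNode(String id, OpState state) {
            if (state == OpState.DECOMMISSIONING
                || state == OpState.ENTERING_MAINTENANCE
                || state == OpState.IN_MAINTENANCE) {
                monitored.add(id);
            }
        }
    }

    public static void main(String[] args) {
        Datanode dn = new Datanode("dn-1");
        Scm leader = new Scm();
        Scm follower = new Scm();

        leader.decommission(dn);    // leader SCM --hb--> DN
        follower.onHeartbeat(dn);   // DN --hb--> follower SCM
        follower.onBecomeLeader();  // failover

        System.out.println(follower.monitored); // [dn-1]
    }
}
```

If onBecomeLeader() runs before the follower has seen the heartbeat, its 
monitored set stays empty. That is the window described under Disadvantage.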

*Disadvantage*

As is the case today, if the leader SCM records the info and notifies the 
Datanode via heartbeat, but steps down before the Datanode notifies the 
follower SCM via heartbeat, that info is lost in the new leader SCM.

As discussed with [~sodonnell], we can live with the rare event of a 
decommission starting and SCM failing over before the state has made it to the 
DNs.

 

  was:
*The problem*

SCM keeps the decommission/maintenance info only in memory. If SCM is 
restarted, it relearns this info when the Datanodes re-register.

Only the leader SCM handles decommissionNodes(), recommissionNodes() and 
startMaintenanceNodes() requests, and it does not replicate this info to the 
follower SCMs. When a failover happens, the new leader SCM therefore loses 
this info, since it was held only in the memory of the previous leader SCM.

*Current status*
If an SCM is restarted, a datanode may already be in the DECOMMISSIONING, 
ENTERING_MAINTENANCE or IN_MAINTENANCE state when it re-registers. In that 
case, it needs to be added back into the monitor to track its progress.

For a registered node, the information stored in SCM is the source of truth. If 
SCM finds that the opState or opStateExpiryEpoch reported by the Datanode 
differs from what it holds in memory, it sends a 
SetNodeOperationalStateCommand to update the Datanode.

*The solution*

leader SCM --hb--> DN --hb--> follower SCM

1. The leader SCM updates the PersistedOpState of the Datanode via heartbeat. 
The Datanode then updates the OpState in the follower SCMs via heartbeat.

2. When a follower SCM becomes leader, it calls continueAdminForNode for every 
datanode, so that datanodes in the DECOMMISSIONING, ENTERING_MAINTENANCE or 
IN_MAINTENANCE state are added back to the monitor.

*Disadvantage*

As is the case today, if the leader SCM records the info and notifies the 
Datanode via heartbeat, but steps down before the Datanode notifies the 
follower SCM via heartbeat, that info is lost in the new leader SCM.

 


> make Decommission work under SCM HA.
> ------------------------------------
>
>                 Key: HDDS-5090
>                 URL: https://issues.apache.org/jira/browse/HDDS-5090
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: Glen Geng
>            Assignee: Glen Geng
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org
For additional commands, e-mail: issues-h...@ozone.apache.org
