[
https://issues.apache.org/jira/browse/HDDS-7985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719184#comment-17719184
]
Nandakumar commented on HDDS-7985:
----------------------------------
[~NeilJoshi], in case of a disk failure on an SCM, we should decommission that
SCM and then bootstrap it again. Cleaning up the storage location without
decommissioning the node will cause issues because the old details of that SCM
are still present in the cluster.
> [SCM HA] On SCM Disk failure recovery causes Datanode Failure on startup
> -------------------------------------------------------------------------
>
> Key: HDDS-7985
> URL: https://issues.apache.org/jira/browse/HDDS-7985
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Neil Joshi
> Assignee: Neil Joshi
> Priority: Major
> Labels: pull-request-available
>
> Recovery from an SCM disk failure when no backup is available requires:
> * Cleaning the _ozone.scm.db.dirs_ and _ozone.metadata.dirs_ locations
> and bootstrapping the SCM.
> Whether or not the SCM is primordial, an error occurs when starting a
> datanode after the SCM has been recovered from a failed disk with no backup.
>
> Datanodes brought up after SCM disk failure recovery are unable to start
> because of a CA certificate error stating that the number of certificates
> received from the SCM is greater than the number expected:
> {code:java}
> ozonesecure-ha-datanode1-1 | 2023-02-17 00:46:40 INFO HAUtils:457 -
> Expected CA list size 4, where as received CA List size 5.{code}
> In this case, listing the certificates stored by the SCM shows a total of
> 5 SCM certificates after SCM2 recovers from the disk failure:
>
> {code:java}
> [email protected]
> [email protected]
> [email protected]
> [email protected]
> [email protected]
>
> {code}
> There appear to be 2 entries for SCM2 (the node recovered from the disk failure).
>
> bash-4.2$ ozone admin cert list
> {code:java}
> Total 12 valid certificates:
> SerialNumber Valid From Expiry
> Subject
>
> 1 Fri Feb 17 00:00:00 UTC 2023 Mon Mar 27 00:00:00 UTC 2028
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, [email protected]
> 10760186198072 Fri Feb 17 00:00:00 UTC 2023 Mon Mar 27 00:00:00 UTC 2028
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, [email protected]
> 10779888473070 Fri Feb 17 00:00:00 UTC 2023 Sat Feb 17 00:00:00 UTC 2024
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, CN=recon@recon
> 10780166036417 Fri Feb 17 00:00:00 UTC 2023 Mon Mar 27 00:00:00 UTC 2028
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=f99f1a81-7cce-44c9-a09b-9f7bbc48b6ac, [email protected]
> 10788394717480 Fri Feb 17 00:00:00 UTC 2023 Mon Mar 27 00:00:00 UTC 2028
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=598be6bc-7d86-4cab-84dc-668a162a7ec2, [email protected]
> 10800769855768 Fri Feb 17 00:00:00 UTC 2023 Sat Feb 17 00:00:00 UTC 2024
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, CN=dn@bd3138308a3f
> 10801305457014 Fri Feb 17 00:00:00 UTC 2023 Sat Feb 17 00:00:00 UTC 2024
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, CN=dn@e4795cc77124
> 10801871334038 Fri Feb 17 00:00:00 UTC 2023 Sat Feb 17 00:00:00 UTC 2024
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, CN=dn@3eb28ff965a1
> 10803980992569 Fri Feb 17 00:00:00 UTC 2023 Sat Feb 17 00:00:00 UTC 2024
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, CN=om2
> 10804543987939 Fri Feb 17 00:00:00 UTC 2023 Sat Feb 17 00:00:00 UTC 2024
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, CN=om3
> 10806118720884 Fri Feb 17 00:00:00 UTC 2023 Sat Feb 17 00:00:00 UTC 2024
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, CN=om1
> 10932809284268 Fri Feb 17 00:00:00 UTC 2023 Mon Mar 27 00:00:00 UTC 2028
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=b4a175f3-c6a4-47fd-bcc5-c081b03de8c7, [email protected] {code}
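The size mismatch in the description above can be illustrated with a minimal sketch. It is not the HAUtils implementation. It assumes the expected CA list size is one root CA plus one certificate per SCM, which matches the "4" in the log for a three-SCM cluster. The function name `check_ca_list`, the `root`/`sub` role labels, and the `scm1`..`scm3` host labels are hypothetical and stand in for the certificate subjects listed above.

```python
from collections import Counter


def check_ca_list(received, scm_count):
    """Return (expected size, received size, duplicated entries).

    Assumption: the expected size is one root CA plus one certificate
    per SCM. Each entry in `received` is a hypothetical (role, host) pair.
    """
    expected = scm_count + 1
    counts = Counter(received)
    # An entry seen more than once suggests a stale certificate that was
    # left behind for a node that was bootstrapped again.
    duplicates = sorted(entry for entry, n in counts.items() if n > 1)
    return expected, len(received), duplicates


# Hypothetical data shaped like the five SCM certificates listed above.
received = [
    ("root", "scm1"),
    ("sub", "scm1"),
    ("sub", "scm2"),
    ("sub", "scm3"),
    ("sub", "scm2"),  # second entry for the recovered node
]

expected, got, duplicates = check_ca_list(received, scm_count=3)
print(f"expected={expected} received={got} duplicates={duplicates}")
```

With this data the received size exceeds the expected size by one, and the only duplicated entry is the one for the recovered SCM, which is consistent with the old details of that SCM still being present in the cluster.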