István Fajth created HDDS-10817:
-----------------------------------

             Summary: Non-HA SCM node cannot start after upgrading to 1.4 or 
current master
                 Key: HDDS-10817
                 URL: https://issues.apache.org/jira/browse/HDDS-10817
             Project: Apache Ozone
          Issue Type: Bug
            Reporter: István Fajth
            Assignee: István Fajth


It is unclear exactly which commit caused the problem, but we have observed two 
things that cause trouble.

One is [this 
condition|https://github.com/apache/ozone/blob/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java#L590]
 which prevents the initialization of the certificate client, and leads to an 
NPE later on in 
[initializeCAnSecurityProtocol|https://github.com/apache/ozone/blob/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java#L898].

Even if we resolve that NPE, the certificateClient remains uninitialized, so 
other problems arise later when the system tries to access it. The initial 
condition therefore needs to be changed, or fulfilled somehow during 
initialization or upgrade.

Once the certificate client is initialized, another problem appears, this time 
with SecretKeyManager, which might miss its initialization due to [this 
check|https://github.com/apache/ozone/blob/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java#L418],
 where ratisEnabled evaluates to true when the {ozone.scm.ratis.enable} config 
uses its default value.

The problem is that the scmInit method overwrites the VERSION file and sets the 
SCM_HA flag to true if the VERSION file does not have the SCM_HA flag set 
[here|https://github.com/apache/ozone/blob/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java#L1347].
 So if one starts the only SCM with --init after the upgrade, and then starts 
it without arguments, this issue appears once the VERSION file has been fixed 
for the first issue.
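
To make the VERSION file behavior concrete, here is a minimal toy model of it. 
This is not the Ozone code: the class name, the method name initAsDescribed, 
the property key "scmHA" and the "clusterID" entry are all hypothetical 
stand-ins, and java.util.Properties only approximates the real VERSION file. 
The sketch shows only the rule described above: a missing flag is written as 
true, while an existing value is left alone.

```java
import java.util.Properties;

// Toy model only; NOT the actual StorageContainerManager logic.
class VersionFileSketch {
    // Hypothetical key standing in for the SCM_HA flag in the VERSION file.
    static final String SCM_HA_KEY = "scmHA";

    // Models the described behavior: if the flag is absent from the
    // VERSION file, init rewrites the file with the flag set to true.
    static Properties initAsDescribed(Properties versionFile) {
        Properties updated = new Properties();
        updated.putAll(versionFile);
        if (updated.getProperty(SCM_HA_KEY) == null) {
            updated.setProperty(SCM_HA_KEY, "true");
        }
        return updated;
    }

    public static void main(String[] args) {
        // A pre-upgrade, non-HA VERSION file that has no HA flag at all.
        Properties preUpgrade = new Properties();
        preUpgrade.setProperty("clusterID", "CID-example");

        Properties afterInit = initAsDescribed(preUpgrade);
        System.out.println("flag before init: " + preUpgrade.getProperty(SCM_HA_KEY));
        System.out.println("flag after init:  " + afterInit.getProperty(SCM_HA_KEY));
    }
}
```

In this model, a non-HA node whose VERSION file predates the flag ends up 
marked as HA after a single init run, which is the state the subsequent plain 
start then trips over.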

So, to prevent both issues (and preserve the idempotency of the --init startup 
option), two changes need to happen during this upgrade.


