[
https://issues.apache.org/jira/browse/HDDS-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17745841#comment-17745841
]
István Fajth commented on HDDS-9050:
------------------------------------
This issue came up in some of our internal tests, but we failed to reproduce it
despite multiple attempts, so at the moment we do not have a root cause.
The issue itself seems to be related to running the tests with a Java 17
runtime; at least, we have seen it happen 5 times, in tests that were running
with Java 17.
Also, the issue seemed to happen at the first start of the Primordial SCM,
within the SCM itself; other roles in the meantime report that they are trying
to send a CSR to the SCM and get back an error. Based on our logs, after this
failing "scm --init" startup, a second start of the SCM node marked as
primordial with --init succeeded, which suggests that the issue is
intermittent.
We have not identified any further commonalities or suspicious behaviour
beyond these.
Looking at the code:
- the exception is thrown because we call the get(key) method of a Hashtable
with a key that is null.
- the null key comes from the key set of a HashMap that was created by
Collectors.toMap from a stream of SimpleEntries.
- these SimpleEntries are all initialized with a non-null key in every code
branch that can reach the line the exception is coming from.
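To make the mechanics above concrete, here is a minimal, self-contained sketch. It is not the Ozone code: the class name NullKeyLookupDemo, its methods, and the use of String keys are assumptions made purely for illustration. It shows that a HashMap built by Collectors.toMap accepts a null key without complaint, and that the failure only surfaces later, when that key is passed to Hashtable.get.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Hashtable;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical demo class; String keys stand in for the real extension identifiers.
final class NullKeyLookupDemo {

  // Collects SimpleEntries into a HashMap via Collectors.toMap.
  // HashMap permits a null key, so a null key passes through silently.
  static Map<String, String> collect(Stream<SimpleEntry<String, String>> entries) {
    return entries.collect(
        Collectors.toMap(SimpleEntry::getKey, SimpleEntry::getValue));
  }

  // Returns true if looking the key up in the Hashtable throws a
  // NullPointerException. Hashtable.get calls key.hashCode(), so a null key fails.
  static boolean lookupThrowsNpe(Hashtable<String, String> table, String key) {
    try {
      table.get(key);
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    Map<String, String> collected = collect(Stream.of(
        new SimpleEntry<String, String>("keyUsage", "present"),
        new SimpleEntry<String, String>(null, "unknown")));

    Hashtable<String, String> table = new Hashtable<>();
    table.put("keyUsage", "value");

    // Iterating the HashMap's key set reaches the null key, and only then
    // does the Hashtable lookup fail.
    for (String key : collected.keySet()) {
      System.out.println(key + " -> NPE on lookup: " + lookupThrowsNpe(table, key));
    }
  }
}
```

The point of the sketch is only that the null key has to be introduced before the lookup; it does not explain how a null key could appear when every entry is constructed with a non-null key.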
Based on this I have absolutely no idea what is causing the issue, but in the
meantime I am not convinced we should in any way hide this null pointer
dereference behind a null check: we do not know which certificate extension
goes missing in this case (if any), so we cannot be sure we would not later
run into an even harder to trace error caused by an extension missing from our
certificates.
Let me propose some logging around the place where we have seen the exception,
so that we might be able to capture the missing piece if we or someone else
runs into this again.
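As a rough illustration of what such logging could look like, the sketch below reports a null key together with the value it maps to and the full key set, without suppressing the later failure. Everything in it is an assumption: the class ExtensionKeyDiagnostics and the method describeNullKeys are hypothetical names, and java.util.logging is used only to keep the example self-contained; it is not the logging setup of the actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical diagnostic helper; not part of Ozone.
final class ExtensionKeyDiagnostics {
  private static final Logger LOG =
      Logger.getLogger(ExtensionKeyDiagnostics.class.getName());

  // Inspects the collected map before it is used for lookups. If a null key
  // is present, logs the value it maps to plus the whole key set, so the
  // missing piece can be identified from the logs. It deliberately does not
  // remove the null key or otherwise hide the problem.
  static <K, V> List<String> describeNullKeys(Map<K, V> collected) {
    List<String> findings = new ArrayList<>();
    for (Map.Entry<K, V> entry : collected.entrySet()) {
      if (entry.getKey() == null) {
        findings.add("null key mapped to value: " + entry.getValue());
      }
    }
    if (!findings.isEmpty()) {
      LOG.warning("Map contains a null key: " + findings
          + "; all keys: " + collected.keySet());
    }
    return findings;
  }
}
```

Calling describeNullKeys(map) right before the lookup loop would leave the existing behaviour unchanged while recording which value ended up without a key.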
> Ozone fails to start because certificate is missing
> ---------------------------------------------------
>
> Key: HDDS-9050
> URL: https://issues.apache.org/jira/browse/HDDS-9050
> Project: Apache Ozone
> Issue Type: Bug
> Components: SCM
> Reporter: Devesh Kumar Singh
> Assignee: Devesh Kumar Singh
> Priority: Major
> Labels: pull-request-available
>
> Ozone fails to start because certificate is missing
> INFO
> org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient:
> Certificate client init case: 6
> INFO
> org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient:
> Found private and public key but certificate is missing.
> INFO org.apache.hadoop.hdds.scm.ha.HASecurityUtils: Init response: RECOVER
> ERROR org.apache.hadoop.hdds.scm.ha.HASecurityUtils: SCM security
> initialization failed. SCM certificate is missing.
>
> 172.27.75.14:41113
> java.lang.NullPointerException: Cannot invoke "Object.hashCode()" because
> "key" is null at java.base/java.util.Hashtable.get(Hashtable.java:381)
> at org.bouncycastle.asn1.x509.Extensions.getExtension(Unknown Source)
> at
> org.apache.hadoop.hdds.security.x509.certificate.authority.DefaultApprover.sign(DefaultApprover.java:149)
> at
> org.apache.hadoop.hdds.security.x509.certificate.authority.DefaultCAServer.signAndStoreCertificate(DefaultCAServer.java:289)
> at
> org.apache.hadoop.hdds.security.x509.certificate.authority.DefaultCAServer.requestCertificate(DefaultCAServer.java:257)
> at
> org.apache.hadoop.hdds.security.x509.certificate.authority.DefaultCAServer.requestCertificate(DefaultCAServer.java:312)
> at
> org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer.getEncodedCertToString(SCMSecurityProtocolServer.java:291)
> at
> org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer.getDataNodeCertificate(SCMSecurityProtocolServer.java:189)
> at
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getDataNodeCertificate(SCMSecurityProtocolServerSideTranslatorPB.java:202)
> at
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:117)
> at
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
> at
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:94)
> at
> org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:16080)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
> at
> java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)