[ 
https://issues.apache.org/jira/browse/HDDS-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17745841#comment-17745841
 ] 

István Fajth commented on HDDS-9050:
------------------------------------

This issue came up in some of our internal tests, but after multiple attempts 
we failed to reproduce it, and at the moment we do not have a root cause.
The issue seems to be related to running the tests with a Java 17 runtime; at 
least, we have seen it happen 5 times, in tests that were running with 
Java 17.
Also, the issue seemed to happen at the first start of the primordial SCM, 
within the SCM itself; other roles in the meantime report that they are trying 
to send a CSR to the SCM and get back an error. Based on our logs, after this 
failing "scm --init" startup, a second start of the SCM node marked as 
primordial with --init succeeded, which suggests that the issue is 
intermittent.
We have not identified any further commonalities or suspicious behaviour 
beyond these.

Looking at the code:
- the exception is thrown because we call the get(key) method on a Hashtable 
with a key that is null.
- the null key comes from the keyset of a HashMap that was created by 
Collectors.toMap from a stream of SimpleEntries
- these SimpleEntries are all initialized with a non-null key in every code 
branch that can reach the line where the exception is thrown
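
To illustrate the mechanics described in the list above, here is a minimal, 
self-contained sketch. It is not the Ozone or BouncyCastle code: the class 
name, the method names and the String keys are hypothetical stand-ins. It only 
shows that a HashMap built by Collectors.toMap accepts a null key if an entry 
carries one, and that Hashtable.get then throws a NullPointerException for that 
key. It does not explain how a null key could appear in the real code path.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Hashtable;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in, not the DefaultApprover code.
final class NullKeyLookupSketch {

  // Builds a HashMap via Collectors.toMap from a stream of SimpleEntries.
  // A null key is accepted here; only a null value would be rejected.
  static Map<String, String> collect(Stream<SimpleEntry<String, String>> entries) {
    return entries.collect(
        Collectors.toMap(SimpleEntry::getKey, SimpleEntry::getValue));
  }

  // Returns true if Hashtable.get(key) throws a NullPointerException.
  static boolean lookupThrowsNpe(Hashtable<String, String> table, String key) {
    try {
      table.get(key);
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    // Stand-in for the extensions held in a Hashtable.
    Hashtable<String, String> extensions = new Hashtable<>();
    extensions.put("keyUsage", "placeholder");

    // One entry deliberately carries a null key (assumption for the demo).
    Map<String, String> requested = collect(Stream.of(
        new SimpleEntry<String, String>("keyUsage", "a"),
        new SimpleEntry<String, String>(null, "b")));

    for (String key : requested.keySet()) {
      System.out.println(
          key + " -> throws NPE: " + lookupThrowsNpe(extensions, key));
    }
  }
}
```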

Based on this I have absolutely no idea what is causing the issue, but in the 
meantime I am not convinced we should in any way hide this null pointer 
dereference with a null check, as we do not know which certificate extension 
goes missing in this case (if any). Without knowing that, we cannot be sure 
that we will not later run into an error that is harder to trace down, caused 
by an extension missing from our certificates.

Let me propose some logging around the place where we have seen the exception, 
so that we might be able to capture the missing piece if we or someone else 
run into this again.
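
As a rough sketch of what such logging could look like (an assumption, not the 
actual patch): the helper below logs the requested entries and the available 
keys when it sees a null key, and then still performs the lookup, so the 
NullPointerException is not hidden. The class name, the method name, the 
generic Hashtable and the use of java.util.logging are all hypothetical; the 
real code works with BouncyCastle extensions and Ozone's own logger.

```java
import java.util.Hashtable;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical stand-in, not the DefaultApprover code.
final class ExtensionLookupLoggingSketch {
  private static final Logger LOG =
      Logger.getLogger(ExtensionLookupLoggingSketch.class.getName());

  // Logs diagnostic context when the key is null, then performs the lookup
  // unchanged. For a null key, Hashtable.get still throws the NPE.
  static <K, V> V getWithDiagnostics(
      Hashtable<K, V> available, Map<K, ?> requested, K key) {
    if (key == null) {
      LOG.severe(() -> "Null key among requested entries " + requested
          + "; available keys: " + available.keySet());
    }
    return available.get(key);
  }
}
```

The failure stays visible, but the log line records which entries were 
requested at the time, which is the missing piece in the current reports.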

> Ozone fails to start because certificate is missing
> ---------------------------------------------------
>
>                 Key: HDDS-9050
>                 URL: https://issues.apache.org/jira/browse/HDDS-9050
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM
>            Reporter: Devesh Kumar Singh
>            Assignee: Devesh Kumar Singh
>            Priority: Major
>              Labels: pull-request-available
>
> Ozone fails to start because certificate is missing
> INFO 
> org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient: 
> Certificate client init case: 6
> INFO 
> org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient: 
> Found private and public key but certificate is missing.
> INFO org.apache.hadoop.hdds.scm.ha.HASecurityUtils: Init response: RECOVER
> ERROR org.apache.hadoop.hdds.scm.ha.HASecurityUtils: SCM security 
> initialization failed. SCM certificate is missing.
>  
> 172.27.75.14:41113
> java.lang.NullPointerException: Cannot invoke "Object.hashCode()" because 
> "key" is null
>       at java.base/java.util.Hashtable.get(Hashtable.java:381)
>       at org.bouncycastle.asn1.x509.Extensions.getExtension(Unknown Source)
>       at 
> org.apache.hadoop.hdds.security.x509.certificate.authority.DefaultApprover.sign(DefaultApprover.java:149)
>       at 
> org.apache.hadoop.hdds.security.x509.certificate.authority.DefaultCAServer.signAndStoreCertificate(DefaultCAServer.java:289)
>       at 
> org.apache.hadoop.hdds.security.x509.certificate.authority.DefaultCAServer.requestCertificate(DefaultCAServer.java:257)
>       at 
> org.apache.hadoop.hdds.security.x509.certificate.authority.DefaultCAServer.requestCertificate(DefaultCAServer.java:312)
>       at 
> org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer.getEncodedCertToString(SCMSecurityProtocolServer.java:291)
>       at 
> org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer.getDataNodeCertificate(SCMSecurityProtocolServer.java:189)
>       at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getDataNodeCertificate(SCMSecurityProtocolServerSideTranslatorPB.java:202)
>       at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:117)
>       at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>       at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:94)
>       at 
> org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:16080)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
>       at 
> java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
>       at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
