sarvekshayr opened a new pull request, #8765: URL: https://github.com/apache/ozone/pull/8765
## What changes were proposed in this pull request?

The method `HAUtils.getCAListWithRetry()` currently uses `RetryPolicies.retryForeverWithFixedSleep()`, which retries indefinitely on any failure. When authentication is not set up (i.e., `kinit` has not been run), the call fails with an `AccessControlException`. Because the method retries forever without handling this specific exception, commands like `ozone admin container create` appear to hang indefinitely.

This change makes the retry policy detect `AccessControlException` and fail fast.

## What is the link to the Apache JIRA

[HDDS-13405](https://issues.apache.org/jira/browse/HDDS-13405)

## How was this patch tested?

Before the fix
```
bash-5.1$ OZONE_LOGLEVEL=INFO ozone admin container create
2025-07-08 10:44:48,181 [main] INFO proxy.SCMContainerLocationFailoverProxyProvider: Created fail-over proxy for protocol StorageContainerLocationProtocolPB with 3 nodes: [nodeId=scm2,nodeAddress=scm2.org/172.25.0.117:9860, nodeId=scm1,nodeAddress=scm1.org/172.25.0.116:9860, nodeId=scm3,nodeAddress=scm3.org/172.25.0.118:9860]
2025-07-08 10:44:48,229 [main] INFO proxy.SecretKeyProtocolFailoverProxyProvider: Created fail-over proxy for protocol SecretKeyProtocolScmPB with 3 nodes: [nodeId=scm2,nodeAddress=scm2.org/172.25.0.117:9961, nodeId=scm1,nodeAddress=scm1.org/172.25.0.116:9961, nodeId=scm3,nodeAddress=scm3.org/172.25.0.118:9961]
2025-07-08 10:44:48,402 [main] INFO proxy.SCMSecurityProtocolFailoverProxyProvider: Created fail-over proxy for protocol SCMSecurityProtocolPB with 3 nodes: [nodeId=scm2,nodeAddress=scm2.org/172.25.0.117:9961, nodeId=scm1,nodeAddress=scm1.org/172.25.0.116:9961, nodeId=scm3,nodeAddress=scm3.org/172.25.0.118:9961]
2025-07-08 10:44:48,470 [main] WARN ipc.Client: Exception encountered while connecting to the server scm1.org/172.25.0.116:9961
org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]
	at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:179)
	at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:399)
	at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:578)
	at org.apache.hadoop.ipc.Client$Connection.access$2100(Client.java:364)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:799)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:795)
	at java.base/java.security.AccessController.doPrivileged(AccessController.java:714)
	at java.base/javax.security.auth.Subject.doAs(Subject.java:525)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:795)
	at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:364)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1649)
	at org.apache.hadoop.ipc.Client.call(Client.java:1473)
	at org.apache.hadoop.ipc.Client.call(Client.java:1426)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:250)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:132)
	at jdk.proxy2/jdk.proxy2.$Proxy22.submitRequest(Unknown Source)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:437)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:170)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:162)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:100)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:366)
	at jdk.proxy2/jdk.proxy2.$Proxy22.submitRequest(Unknown Source)
	at org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.submitRequest(SCMSecurityProtocolClientSideTranslatorPB.java:93)
	at org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.listCACertificate(SCMSecurityProtocolClientSideTranslatorPB.java:363)
	at org.apache.hadoop.hdds.utils.HAUtils.waitForCACerts(HAUtils.java:374)
	at org.apache.hadoop.hdds.utils.HAUtils.lambda$buildCAX509List$3(HAUtils.java:401)
	at org.apache.hadoop.hdds.utils.RetriableTask.call(RetriableTask.java:55)
	at org.apache.hadoop.hdds.utils.HAUtils.getCAListWithRetry(HAUtils.java:360)
	at org.apache.hadoop.hdds.utils.HAUtils.buildCAX509List(HAUtils.java:401)
	at org.apache.hadoop.hdds.scm.cli.ContainerOperationClient.lambda$newXCeiverClientManager$0(ContainerOperationClient.java:123)
	at org.apache.hadoop.hdds.scm.client.ClientTrustManager.loadCerts(ClientTrustManager.java:148)
	at org.apache.hadoop.hdds.scm.client.ClientTrustManager.<init>(ClientTrustManager.java:110)
	at org.apache.hadoop.hdds.scm.cli.ContainerOperationClient.newXCeiverClientManager(ContainerOperationClient.java:125)
	at org.apache.hadoop.hdds.scm.cli.ContainerOperationClient.getXceiverClientManager(ContainerOperationClient.java:91)
	at org.apache.hadoop.hdds.scm.cli.ContainerOperationClient.createContainer(ContainerOperationClient.java:212)
	at org.apache.hadoop.hdds.scm.cli.container.CreateSubcommand.execute(CreateSubcommand.java:59)
	at org.apache.hadoop.hdds.scm.cli.ScmSubcommand.call(ScmSubcommand.java:39)
	at org.apache.hadoop.hdds.scm.cli.ScmSubcommand.call(ScmSubcommand.java:29)
	at picocli.CommandLine.executeUserObject(CommandLine.java:2031)
	at picocli.CommandLine.access$1500(CommandLine.java:148)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2469)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2461)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2423)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
	at picocli.CommandLine$RunLast.execute(CommandLine.java:2425)
	at org.apache.hadoop.ozone.shell.Shell.lambda$execute$0(Shell.java:95)
	at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:167)
	at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:157)
	at org.apache.hadoop.ozone.shell.Shell.execute(Shell.java:95)
	at picocli.CommandLine.execute(CommandLine.java:2174)
	at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:89)
	at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:80)
	at org.apache.hadoop.ozone.admin.OzoneAdmin.main(OzoneAdmin.java:36)
2025-07-08 10:44:48,478 [main] INFO utils.RetriableTask: Execution of task getCAList failed, will be retried in 10000 ms (retries forever)
```

After the fix
```
bash-5.1$ ozone admin container create
java.security.cert.CertificateException: org.apache.hadoop.security.AccessControlException: Permission denied.
```
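
For reviewers who want the fail-fast idea from the description in isolation, here is a minimal plain-Java sketch. It is not the code in this patch: `FailFastRetrySketch`, `callWithRetry`, and `isAccessControlFailure` are hypothetical names, and the nested `AccessControlException` is a local stand-in for `org.apache.hadoop.security.AccessControlException` so the example runs without Hadoop on the classpath. The PR itself applies the check inside the retry policy used by `HAUtils.getCAListWithRetry()`.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

public class FailFastRetrySketch {

  /** Local stand-in (assumption) for org.apache.hadoop.security.AccessControlException. */
  static class AccessControlException extends IOException {
    AccessControlException(String message) {
      super(message);
    }
  }

  /** Returns true if the throwable or any of its causes is an AccessControlException. */
  static boolean isAccessControlFailure(Throwable t) {
    // Bound the walk so a cyclic cause chain cannot loop forever.
    for (int depth = 0; t != null && depth < 32; depth++, t = t.getCause()) {
      if (t instanceof AccessControlException) {
        return true;
      }
    }
    return false;
  }

  /**
   * Retries the task with a fixed sleep until it succeeds, but rethrows
   * immediately when the failure is an access-control error, because
   * retrying cannot help until the user authenticates.
   */
  static <T> T callWithRetry(Callable<T> task, long sleepMillis) throws Exception {
    while (true) {
      try {
        return task.call();
      } catch (InterruptedException e) {
        // Preserve the interrupt status and stop retrying.
        Thread.currentThread().interrupt();
        throw e;
      } catch (Exception e) {
        if (isAccessControlFailure(e)) {
          throw e;
        }
        Thread.sleep(sleepMillis);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // A transient failure is retried until the task succeeds.
    AtomicInteger attempts = new AtomicInteger();
    String value = callWithRetry(() -> {
      if (attempts.incrementAndGet() < 3) {
        throw new IOException("SCM not reachable yet");
      }
      return "ca-list";
    }, 10L);
    System.out.println(value + " after " + attempts.get() + " attempts");

    // An access-control failure is surfaced on the first attempt.
    try {
      callWithRetry(() -> {
        throw new AccessControlException("Permission denied.");
      }, 10L);
    } catch (AccessControlException e) {
      System.out.println("failed fast: " + e.getMessage());
    }
  }
}
```

Walking the cause chain matters because, as in the "After the fix" output, the `AccessControlException` may reach the caller wrapped in another exception such as `CertificateException`.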
