[ https://issues.apache.org/jira/browse/HDDS-13353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ivan Andika updated HDDS-13353:
-------------------------------
        Parent: (was: HDDS-12651)
    Issue Type: Bug  (was: Sub-task)

> SCM stuck in safe mode due to exceptions in node resolver
> ---------------------------------------------------------
>
>                 Key: HDDS-13353
>                 URL: https://issues.apache.org/jira/browse/HDDS-13353
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>
> Our cluster uses org.apache.hadoop.net.ScriptBasedMapping as its net.topology.node.switch.mapping.impl implementation.
> However, when net.topology.script.file.name points to a file that the SCM process cannot access, the SCM does not respond to the datanode's register call (see the configuration sketch after the stack trace below). This causes the SCM to be stuck in safe mode indefinitely, since datanodes cannot (re-)register and therefore cannot send the subsequent container reports, etc. Furthermore, the datanode only reports a SocketTimeoutException, which can be misleading, since the SCM is not responding at all rather than there being a network issue.
> The SCM exception stack looks like:
> {code:java}
> java.io.IOException: Cannot run program "/path/to/script.py" (in directory "/<redacted>"): error=13, Permission denied
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:938)
>     at org.apache.hadoop.util.Shell.run(Shell.java:901)
>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
>     at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:273)
>     at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:208)
>     at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)
>     at org.apache.hadoop.hdds.scm.node.SCMNodeManager.nodeResolve(SCMNodeManager.java:1283)
>     at org.apache.hadoop.hdds.scm.node.SCMNodeManager.register(SCMNodeManager.java:397)
>     at org.apache.hadoop.hdds.scm.server.SCMDatanodeProtocolServer.register(SCMDatanodeProtocolServer.java:231)
>     at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolServerSideTranslatorPB.register(StorageContainerDatanodeProtocolServerSideTranslatorPB.java:85)
>     at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolServerSideTranslatorPB.processMessage(StorageContainerDatanodeProtocolServerSideTranslatorPB.java:119)
>     at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>     at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolServerSideTranslatorPB.submitRequest(StorageContainerDatanodeProtocolServerSideTranslatorPB.java:92)
>     at org.apache.hadoop.hdds.protocol.proto.StorageContainerDatanodeProtocolProtos$StorageContainerDatanodeProtocolService$2.callBlockingMethod(StorageContainerDatanodeProtocolProtos.java:43636)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:491)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:611)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1146)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1300)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1193)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2031)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3587)
> Caused by: java.io.IOException: error=13, Permission denied
>     at java.lang.UNIXProcess.forkAndExec(Native Method)
>     at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
>     at java.lang.ProcessImpl.start(ProcessImpl.java:134)
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
>     ... 24 more
> {code}
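> For context, the resolution path that fails above can be exercised with a sketch along the following lines. This is a minimal illustration only: the script path and hostname are placeholders, and in the real deployment the SCM instantiates the mapping class named by net.topology.node.switch.mapping.impl via reflection rather than constructing it directly.
> {code:java}
> // Minimal sketch of the topology resolution SCM performs during datanode
> // registration. Placeholder values; not taken from the actual cluster config.
> import java.util.Collections;
> import java.util.List;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.net.ScriptBasedMapping;
>
> public class TopologyResolveSketch {
>   public static void main(String[] args) {
>     Configuration conf = new Configuration();
>     // Same keys the cluster uses; the values here are placeholders.
>     conf.set("net.topology.node.switch.mapping.impl",
>         "org.apache.hadoop.net.ScriptBasedMapping");
>     conf.set("net.topology.script.file.name", "/path/to/script.py");
>
>     // ScriptBasedMapping shells out to the configured script. If the SCM
>     // process user cannot execute it, the script invocation fails with the
>     // "error=13, Permission denied" IOException shown above and resolution
>     // does not produce a usable network location.
>     ScriptBasedMapping mapping = new ScriptBasedMapping();
>     mapping.setConf(conf);
>     List<String> locations =
>         mapping.resolve(Collections.singletonList("dn-host.example.com"));
>     System.out.println(locations);
>   }
> }
> {code}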
> The DN exception stack looks like:
> {code:java}
> 2025-06-30 16:26:21,795 [EndpointStateMachine task thread for <redacted>/<redacted> - 0 ] WARN org.apache.hadoop.ozone.container.common.statemachine.EndpointStateMachine: Unable to communicate to SCM server at <redacted>:9861 for past 0 seconds.
> java.net.SocketTimeoutException: Call From <redacted>/<redacted> to <redacted>:9861 failed on socket timeout exception: java.net.SocketTimeoutException: 5000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/<redacted>:38530 remote=<redacted>/<redacted>9861]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:931)
>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:866)
>     at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1583)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1511)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1402)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:255)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:135)
>     at com.sun.proxy.$Proxy38.submitRequest(Unknown Source)
>     at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolClientSideTranslatorPB.submitRequest(StorageContainerDatanodeProtocolClientSideTranslatorPB.java:117)
>     at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolClientSideTranslatorPB.sendHeartbeat(StorageContainerDatanodeProtocolClientSideTranslatorPB.java:149)
>     at org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask.call(HeartbeatEndpointTask.java:184)
>     at org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask.call(HeartbeatEndpointTask.java:86)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.SocketTimeoutException: 5000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.74.25.77:38530 remote=o<redacted>/<redacted>:9861]
>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>     at java.io.FilterInputStream.read(FilterInputStream.java:83)
>     at java.io.FilterInputStream.read(FilterInputStream.java:83)
>     at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:524)
>     at java.io.DataInputStream.readInt(DataInputStream.java:387)
>     at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1908)
>     at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1182)
>     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1071)
> {code}
> Ideally, the SCM should return an exception to the DN so that the DN can log the message and possibly retry.
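> One possible direction, sketched below, is to catch the topology-resolution failure inside the register path and turn it into an explicit error response for the datanode instead of leaving the call unanswered. The class and method shapes here (RegisterSketch, RegisterResult, the simplified parameter list) are hypothetical placeholders, not the actual SCMNodeManager / SCMDatanodeProtocolServer signatures.
> {code:java}
> // Hedged sketch only: simplified placeholder types, not the real Ozone API.
> import java.util.Collections;
> import java.util.List;
>
> import org.apache.hadoop.net.DNSToSwitchMapping;
>
> public class RegisterSketch {
>
>   /** Minimal stand-in for the register response sent back to the datanode. */
>   static final class RegisterResult {
>     final boolean success;
>     final String message;
>     RegisterResult(boolean success, String message) {
>       this.success = success;
>       this.message = message;
>     }
>   }
>
>   private final DNSToSwitchMapping mapping;
>
>   RegisterSketch(DNSToSwitchMapping mapping) {
>     this.mapping = mapping;
>   }
>
>   /**
>    * Register a datanode. If network-location resolution fails (for example
>    * because the net.topology.script.file.name script is not executable by
>    * the SCM user), return an explicit error so the datanode can log it and
>    * retry, instead of the datanode only seeing a SocketTimeoutException.
>    */
>   RegisterResult register(String dnHostName) {
>     List<String> locations;
>     try {
>       locations = mapping.resolve(Collections.singletonList(dnHostName));
>     } catch (RuntimeException e) {
>       locations = null; // treat resolver exceptions the same as no result
>     }
>     if (locations == null || locations.isEmpty()) {
>       return new RegisterResult(false,
>           "Failed to resolve network location for " + dnHostName
>               + "; check net.topology.script.file.name permissions on SCM");
>     }
>     // ... continue normal registration using locations.get(0) ...
>     return new RegisterResult(true, "registered at " + locations.get(0));
>   }
> }
> {code}
> With something along these lines, the datanode would receive a descriptive failure for its register request rather than timing out, which would also make the misconfiguration easier to spot from the DN logs.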