Joe McDonnell created IMPALA-13132:
--------------------------------------

             Summary: Ozone jobs see intermittent termination of Ozone manager / HMS fails to start
                 Key: IMPALA-13132
                 URL: https://issues.apache.org/jira/browse/IMPALA-13132
             Project: IMPALA
          Issue Type: Bug
          Components: Infrastructure
    Affects Versions: Impala 4.5.0
            Reporter: Joe McDonnell

Ozone jobs load data/metadata snapshots during dataload, then restart the cluster. On this restart, the HMS sometimes fails to come up:
{noformat}
16:04:13 --> Starting Hive Metastore Service
16:04:13 No handlers could be found for logger "thrift.transport.TSocket"
16:04:14 Waiting for the Metastore at localhost:9083...
...
16:09:14 Waiting for the Metastore at localhost:9083...
16:09:14 Metastore service failed to start within 300.0 seconds.{noformat}
In the metastore logs, we see messages like this:
{noformat}
2024-06-04T08:37:06,425 INFO [main] retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.ConnectException: Call From hostname/127.0.0.1 to localhost:9862 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.submitRequest over nodeId=null,nodeAddress=localhost:9862 after 1 failover attempts. Trying to failover after sleeping for 4000ms.{noformat}
The HMS is trying to talk to the Ozone manager at localhost:9862, and the connection is refused. The Ozone cluster was back up and running before the HMS start was attempted, but the Ozone manager then received a SIGTERM and shut down:
{noformat}
24/06/04 08:36:37 ERROR om.OzoneManagerStarter: RECEIVED SIGNAL 15: SIGTERM
24/06/04 08:36:37 INFO om.OzoneManagerStarter: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down OzoneManager at hostname/127.0.0.1
************************************************************/
24/06/04 08:36:37 INFO om.OzoneManager: om1[localhost:9862]: Stopping Ozone Manager{noformat}
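
When triaging a failed run, it may help to confirm whether the Ozone manager port was still accepting connections at the moment the HMS was started. The following is a minimal sketch of such a check, not the code Impala's scripts actually use: the helper name `wait_for_port` and its parameters are hypothetical, and the only values taken from the logs above are the ports (9862 for the Ozone manager, 9083 for the HMS).

```python
import socket
import time


def wait_for_port(host, port, timeout_s=300.0, interval_s=1.0):
    """Return True once host:port accepts a TCP connection, False on timeout.

    Hypothetical helper for illustration; it only probes the TCP port and
    says nothing about whether the service behind it is healthy.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError (e.g. ConnectionRefusedError)
            # when nothing is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For example, `wait_for_port("localhost", 9862, timeout_s=30)` could be called just before starting the HMS. A check like this only narrows down the timing; it would not explain where the SIGTERM to the Ozone manager came from.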