Bryan Beaudreault created HBASE-28365:
-----------------------------------------
Summary: ChaosMonkey batch suspend/resume action assume shell implementation
Key: HBASE-28365
URL: https://issues.apache.org/jira/browse/HBASE-28365
Project: HBase
Issue Type: Bug
Reporter: Bryan Beaudreault

These two actions have code like this:

{code:java}
case SUSPEND:
  server = serversToBeSuspended.remove();
  try {
    suspendRs(server);
  } catch (Shell.ExitCodeException e) {
    LOG.warn("Problem suspending but presume successful; code={}", e.getExitCode(), e);
  }
  suspendedServers.add(server);
  break;
{code}

This only catches Shell.ExitCodeException, but operators may have an implementation of ClusterManager which does not use shell. We should expand this to catch all exceptions.

The implication here is that any other exception propagates uncaught, and we never add the server to suspendedServers. If the suspension actually succeeded, this leaves the affected processes in a permanently suspended state until manual intervention occurs.
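For illustration only, here is a minimal standalone sketch of the broadened catch described above. It is not the actual patch: `SuspendSketch`, `Suspender`, and `suspendNext` are hypothetical names standing in for the ChaosMonkey action and for whatever call `suspendRs` makes through ClusterManager, and plain `System.err` stands in for the logger.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

public class SuspendSketch {
  /** Hypothetical stand-in for the ClusterManager-backed suspend call. */
  interface Suspender {
    void suspend(String server) throws Exception;
  }

  /**
   * Suspends the next queued server and records it as suspended no matter
   * which exception type the suspender throws.
   */
  static void suspendNext(Queue<String> serversToBeSuspended,
      List<String> suspendedServers, Suspender suspender) {
    String server = serversToBeSuspended.remove();
    try {
      suspender.suspend(server);
    } catch (Exception e) {
      // Broader than Shell.ExitCodeException, so a non-shell implementation
      // that fails with some other exception type is handled the same way.
      System.err.println("Problem suspending " + server + " but presume successful: " + e);
    }
    // Always tracked, so a later resume step can still find this server.
    suspendedServers.add(server);
  }

  public static void main(String[] args) {
    Queue<String> pending = new ArrayDeque<>(Arrays.asList("rs1", "rs2"));
    List<String> suspended = new ArrayList<>();
    // Simulates a non-shell implementation failing with an unrelated exception.
    suspendNext(pending, suspended, s -> { throw new IllegalStateException("non-shell failure"); });
    suspendNext(pending, suspended, s -> { });
    System.out.println(suspended); // prints [rs1, rs2]
  }
}
```

With the narrower catch, the first call would have thrown out of `suspendNext` and `rs1` would never have reached the suspended list, which is the failure mode described in this issue.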