Bryan Beaudreault created HBASE-28365:
-----------------------------------------

             Summary: ChaosMonkey batch suspend/resume actions assume shell implementation
                 Key: HBASE-28365
                 URL: https://issues.apache.org/jira/browse/HBASE-28365
             Project: HBase
          Issue Type: Bug
            Reporter: Bryan Beaudreault


These two actions have code like this:
{code:java}
case SUSPEND:
  server = serversToBeSuspended.remove();
  try {
    suspendRs(server);
  } catch (Shell.ExitCodeException e) {
    LOG.warn("Problem suspending but presume successful; code={}", e.getExitCode(), e);
  }
  suspendedServers.add(server);
  break;
{code}
The catch block handles only Shell.ExitCodeException, but operators may have an 
implementation of ClusterManager that does not use a shell and so throws other 
exception types. We should broaden the catch to cover all exceptions.

The consequence is that any other exception propagates, and we never add the 
server to suspendedServers. If the suspension actually succeeded, the affected 
processes stay suspended until someone intervenes manually.
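
A minimal, self-contained sketch of the proposed change, not the actual patch. The names SuspendSketch, ServerSuspender and suspendNext are hypothetical stand-ins for the action class, ClusterManager and the SUSPEND case; only serversToBeSuspended and suspendedServers come from the snippet above. It assumes that recording the server after any failure is the desired behavior, as the existing "presume successful" log message suggests.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class SuspendSketch {

  /** Hypothetical stand-in for a ClusterManager that may or may not use a shell. */
  public interface ServerSuspender {
    void suspend(String server) throws Exception;
  }

  /**
   * Suspends the next queued server and records it as suspended even when the
   * call throws, so that a later resume step can still find it.
   */
  public static void suspendNext(Queue<String> serversToBeSuspended,
      Set<String> suspendedServers, ServerSuspender suspender) {
    String server = serversToBeSuspended.remove();
    try {
      suspender.suspend(server);
    } catch (Exception e) {
      // Broad catch instead of Shell.ExitCodeException only: a non-shell
      // implementation can fail with any exception type.
      System.err.println("Problem suspending " + server + " but presume successful: " + e);
    }
    // Always reached, so the server stays tracked for a later resume.
    suspendedServers.add(server);
  }

  public static void main(String[] args) {
    Queue<String> pending = new ArrayDeque<>();
    pending.add("rs1");
    Set<String> suspended = new HashSet<>();
    // Simulates a non-shell implementation failing with an unrelated exception type.
    suspendNext(pending, suspended, s -> {
      throw new IllegalStateException("rpc failed");
    });
    System.out.println(suspended); // prints [rs1]
  }
}
```

The same shape would apply to the resume case. One trade-off of catching Exception is that it also swallows InterruptedException, so a real patch may want to handle that case separately.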



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
