[ https://issues.apache.org/jira/browse/HBASE-26468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447467#comment-17447467 ]
Duo Zhang commented on HBASE-26468:
-----------------------------------

Maybe we could add a delay? For example, if the process does not exit within 30 seconds, we call System.exit to force it to quit, and the return value should be something other than 0 to indicate that this is a forced termination.

> Region Server doesn't exit cleanly incase it crashes.
> -----------------------------------------------------
>
>                 Key: HBASE-26468
>                 URL: https://issues.apache.org/jira/browse/HBASE-26468
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 1.6.0
>            Reporter: Rushabh Shah
>            Assignee: Rushabh Shah
>            Priority: Major
>             Fix For: 2.5.0, 3.0.0-alpha-2, 1.7.2, 2.3.8, 2.4.9
>
>
> Observed this in our production cluster running version 1.6.
> The RS crashed for some reason, but the process was still running. On
> debugging further, we found that one non-daemon thread was still running and
> was preventing the RS from exiting cleanly. Our clusters are managed by Ambari
> and have auto-restart capability. But since the process was still running and
> the pid file was present, Ambari couldn't do much either. There will be bugs
> where we fail to stop some non-daemon thread. The shutdown hook will not be
> called unless one of the following 2 conditions is met:
> # The Java virtual machine shuts down in response to two kinds of events:
> The program exits normally, when the last non-daemon thread exits or when the
> exit (equivalently, System.exit) method is invoked, or
> # The virtual machine is terminated in response to a user interrupt, such as
> typing ^C, or a system-wide event, such as user logoff or system shutdown.
> Consider the first condition: the last non-daemon thread exits or the exit
> method is invoked.
> Below is the code snippet from
> [HRegionServerCommandLine.java|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServerCommandLine.java#L51]
> {code:java}
>   private int start() throws Exception {
>     try {
>       if (LocalHBaseCluster.isLocal(conf)) {
>         // Ignore this.
>       } else {
>         HRegionServer hrs = HRegionServer.constructRegionServer(regionServerClass, conf);
>         hrs.start();
>         hrs.join();
>         if (hrs.isAborted()) {
>           throw new RuntimeException("HRegionServer Aborted");
>         }
>       }
>     } catch (Throwable t) {
>       LOG.error("Region server exiting", t);
>       return 1;
>     }
>     return 0;
>   }
> {code}
> Within HRegionServer, there is a subtle difference between a server being
> aborted and a server being stopped. If it is stopped, then isAborted will
> return false and start() will return 0.
> Below is the code from
> [ServerCommandLine.java|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/util/ServerCommandLine.java#L147]
> {code:java}
>   public void doMain(String args[]) {
>     try {
>       int ret = ToolRunner.run(HBaseConfiguration.create(), this, args);
>       if (ret != 0) {
>         System.exit(ret);
>       }
>     } catch (Exception e) {
>       LOG.error("Failed to run", e);
>       System.exit(-1);
>     }
>   }
> {code}
> If the return code is 0, then it won't call System.exit. This means the JVM
> will not run the shutdown hook until all non-daemon threads have stopped,
> which means an infinite wait if we don't stop all non-daemon threads cleanly.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
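
The delayed forced exit suggested in the comment could be sketched roughly as follows. This is only an illustration, not the patch that was committed for this issue: the class name ForcedExitWatchdog, the method arm, and the exit code 2 are all hypothetical. The idea is a daemon thread that sleeps for the grace period and then invokes an exit action with a non-zero code. Because the watchdog is itself a daemon thread, it cannot keep the JVM alive, so a process whose non-daemon threads all finish in time still exits normally.

```java
import java.util.function.IntConsumer;

/** Hypothetical helper; the name and exit code are assumptions, not HBase APIs. */
final class ForcedExitWatchdog {
  /** Assumed non-zero code marking a forced termination. */
  static final int FORCED_EXIT_CODE = 2;

  private ForcedExitWatchdog() {
  }

  /**
   * Starts a daemon thread that waits delayMillis and then calls exitAction
   * with FORCED_EXIT_CODE. Interrupting the returned thread disarms it.
   */
  static Thread arm(long delayMillis, IntConsumer exitAction) {
    Thread watchdog = new Thread(() -> {
      try {
        Thread.sleep(delayMillis);
      } catch (InterruptedException e) {
        return; // disarmed before the grace period elapsed
      }
      // Grace period is over and the JVM is still alive: force the exit.
      exitAction.accept(FORCED_EXIT_CODE);
    }, "forced-exit-watchdog");
    // A daemon thread does not prevent a normal JVM shutdown.
    watchdog.setDaemon(true);
    watchdog.start();
    return watchdog;
  }
}
```

Under these assumptions, the path in doMain where the return code is 0 could call something like ForcedExitWatchdog.arm(30_000, System::exit) before returning. Note that System.exit still runs the shutdown hooks; if a hook itself could hang, Runtime.getRuntime().halt(int) would be the stricter choice, since it skips them.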