[ https://issues.apache.org/jira/browse/HBASE-26468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rushabh Shah updated HBASE-26468:
---------------------------------
Description:
Observed this in our production cluster running version 1.6.
The RS crashed for some reason, but the process was still running. On further debugging, we found that one non-daemon thread was still running and was preventing the RS from exiting cleanly. Our clusters are managed by Ambari, which can restart crashed processes automatically. But since the process was still running and the pid file was present, Ambari couldn't do much either. There will always be some bug where we miss stopping a non-daemon thread, so there should be a maximum amount of time we wait for the region server thread before exiting.

Relevant code:
[HRegionServerCommandLine.java|https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServerCommandLine.java]
{code:java}
logProcessInfo(getConf());
HRegionServer hrs = HRegionServer.constructRegionServer(regionServerClass, conf);
hrs.start();
hrs.join(); // <----- This should be a timed join.
if (hrs.isAborted()) {
  throw new RuntimeException("HRegionServer Aborted");
}
{code}

was:
Observed this in our production cluster running version 1.6.
The RS crashed for some reason, but the process was still running. On further debugging, we found that one non-daemon thread was still running and was preventing the RS from stopping cleanly. Our clusters are managed by Ambari, which can restart crashed processes automatically. But since the process was still running and the pid file was present, Ambari couldn't do much either. There will always be some bug where we miss stopping a non-daemon thread, so there should be a maximum amount of time we wait for the region server thread before exiting.
Relevant code:
[HRegionServerCommandLine.java|https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServerCommandLine.java]
{code:java}
logProcessInfo(getConf());
HRegionServer hrs = HRegionServer.constructRegionServer(regionServerClass, conf);
hrs.start();
hrs.join(); // <----- This should be a timed join.
if (hrs.isAborted()) {
  throw new RuntimeException("HRegionServer Aborted");
}
{code}

> Region Server doesn't exit cleanly in case it crashes.
> -----------------------------------------------------
>
>                 Key: HBASE-26468
>                 URL: https://issues.apache.org/jira/browse/HBASE-26468
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 1.6.0
>            Reporter: Rushabh Shah
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
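The description proposes replacing the unbounded `hrs.join()` with a timed join. Below is a minimal, self-contained sketch of that idea. It is not HBase code. Plain `Thread` objects stand in for `HRegionServer` and for the leaked non-daemon thread, and the class name `TimedJoinSketch`, the helpers `joinWithTimeout` and `lingeringNonDaemonThreads`, and the 5-second `JOIN_TIMEOUT_MS` are all hypothetical. The sketch bounds the wait with `Thread.join(long)` and logs any non-daemon threads that are still alive, since these are what keep the JVM from exiting. It then calls `System.exit`, which terminates the JVM even while non-daemon threads are running.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch, not HBase code: the class, helper names and timeout are illustrative.
public class TimedJoinSketch {

  // Hypothetical upper bound on how long the launcher waits for the server thread.
  static final long JOIN_TIMEOUT_MS = 5_000;

  /** Waits at most timeoutMs for t to finish; returns true if t terminated in time. */
  static boolean joinWithTimeout(Thread t, long timeoutMs) {
    try {
      t.join(timeoutMs); // returns after timeoutMs even if t is still alive (0 would mean "forever")
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status for the caller
    }
    return !t.isAlive();
  }

  /** Live non-daemon threads other than the caller: these keep the JVM from exiting. */
  static List<Thread> lingeringNonDaemonThreads() {
    List<Thread> result = new ArrayList<>();
    for (Thread t : Thread.getAllStackTraces().keySet()) {
      if (t.isAlive() && !t.isDaemon() && t != Thread.currentThread()) {
        result.add(t);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    AtomicBoolean aborted = new AtomicBoolean(false);
    // Stand-in for the region server thread: it "crashes" right away.
    Thread server = new Thread(() -> aborted.set(true), "server");
    // Stand-in for a leaked non-daemon thread that nobody stops.
    Thread leaked = new Thread(() -> {
      try {
        Thread.sleep(Long.MAX_VALUE);
      } catch (InterruptedException ignored) {
        // exit quietly if interrupted
      }
    }, "leaked-worker");
    leaked.start();
    server.start();

    boolean finished = joinWithTimeout(server, JOIN_TIMEOUT_MS);
    for (Thread t : lingeringNonDaemonThreads()) {
      System.err.println("Non-daemon thread still running: " + t.getName());
    }
    // Returning from main here would leave the JVM running because of "leaked-worker";
    // System.exit terminates it regardless of live non-daemon threads.
    System.exit(finished && !aborted.get() ? 0 : 1);
  }
}
```

Running `main` prints a line for `leaked-worker` to stderr and exits with status 1. Without the `System.exit` call, the JVM would keep running. `System.exit` still runs shutdown hooks, so a hook that blocks can hang the exit. `Runtime.getRuntime().halt(int)` skips them, at the cost of skipping whatever cleanup they do.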