regionserver should do basic health check before reporting alls-well to the
master
----------------------------------------------------------------------------------
Key: HBASE-611
URL: https://issues.apache.org/jira/browse/HBASE-611
Project: Hadoop HBase
Issue Type: Improvement
Affects Versions: 0.1.2
Reporter: stack
Priority: Minor
Fix For: 0.2.0
On IRC this afternoon, a user killed a regionserver, which did something to
HDFS. Another regionserver, one carrying the catalog tables, started to get
exceptions out of HDFS. The last thing it logged was:
{code}
[15:55] <jgray> 2008-05-01 15:49:51,710 FATAL
org.apache.hadoop.hbase.HRegionServer: Replay of hlog required. Forcing server
restart
[15:55] <jgray> org.apache.hadoop.hbase.DroppedSnapshotException: Could not get
block locations. Aborting...
{code}
That's fine.
Only it didn't go down. It was left in a state where it continued to ping the
master as though nothing was wrong, so its lease never timed out, and the
master was hosed because it couldn't get to the catalog tables.
Regionservers should do a basic check that all is healthy before they ping the
master. If critical threads have exited, or a flag saying HDFS has been found
bad has been set, then the regionserver should stop reporting to the master so
the master can deploy its load elsewhere.
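A minimal sketch of such a check, in Java. All names here (HealthGate,
markFilesystemBad, reportIfHealthy) are hypothetical and are not existing HBase
classes or methods. The sketch assumes the regionserver keeps a list of its
critical threads and that any thread can flip a shared flag when it finds the
filesystem bad.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical gate consulted before each report to the master. */
class HealthGate {
    // Set by any thread that hits an unrecoverable filesystem error.
    private final AtomicBoolean filesystemBad = new AtomicBoolean(false);
    // Threads the server cannot run without (assumed to be already started).
    private final List<Thread> criticalThreads;

    HealthGate(List<Thread> criticalThreads) {
        this.criticalThreads = new ArrayList<>(criticalThreads);
    }

    /** Called from wherever the filesystem failure is detected. */
    void markFilesystemBad() {
        filesystemBad.set(true);
    }

    /** Healthy means: filesystem not flagged bad and every critical thread alive. */
    boolean isHealthy() {
        if (filesystemBad.get()) {
            return false;
        }
        for (Thread t : criticalThreads) {
            if (!t.isAlive()) {
                return false;
            }
        }
        return true;
    }

    /**
     * Runs the report only while healthy and returns whether it was sent.
     * Skipping the report lets the lease on the master side time out.
     */
    boolean reportIfHealthy(Runnable reportToMaster) {
        if (!isHealthy()) {
            return false;
        }
        reportToMaster.run();
        return true;
    }
}
```

The reporting loop would call reportIfHealthy on each iteration instead of
pinging unconditionally. Once the gate returns false the pings stop, the lease
expires, and the master is free to reassign the regions.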