[ 
https://issues.apache.org/jira/browse/HBASE-20158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551862#comment-16551862
 ] 

Xu Cang commented on HBASE-20158:
---------------------------------

Nice JIRA, looking forward to seeing more info. I have some small comments 
below:

"However this won't work if the server's system resources have run out, for 
example no new native thread can be created, no new network connection can 
be set up, etc. Notice that although no new thread can be launched, running 
threads won't be affected, so the zookeeper session is still alive and the RS 
is still regarded as alive, but clients cannot access it since no new 
connection can be set up."

When this happens, the external script you mentioned will either fail or be 
unable to send metrics to the monitoring system. By alerting on failure 
datapoints or on missing data, I'd expect it is not too hard to catch the 
production issue quickly.
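
To illustrate what I mean by alerting on failure datapoints or missing data, 
here is a minimal sketch of the monitoring-side rule. All names here 
(MissingDataAlert, record, shouldAlert, maxSilence) are made up for this 
example and are not part of HBase or any monitoring system; it only assumes 
the external script reports one success/failure datapoint per run.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical monitoring-side rule, not an HBase or monitoring-system API.
final class MissingDataAlert {
    private final Duration maxSilence; // longest tolerated gap between datapoints
    private Instant lastSeen;          // time of the most recent datapoint
    private boolean lastSuccess;       // outcome carried by that datapoint

    MissingDataAlert(Duration maxSilence, Instant start) {
        this.maxSilence = maxSilence;
        this.lastSeen = start;
        this.lastSuccess = true; // assume healthy until told otherwise
    }

    // Called whenever the health script manages to report a datapoint.
    void record(Instant when, boolean success) {
        lastSeen = when;
        lastSuccess = success;
    }

    // Alert if the last datapoint was a failure, or if no datapoint
    // has arrived for longer than maxSilence.
    boolean shouldAlert(Instant now) {
        boolean silentTooLong =
            Duration.between(lastSeen, now).compareTo(maxSilence) > 0;
        return !lastSuccess || silentTooLong;
    }
}
```

With a rule like this, a script that cannot even start because the host is 
out of threads or connections still surfaces as an alert, just through the 
"missing data" branch rather than the "failure" branch.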

"In this new checker we won't launch any external script; instead we pick some 
regions on the RS and send some rpc requests to the server itself, regarding 
the server as unhealthy if the call failure ratio exceeds some limit, and send 
the metrics out to our monitoring system. For more details please refer to the 
coming patch."

We have seen cases where the RPC queue fills up and the server can no longer 
handle RPCs. This approach will suffer from that too.
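
For discussion, here is a rough sketch of a failure-ratio check as I read the 
description. It is not taken from the coming patch: FailureRatioCheck, 
isHealthy, and the probe callables are hypothetical stand-ins for "send some 
rpc requests to itself". The sketch assumes that a probe which does not 
finish before a deadline (for example because it is stuck in a full RPC 
queue) is counted as a failure; whether the patch does this is unknown to me.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; none of these names come from HBase or the patch.
final class FailureRatioCheck {
    private final double maxFailureRatio; // the "limit" on the failure ratio
    private final long timeoutMillis;     // deadline shared by all probes

    FailureRatioCheck(double maxFailureRatio, long timeoutMillis) {
        this.maxFailureRatio = maxFailureRatio;
        this.timeoutMillis = timeoutMillis;
    }

    // Runs every probe; errors, false results, and timeouts count as failures.
    boolean isHealthy(List<Callable<Boolean>> probes) {
        if (probes.isEmpty()) {
            return true; // nothing to probe
        }
        // A real checker would likely reuse pre-created threads, since new
        // threads may be impossible to create on an exhausted host.
        ExecutorService pool = Executors.newFixedThreadPool(probes.size());
        try {
            // invokeAll cancels probes still running when the timeout expires.
            List<Future<Boolean>> results =
                pool.invokeAll(probes, timeoutMillis, TimeUnit.MILLISECONDS);
            int failures = 0;
            for (Future<Boolean> f : results) {
                try {
                    if (f.isCancelled() || !Boolean.TRUE.equals(f.get())) {
                        failures++;
                    }
                } catch (ExecutionException e) {
                    failures++; // the probe threw an exception
                }
            }
            // Unhealthy only when the ratio exceeds the limit.
            return (double) failures / probes.size() <= maxFailureRatio;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false; // could not complete the check
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The deadline is the important part for the full-queue case: without it, the 
check itself would block behind the same queue and never report anything.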

> Enhance regionserver self health check to avoid stale server
> ------------------------------------------------------------
>
>                 Key: HBASE-20158
>                 URL: https://issues.apache.org/jira/browse/HBASE-20158
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Yu Li
>            Assignee: Yu Li
>            Priority: Major
>
> Currently we have many good metrics to monitor our cluster status, such as 
> totalCallTime/processCallTime/queueCallTime etc. But these metrics won't work 
> if the server becomes stale and client calls time out, for example during an 
> RS full GC or when there are bad disks on HDFS and read IO gets stuck.
> We also have a periodic health check chore introduced by HBASE-7351 which 
> allows us to launch an external script periodically to perform some self 
> detection. However this won't work if the server's system resources have run 
> out, for example no new native thread can be created, no new network 
> connection can be set up, etc. Notice that although no new thread can be 
> launched, running threads won't be affected, so the zookeeper session is 
> still alive and the RS is still regarded as alive, but clients cannot access 
> it since no new connection can be set up.
> Here we propose a new HealthChecker called DirectHealthChecker. In this new 
> checker we won't launch any external script; instead we pick some regions on 
> the RS and send some rpc requests to the server itself, regarding the server 
> as unhealthy if the call failure ratio exceeds some limit, and send the 
> metrics out to our monitoring system. For more details please refer to the 
> coming patch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
