[
https://issues.apache.org/jira/browse/STORM-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14988870#comment-14988870
]
Basti Liu edited comment on STORM-1155 at 11/4/15 4:51 AM:
-----------------------------------------------------------
Hi Zhuo,
Yes, I agree that running workers might also hit unexpected errors when a
health check fails. But a failed check does not always mean the running
workers are in an abnormal state, e.g. when the node lacks enough system
memory for a new worker. Even under unexpected conditions such as no available
disk space or a lost network connection, a worker will kill itself or be
re-assigned by nimbus (a network connection failure will cause a heartbeat
timeout).
[~zhuoliu] [~tgraves]
So, do you think it is reasonable to classify the follow-up operation when a
supervisor health check fails, e.g. the user can configure which health check
scripts cause the supervisor to shut down, and which checks kill all running
workers?
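
To make the proposal concrete, here is a minimal sketch of a per-script
follow-up mapping. Everything in it is hypothetical: the script names, the
`FAILURE_ACTIONS` table, and the `follow_up` helper are illustrations, not
existing Storm configuration or API. It also assumes that scripts without an
explicit entry fall back to the behavior in the issue description (kill the
workers and stop the supervisor).

```python
from enum import Enum


class Action(Enum):
    """Follow-up operations, ordered from least to most severe."""
    NONE = 0
    SHUTDOWN_SUPERVISOR = 1        # stop the supervisor, leave workers running
    KILL_WORKERS_AND_SHUTDOWN = 2  # kill all running workers, then stop


# Hypothetical user configuration: which failed script triggers which action.
FAILURE_ACTIONS = {
    "check_memory.sh": Action.SHUTDOWN_SUPERVISOR,
    "check_disk.sh": Action.KILL_WORKERS_AND_SHUTDOWN,
}

# Assumed default for unlisted scripts, matching the issue description.
DEFAULT_ACTION = Action.KILL_WORKERS_AND_SHUTDOWN


def follow_up(failed_scripts):
    """Return the most severe action required by the failed scripts."""
    actions = [FAILURE_ACTIONS.get(name, DEFAULT_ACTION) for name in failed_scripts]
    return max(actions, key=lambda a: a.value, default=Action.NONE)
```

With this mapping, a failed memory check alone would only stop the supervisor,
while a failed disk check would also take down the running workers.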
> Supervisor recurring health checks
> ----------------------------------
>
> Key: STORM-1155
> URL: https://issues.apache.org/jira/browse/STORM-1155
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-core
> Reporter: Thomas Graves
> Assignee: Thomas Graves
>
> Add the ability for the supervisor to call out to health check scripts to
> allow some validation of the health of the node the supervisor is running on.
> It could regularly run scripts in a directory provided by the cluster admin.
> If any scripts fail, it should kill the workers and stop itself.
> This could work very much like the Hadoop scripts and if ERROR is returned on
> stdout it means the node has some issue and we should shut down.
> If a non-zero exit code is returned it indicates that the scripts failed to
> execute properly so you don't want to mark the node as unhealthy.
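
The rules in the description could be sketched roughly as follows. This is
only an illustration under stated assumptions: `HealthStatus` and
`interpret_result` are hypothetical names, "ERROR is returned on stdout" is
read as "some output line starts with ERROR", and a non-zero exit code is
given precedence over any ERROR output, since the description does not say
what happens when both occur.

```python
from enum import Enum


class HealthStatus(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"          # script ran and reported ERROR on stdout
    SCRIPT_FAILED = "script_failed"  # script did not execute properly


def interpret_result(exit_code: int, stdout: str) -> HealthStatus:
    """Classify one health check script run from its exit code and stdout."""
    # A non-zero exit code means the script itself failed to execute,
    # so the node is not marked unhealthy (assumed to take precedence).
    if exit_code != 0:
        return HealthStatus.SCRIPT_FAILED
    # Assumption: an output line beginning with "ERROR" signals a node issue.
    for line in stdout.splitlines():
        if line.startswith("ERROR"):
            return HealthStatus.UNHEALTHY
    return HealthStatus.HEALTHY
```

Only an `UNHEALTHY` result would lead to killing the workers and stopping the
supervisor; a `SCRIPT_FAILED` result would at most be logged.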