[jira] [Commented] (STORM-1155) Supervisor recurring health checks

ASF GitHub Bot (JIRA) Wed, 04 Nov 2015 16:45:45 -0800

    [ 
https://issues.apache.org/jira/browse/STORM-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990811#comment-14990811
 ]


ASF GitHub Bot commented on STORM-1155:
---------------------------------------

Github user longdafeng commented on the pull request:

    https://github.com/apache/storm/pull/849#issuecomment-153917289
  
    The code looks fine to me, but this is only the first step toward health 
checks:
    (1) Where to put the scripts: we had better put them under 
$STORM_HOME/healthcheck, so that they are installed on every node when Storm 
is installed.
    (2) The scripts depend on the OS: on Windows they should be xxx, on Linux 
they should be XXX.
    Maybe we need some OS detection to determine which OS is running, so that 
the "STORM_HEALTH_CHECK_DIR" directory contains several subdirectories 
(windows/linux/mac) and each OS uses its own.
    (3) Could you please help implement healthcheck.clj in Java? Before long, 
Storm will have two cores, one in Clojure and the other in Java.
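
To illustrate points (1) and (2), here is a minimal sketch of per-OS script directory selection. It is not code from the pull request: the class and method names are hypothetical, the $STORM_HOME/healthcheck/<os> layout is only the proposal made in the comment, and treating every non-Windows, non-Mac system as "linux" is an assumption.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper; not part of Storm.
public class HealthCheckDirSketch {

    // Maps a JVM "os.name" value to one of the subdirectory names
    // suggested in the comment (windows/linux/mac).
    static String osSubdir(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.startsWith("windows")) {
            return "windows";
        }
        if (os.startsWith("mac")) {
            return "mac";
        }
        // Assumption: any other OS falls back to the linux scripts.
        return "linux";
    }

    // Builds <stormHome>/healthcheck/<os>, the layout proposed in the comment.
    static Path scriptDir(String stormHome, String osName) {
        return Paths.get(stormHome, "healthcheck", osSubdir(osName));
    }

    public static void main(String[] args) {
        // Falls back to the current directory if STORM_HOME is not set.
        String stormHome = System.getenv().getOrDefault("STORM_HOME", ".");
        System.out.println(scriptDir(stormHome, System.getProperty("os.name")));
    }
}
```

Using startsWith rather than a substring match avoids misclassifying names such as "Darwin", which contains "win".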


> Supervisor recurring health checks
> ----------------------------------
>
>                 Key: STORM-1155
>                 URL: https://issues.apache.org/jira/browse/STORM-1155
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>
> Add the ability for the supervisor to call out to health check scripts to 
> allow some validation of the health of the node the supervisor is running on.
> It could regularly run scripts in a directory provided by the cluster admin. 
> If any scripts fail, it should kill the workers and stop itself.
> This could work very much like the Hadoop scripts and if ERROR is returned on 
> stdout it means the node has some issue and we should shut down.
> If a non-zero exit code is returned it indicates that the scripts failed to 
> execute properly so you don't want to mark the node as unhealthy.
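
The rules in the issue description can be sketched as a small classifier. This is an illustration under stated assumptions, not the supervisor's actual code: the class, enum, and method names are hypothetical, matching ERROR at the start of an output line follows the Hadoop convention the issue alludes to, and checking the exit code before the output is one possible reading of the description. The run method needs Java 9 or later and has no timeout.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not part of Storm.
public class HealthCheckResultSketch {

    enum Outcome { HEALTHY, UNHEALTHY, SCRIPT_FAILED }

    // Classifies one script run from its exit code and captured stdout.
    static Outcome classify(int exitCode, String stdout) {
        // A non-zero exit code means the script itself failed to execute
        // properly, so the node is not marked unhealthy.
        if (exitCode != 0) {
            return Outcome.SCRIPT_FAILED;
        }
        // Assumption: ERROR must appear at the start of an output line.
        for (String line : stdout.split("\\R")) {
            if (line.startsWith("ERROR")) {
                return Outcome.UNHEALTHY;
            }
        }
        return Outcome.HEALTHY;
    }

    // Runs one command, captures stdout, discards stderr, and classifies it.
    static Outcome run(String... command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        String stdout = new String(
                process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        return classify(exitCode, stdout);
    }
}
```

Under the issue's proposal, only an UNHEALTHY outcome would lead the supervisor to kill its workers and stop itself; a SCRIPT_FAILED outcome would at most be logged.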



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
