[ 
https://issues.apache.org/jira/browse/STORM-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989591#comment-14989591
 ] 

Thomas Graves commented on STORM-1155:
--------------------------------------

[~LongdaFeng]  I don't quite follow your question.  Do you have a concern with 
the proposed solution?
The proposed interface and patch allow the user to write any number of scripts 
and place them in the health check dir to be run.  I believe you can do all 
the things you mention with these scripts, unless those things put the system 
in such a bad state that the scripts can't do anything.  Yes, you have to write 
them yourself, but I think that makes sense given the number of different 
setups people use. 
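
As an illustration of the kind of script an admin might place in the health 
check dir, here is a minimal sketch of a disk-space check. The 5% threshold, 
the checked path "/" and the function name are assumptions made for this 
example, not part of the patch. The script prints a line starting with ERROR 
when the node looks unhealthy and exits with 0 either way, since the issue 
description treats a non-zero exit code as a script failure rather than as an 
unhealthy node.

```python
import shutil

# Hypothetical threshold: report the node unhealthy below 5% free space.
MIN_FREE_FRACTION = 0.05


def disk_report(total_bytes, free_bytes, min_free_fraction=MIN_FREE_FRACTION):
    """Return the single line this check writes to stdout."""
    free_fraction = free_bytes / total_bytes
    if free_fraction < min_free_fraction:
        # A line starting with ERROR is what marks the node unhealthy.
        return "ERROR only {:.1%} of disk space is free".format(free_fraction)
    return "OK {:.1%} of disk space is free".format(free_fraction)


if __name__ == "__main__":
    # "/" is an assumed mount point; a real check would use the relevant one.
    usage = shutil.disk_usage("/")
    print(disk_report(usage.total, usage.free))
```

Because the interface matches the Hadoop one, a script like this could in 
principle be shared between a Hadoop cluster and a Storm cluster.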

[~basti.lj]  Thanks for the inquiry. I understand your concern.  The interface 
chosen here is the same as what Hadoop supports.  We chose this so scripts 
could be used across both Hadoop clusters and Storm clusters.  If the scripts 
determine the node is unhealthy and shut down the supervisor, then there is no 
way to control or monitor the workers running on it, so it seems to make sense 
that we kill the workers also.  If you don't want to shut down the supervisor 
on certain conditions, then those conditions shouldn't be checked in the 
health check scripts.  These checks are specifically for external things or 
node health, and the idea is that the node is in a bad state so nothing should 
run on it.  I think there is another category of issues, like the ones you 
mention, that would fall more into a blacklisting feature - STORM-909.

Do you have specific scenarios in mind?  I think the memory one you mention 
should be resolved with the resource aware scheduler.  

> Supervisor recurring health checks
> ----------------------------------
>
>                 Key: STORM-1155
>                 URL: https://issues.apache.org/jira/browse/STORM-1155
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>
> Add the ability for the supervisor to call out to health check scripts to 
> allow some validation of the health of the node the supervisor is running on.
> It could regularly run scripts in a directory provided by the cluster admin. 
> If any scripts fail, it should kill the workers and stop itself.
> This could work very much like the Hadoop scripts and if ERROR is returned on 
> stdout it means the node has some issue and we should shut down.
> If a non-zero exit code is returned it indicates that the scripts failed to 
> execute properly so you don't want to mark the node as unhealthy.
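
The rules in the description above can be sketched roughly as follows. This 
is only an illustration, not the code in the patch: the function names, the 
status strings and the 30 second timeout are hypothetical, and two details 
are assumptions the description does not settle. First, the exit code is 
checked before stdout, so a script that exits non-zero never marks the node 
unhealthy. Second, a script that cannot be started or that times out is 
treated as a failed script, not as an unhealthy node.

```python
import os
import subprocess

HEALTHY = "healthy"
UNHEALTHY = "unhealthy"
SCRIPT_FAILED = "script_failed"


def classify(exit_code, stdout):
    """Map one script's exit code and stdout to a status."""
    if exit_code != 0:
        # The script itself failed to run properly: do not blame the node.
        return SCRIPT_FAILED
    for line in stdout.splitlines():
        # Assumption: a line starting with ERROR reports an unhealthy node.
        if line.startswith("ERROR"):
            return UNHEALTHY
    return HEALTHY


def run_health_checks(check_dir, timeout_secs=30):
    """Run every executable file in check_dir and classify each result."""
    results = {}
    for name in sorted(os.listdir(check_dir)):
        path = os.path.join(check_dir, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            continue
        try:
            proc = subprocess.run(
                [path], capture_output=True, text=True, timeout=timeout_secs
            )
            results[name] = classify(proc.returncode, proc.stdout)
        except (OSError, subprocess.TimeoutExpired):
            # Assumption: launch errors and timeouts count as script failures.
            results[name] = SCRIPT_FAILED
    return results


def node_is_unhealthy(results):
    """True if any script reported ERROR; failed scripts are ignored."""
    return any(status == UNHEALTHY for status in results.values())
```

A supervisor loop would call run_health_checks periodically and, when 
node_is_unhealthy returns True, kill the workers and stop itself.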



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
