[
https://issues.apache.org/jira/browse/STORM-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989591#comment-14989591
]
Thomas Graves commented on STORM-1155:
--------------------------------------
[~LongdaFeng] I don't quite follow your question. Do you have a concern with
the proposed solution?
The proposed interface and patch allow the user to write any number of scripts
and place them in the health check directory to be run. I believe you can do
all the things you mention with these scripts, unless those things put the
system in such a bad state that it can't do anything at all. Yes, you have to
write the scripts yourself, but I think that makes sense given the number of
different setups people use.
[~basti.lj] thanks for the inquiry; I understand your concern. The interface
chosen here is the same one Hadoop supports. We chose it so the same scripts
could be used across both Hadoop clusters and Storm clusters. If the scripts
determine that the node is unhealthy and the supervisor shuts down, there is no
way to control or monitor the workers running on it, so it seems to make sense
that we kill the workers as well. If you don't want the supervisor to shut down
under certain conditions, those conditions shouldn't be in the health check
scripts. These checks are specifically for external or node-level health
problems, the idea being that the node is in a bad state and nothing should run
on it. I think there is another category of issues, like the ones you mention,
that would fall more under a blacklisting feature - STORM-909.
Do you have specific scenarios in mind? I think the memory issue you mention
should be resolved by the resource-aware scheduler.
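To illustrate the interface being discussed, a minimal health check script
following the Hadoop-style convention might look like the sketch below. The
disk-usage check and the 90% threshold are purely illustrative, not part of
the proposal:

```shell
#!/bin/sh
# check_disk: hypothetical node health check following the Hadoop-style
# convention. It prints a line starting with "ERROR" on stdout when the
# given usage percentage exceeds the threshold; a non-zero exit code
# would mean the check itself failed, not that the node is unhealthy.
check_disk() {
    usage="$1"      # current disk usage in %, e.g. parsed from `df -P /`
    threshold="$2"  # node is considered unhealthy above this %
    if [ "$usage" -gt "$threshold" ]; then
        echo "ERROR disk usage at ${usage}%, threshold is ${threshold}%"
    fi
    return 0
}

# In a real script the usage would come from the system, for example:
#   usage=$(df -P / | awk 'NR==2 {sub(/%/,"",$5); print $5}')
check_disk 95 90
```

Because only the ERROR line on stdout marks the node unhealthy, the same
script can be dropped into a Hadoop or a Storm health check directory
unchanged.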
> Supervisor recurring health checks
> ----------------------------------
>
> Key: STORM-1155
> URL: https://issues.apache.org/jira/browse/STORM-1155
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-core
> Reporter: Thomas Graves
> Assignee: Thomas Graves
>
> Add the ability for the supervisor to call out to health check scripts to
> allow some validation of the health of the node the supervisor is running on.
> It could regularly run scripts in a directory provided by the cluster admin.
> If any scripts fail, it should kill the workers and stop itself.
> This could work much like Hadoop's health check scripts: if ERROR is printed
> on stdout, the node has some issue and the supervisor should shut down.
> If a non-zero exit code is returned, it indicates that the script itself
> failed to execute properly, so the node should not be marked as unhealthy.
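The recurring behavior described above can be sketched as a simple runner
loop. This is a hypothetical illustration, not the actual supervisor code,
and the directory path is an assumed example:

```shell
#!/bin/sh
# Hypothetical sketch of the recurring health check loop described above.
HEALTHCHECK_DIR="/etc/storm/healthchecks"   # admin-provided directory (assumed path)

# run_checks: returns 1 if any script reports the node unhealthy, else 0.
run_checks() {
    for script in "$HEALTHCHECK_DIR"/*; do
        [ -x "$script" ] || continue
        out=$("$script" 2>/dev/null)
        status=$?
        # Non-zero exit: the script itself failed to run properly,
        # so do NOT mark the node unhealthy on its account.
        [ "$status" -ne 0 ] && continue
        # A line containing ERROR on stdout marks the node unhealthy.
        case "$out" in
            *ERROR*) return 1 ;;
        esac
    done
    return 0
}

# Example invocation; on failure the supervisor would kill its
# workers and stop itself:
#   if ! run_checks; then
#       ...kill workers and stop the supervisor...
#   fi
```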
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)