[ https://issues.apache.org/jira/browse/STORM-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989667#comment-14989667 ]
Robert Joseph Evans commented on STORM-1155:
--------------------------------------------
[~LongdaFeng] we have run into all of these situations too, and are glad to see
we are not alone in this. Hopefully the resource-aware scheduling that we have
been working on, along with the cgroup support from JStorm, will help take care
of some issues with bad topologies, but there will always be hardware that dies
or is in the process of dying and slows everything down. Storm is especially
sensitive to these slow nodes, as it does not yet have speculative execution.
> Supervisor recurring health checks
> ----------------------------------
>
> Key: STORM-1155
> URL: https://issues.apache.org/jira/browse/STORM-1155
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-core
> Reporter: Thomas Graves
> Assignee: Thomas Graves
>
> Add the ability for the supervisor to call out to health check scripts to
> allow some validation of the health of the node the supervisor is running on.
> It could regularly run scripts in a directory provided by the cluster admin.
> If any scripts fail, it should kill the workers and stop itself.
> This could work very much like the Hadoop scripts and if ERROR is returned on
> stdout it means the node has some issue and we should shut down.
> If a non-zero exit code is returned it indicates that the scripts failed to
> execute properly so you don't want to mark the node as unhealthy.
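
The rules in the description above could be sketched roughly as follows. This
is only an illustration in Python, not the Storm implementation. The names
check_script and node_is_healthy, the 30 second timeout, and the choice to
match lines that start with ERROR are all assumptions. The description does
not say what to do when a script both prints ERROR and exits non-zero, or when
it hangs. This sketch assumes both cases count as a script failure rather than
an unhealthy node.

```python
import os
import subprocess


def check_script(argv, timeout=30.0):
    """Run one health check command and classify the outcome.

    Returns "healthy", "unhealthy", or "script_failed".
    """
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        # Assumption: a script that cannot start or hangs says nothing
        # reliable about the node, so it is not counted as unhealthy.
        return "script_failed"
    if result.returncode != 0:
        # Non-zero exit: the script itself failed to execute properly.
        return "script_failed"
    # Assumption: a stdout line starting with ERROR flags a bad node.
    if any(line.startswith("ERROR") for line in result.stdout.splitlines()):
        return "unhealthy"
    return "healthy"


def node_is_healthy(script_dir, timeout=30.0):
    """Run every executable file in script_dir; False if any reports ERROR."""
    for name in sorted(os.listdir(script_dir)):
        path = os.path.join(script_dir, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            continue  # skip subdirectories and non-executable files
        if check_script([path], timeout) == "unhealthy":
            return False
    return True
```

A supervisor following the proposal would call something like
node_is_healthy on a recurring timer and, on a False result, kill its workers
and stop itself.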