[ https://issues.apache.org/jira/browse/STORM-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989591#comment-14989591 ]
Thomas Graves commented on STORM-1155:
--------------------------------------

[~LongdaFeng] I don't quite follow your question. Do you have a concern with the proposed solution? The proposed interface and patch allow the user to write any number of scripts and place them in the health check directory to be run. I believe you can do all of the things you mention with these scripts, unless those things put the system in such a bad state that it can't do anything. Yes, you have to write them yourself, but I think that makes sense given the number of different setups people use.

[~basti.lj] thanks for the inquiry. I understand your concern. The interface chosen here is the same as what Hadoop supports; we chose it so the same scripts could be used across both Hadoop clusters and Storm clusters. If the scripts determine the node is unhealthy and shut down the supervisor, there is no way to control or monitor the workers running on it, so it seems to make sense that we kill the workers as well. If you don't want the supervisor shut down under certain conditions, those conditions shouldn't be in the health check scripts. These checks are specifically for external issues or node health, the idea being that the node is in a bad state and nothing should run on it. I think there is another category of issues, like the ones you mention, that would fall more under a blacklisting feature (STORM-909). Do you have specific scenarios in mind? I think the memory issue you mention should be resolved by the resource aware scheduler.

> Supervisor recurring health checks
> ----------------------------------
>
>                 Key: STORM-1155
>                 URL: https://issues.apache.org/jira/browse/STORM-1155
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>
> Add the ability for the supervisor to call out to health check scripts to
> allow some validation of the health of the node the supervisor is running on.
> It could regularly run scripts in a directory provided by the cluster admin.
> If any script fails, the supervisor should kill its workers and stop itself.
> This could work very much like the Hadoop health check scripts: if ERROR is
> returned on stdout, the node has some issue and the supervisor should shut
> down. If a non-zero exit code is returned, it indicates that the script
> failed to execute properly, so the node should not be marked unhealthy.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
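The ERROR-on-stdout / non-zero-exit convention described above can be sketched as a minimal check script. The disk check, the `health_check` function name, and the 95% threshold are illustrative assumptions, not part of the patch; only the output convention comes from the issue:

```shell
#!/bin/sh
# Sketch of a recurring health check script, following the Hadoop-style
# convention described in the issue:
#   - printing a line that starts with "ERROR" on stdout marks the node
#     unhealthy, so the supervisor kills its workers and stops itself
#   - a non-zero exit code means the check itself failed to run, and the
#     node is NOT marked unhealthy

health_check() {
  # Illustrative check: flag the node when root-filesystem usage exceeds 95%.
  used_pct=$1
  if [ "$used_pct" -gt 95 ]; then
    echo "ERROR root filesystem is ${used_pct}% full"
  fi
  return 0  # the check ran fine, whether or not it printed ERROR
}

# Feed in the live usage from df (POSIX -P output; column 5 is "Use%").
health_check "$(df -P / | awk 'NR==2 { sub(/%/, "", $5); print $5 }')"
```

A script like this would be dropped into the admin-provided health check directory; exiting non-zero instead would signal that the check itself is broken, not the node.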