[ https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204367#comment-15204367 ]
ASF GitHub Bot commented on STORM-956:
--------------------------------------

Github user srdo commented on the pull request:

    https://github.com/apache/storm/pull/1209#issuecomment-199321366

    The hang checks should now support writing errors to Zookeeper, extending the timeout by interacting with an OutputCollector, setting different time limits per component, and disabling the checks entirely by setting the timelimit/check frequency to null.

    I took a quick look at the metrics system, but can't really see a nice way of logging to it if we're potentially shutting down the worker when this system is triggered.

    I'm not sure the automatic/manual hang timeout resets are really necessary on SpoutOutputCollector, since I don't see a case where a user would want to hang in nextTuple while still emitting tuples. Let me know if they should be removed.

    I think this PR is ready for re-review.

> When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat
> -----------------------------------------------------------------------------------------
>
>                 Key: STORM-956
>                 URL: https://issues.apache.org/jira/browse/STORM-956
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>            Reporter: Chuanlei Ni
>            Assignee: Chuanlei Ni
>            Priority: Minor
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> Sometimes the work threads produced by mk-threads in executor.clj hang on
> external resources or other unknown reasons. This makes the workers stop
> processing the tuples. I think it is better to kill this worker to resolve
> the "hang". I plan to :
> 1. like `setup-ticks`, send a system-tick to receive-queue
> 2. the tuple-action-fn deal with this system-tick and remember the time that
> processes this tuple in the executor-data
> 3. when worker do local heartbeat, check the time the executor writes to
> executor-data. If the time is long from current (for example, 3 minutes), the
> worker does not do the heartbeat. So the supervisor could deal with this
> problem.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
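
For readers following the three-step plan in the issue description, here is a minimal Java sketch of the idea. It is not the code from the pull request or from storm-core: the class `HangWatchdog`, its methods, and the injected clock are all hypothetical names chosen for illustration. It assumes each executor records a timestamp whenever it handles a system tick, and that the worker consults those timestamps before writing its local heartbeat.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical sketch of the plan in STORM-956; none of these names exist in storm-core.
class HangWatchdog {
    // Last time (ms) each executor handled a system tick (the "executor-data" timestamp).
    private final Map<String, Long> lastTickMs = new ConcurrentHashMap<>();
    private final long timeoutMs;
    private final LongSupplier clock; // injected so the check can be exercised deterministically

    HangWatchdog(long timeoutMs, LongSupplier clock) {
        this.timeoutMs = timeoutMs;
        this.clock = clock;
    }

    // Step 2: the executor thread calls this when it processes a system tick.
    // It should also be called once at executor start-up so every executor is tracked.
    void recordTick(String executorId) {
        lastTickMs.put(executorId, clock.getAsLong());
    }

    // Step 3: the worker calls this before its local heartbeat and skips the
    // heartbeat if any tracked executor has not ticked within the timeout.
    boolean shouldHeartbeat() {
        long now = clock.getAsLong();
        for (long last : lastTickMs.values()) {
            if (now - last > timeoutMs) {
                return false;
            }
        }
        return true;
    }
}
```

With a 3 minute timeout (180000 ms, the example figure from the description), a worker would keep heartbeating while every executor keeps ticking, and would stop as soon as one executor falls more than 3 minutes behind, leaving the supervisor to restart the worker.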