[ https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012690#comment-15012690 ]
ASF GitHub Bot commented on STORM-956:
--------------------------------------

Github user bastiliu commented on the pull request:

    https://github.com/apache/storm/pull/647#issuecomment-157935012

    @revans2 Yes, I agree that this check is helpful for finding the problem spout/bolt. My point is that the solution could be improved.

    1. On timeout, it is better to raise a warning (e.g. show a warning on the web UI), because we have seen some topologies that may need to block in execute()/nextTuple() while waiting for some essential initialization. For example, the database connection in a bolt is down, and the user would like to wait until the reconnection is done.

    2. The triggering mechanism of the "last-active-time" timeout should be updated. The current implementation puts a "last-active-time" tuple into the receiving queue, and the spout/bolt updates "last-active-time" when it retrieves that trigger tuple from the receiving queue. But there may already be many tuples in the receiving queue before the "last-active-time" trigger tuple is added, so the spout/bolt must first process every tuple that was queued ahead of the trigger tuple. Processing all of those tuples might take a long time and cause the timeout, even if the processing time of each individual tuple is short. From the user's point of view, that is unexpected.

> When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat
> -----------------------------------------------------------------------------------------
>
>                 Key: STORM-956
>                 URL: https://issues.apache.org/jira/browse/STORM-956
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>            Reporter: Chuanlei Ni
>            Assignee: Chuanlei Ni
>            Priority: Minor
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> Sometimes the work threads produced by mk-threads in executor.clj hang on
> external resources or for other unknown reasons. This makes the workers stop
> processing tuples. I think it is better to kill such a worker to resolve
> the "hang". I plan to:
> 1. Like `setup-ticks`, send a system-tick to the receive-queue.
> 2. Have the tuple-action-fn handle this system-tick and record the time at
> which it processed the tuple in the executor-data.
> 3. When the worker does its local heartbeat, check the time the executor wrote
> to executor-data. If that time is far from the current time (for example,
> 3 minutes), the worker does not do the heartbeat, so the supervisor can deal
> with the problem.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
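
The queueing concern raised in point 2 of the comment can be illustrated with a small simulation. This is only a sketch in Python, not Storm code: the function names, the backlog size, and the 2 ms per-tuple cost are assumptions made for illustration, and the 180-second timeout is taken from the "3 minutes" example in the issue description. It models a FIFO receive queue in which the trigger tuple is enqueued behind an existing backlog, then applies the staleness check from step 3 of the plan.

```python
from collections import deque

TIMEOUT_SECS = 180  # the "3 minutes" example from the issue description


def time_until_trigger_processed(backlog_size, per_tuple_secs):
    """Simulate a FIFO receive queue and return the simulated time at which
    the trigger tuple is reached (hypothetical model, not Storm's queue)."""
    queue = deque(["data"] * backlog_size)
    queue.append("last-active-time")  # trigger lands behind the backlog
    clock = 0.0
    while queue:
        item = queue.popleft()
        if item == "last-active-time":
            return clock  # the executor would record last-active-time here
        clock += per_tuple_secs  # cost of handling one ordinary tuple
    raise RuntimeError("trigger tuple was never enqueued")


def worker_should_heartbeat(last_active, now, timeout=TIMEOUT_SECS):
    """Step 3 of the plan: heartbeat only if the executor was active recently."""
    return (now - last_active) <= timeout


# Assumed workload: 100,000 queued tuples at 2 ms each. Every tuple is fast,
# yet the trigger tuple waits roughly 200 simulated seconds.
delay = time_until_trigger_processed(100_000, 0.002)
healthy = worker_should_heartbeat(last_active=0.0, now=delay)
print(delay > TIMEOUT_SECS, healthy)
```

Under these assumed numbers the worker would skip its heartbeat even though no single tuple is slow, which is the unexpected behavior the comment describes.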