[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012690#comment-15012690
 ] 

ASF GitHub Bot commented on STORM-956:
--------------------------------------

Github user bastiliu commented on the pull request:

    https://github.com/apache/storm/pull/647#issuecomment-157935012
  
    @revans2 Yes, I agree that this check is helpful for finding the problem 
spout/bolt. My point here is that the solution could be improved.
    1. On timeout, it would be better to raise a warning (e.g. show a warning 
in the web UI), because we have seen some topologies that need to block in 
execute()/nextTuple() to wait for some essential initialization. For example, 
when the database connection in a bolt is down, the user may want to wait 
until the reconnection is done.
    2. The triggering mechanism of the "last-active-time" timeout should be 
changed. The current implementation puts a "last-active-time" trigger tuple 
into the receive queue, and the spout/bolt updates "last-active-time" when it 
retrieves the trigger tuple from the queue. But there may already be many 
tuples in the receive queue before the trigger tuple is added, so the 
spout/bolt must first process all of the tuples queued ahead of it. Processing 
that backlog can take a long time and can cause the timeout even if the 
processing time of each individual tuple is short. From the user's point of 
view, that is unexpected.
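
To make point 2 concrete, here is a rough sketch in plain Java, not Storm code. It simulates a FIFO receive queue with a backlog ahead of the trigger tuple and reports how much simulated time passes before the trigger is seen. The class name, the TRIGGER marker, and both helper methods are made up for illustration only.

```java
import java.util.ArrayDeque;
import java.util.Queue;

final class TriggerDelaySketch {
    /** Hypothetical marker standing in for the "last-active-time" trigger tuple. */
    static final String TRIGGER = "__last_active_trigger";

    /**
     * Drains the queue in FIFO order and returns the simulated time (ms) at
     * which the trigger is dequeued, i.e. when "last-active-time" would be
     * refreshed. Returns -1 if the queue holds no trigger.
     */
    static long timeUntilTriggerSeen(Queue<String> receiveQueue, long perTupleMillis) {
        long clock = 0;
        while (!receiveQueue.isEmpty()) {
            String tuple = receiveQueue.poll();
            if (TRIGGER.equals(tuple)) {
                return clock;
            }
            clock += perTupleMillis; // simulated cost of one execute() call
        }
        return -1;
    }

    /** Builds a queue with `backlog` ordinary tuples followed by the trigger. */
    static Queue<String> backlogThenTrigger(int backlog) {
        Queue<String> queue = new ArrayDeque<>();
        for (int i = 0; i < backlog; i++) {
            queue.add("tuple-" + i);
        }
        queue.add(TRIGGER);
        return queue;
    }
}
```

With a backlog of 100,000 tuples at 2 ms each, timeUntilTriggerSeen returns 200,000 ms, which is already past a 3-minute (180,000 ms) timeout even though no single tuple was slow.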


> When the execute() or nextTuple() hang on external resources, stop the 
> Worker's heartbeat
> -----------------------------------------------------------------------------------------
>
>                 Key: STORM-956
>                 URL: https://issues.apache.org/jira/browse/STORM-956
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>            Reporter: Chuanlei Ni
>            Assignee: Chuanlei Ni
>            Priority: Minor
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> Sometimes the work threads produced by mk-threads in executor.clj hang on 
> external resources or for other unknown reasons. This makes the workers stop 
> processing tuples. I think it is better to kill such a worker to resolve 
> the "hang". I plan to:
> 1. Like `setup-ticks`, send a system-tick to the receive-queue.
> 2. Have the tuple-action-fn handle this system-tick and record the time at 
> which it processes the tuple in the executor-data.
> 3. When the worker does its local heartbeat, check the time the executor 
> wrote to executor-data. If that time is too far in the past (for example, 
> 3 minutes), the worker does not do the heartbeat, so the supervisor can 
> deal with the problem.
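
The three steps in the description above could be sketched roughly as follows. This is plain Java for illustration, not the executor.clj implementation; the class, its methods, and the use of a map keyed by executor id are all assumptions. Timestamps are passed in explicitly so the logic can be checked without a real clock.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class HangCheckSketch {
    // Hypothetical stand-in for the per-executor time kept in executor-data.
    private final Map<String, Long> lastActiveMillis = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    HangCheckSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Step 2: the executor records when it processed the system-tick. */
    void onSystemTick(String executorId, long nowMillis) {
        lastActiveMillis.put(executorId, nowMillis);
    }

    /**
     * Step 3: the worker heartbeats only if every known executor has
     * processed a system-tick within the timeout.
     */
    boolean shouldHeartbeat(long nowMillis) {
        for (long last : lastActiveMillis.values()) {
            if (nowMillis - last > timeoutMillis) {
                return false; // skip the heartbeat so the supervisor can react
            }
        }
        return true;
    }
}
```

For example, with a 180,000 ms timeout, an executor that last recorded a tick at time 0 still allows a heartbeat at 180,000 ms but blocks it at 180,001 ms.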



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
