[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192961#comment-15192961
 ] 

ASF GitHub Bot commented on STORM-956:
--------------------------------------

Github user hustfxj commented on the pull request:

    https://github.com/apache/storm/pull/1209#issuecomment-196220384
  
    A spout emits messages through SpoutOutputCollector's emit(). If many 
messages fail, the acker makes the spout re-emit those failed messages through 
SpoutOutputCollector. This can lead to a deadlock: downstream bolts may be slow 
to handle messages, which blocks emit(), so the spout/acker thread blocks. 
Other messages sent by those bolts then cannot be handled by the acker, so the 
bolts block too. This scenario could be called a "loop deadlock". I think this 
PR is a sound answer to this scenario, because it lets us detect the deadlock 
in time.
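
    To make the "loop deadlock" concrete, here is a minimal toy model in plain 
Java. It is only a sketch under stated assumptions: two bounded queues stand in 
for the spout-to-bolt and bolt-to-acker transfer queues, and every name in it 
(LoopDeadlockSketch, emitUntilStalled, toBolt, toAcker) is hypothetical, not a 
Storm API. A timed offer() replaces the blocking emit() so the stall can be 
observed instead of hanging forever.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Toy model of the "loop deadlock"; none of these names are Storm APIs.
class LoopDeadlockSketch {

    /**
     * Emits up to maxMessages and returns how many were accepted before
     * an emit stalled for longer than timeoutMs.
     */
    static int emitUntilStalled(int maxMessages, int capacity, long timeoutMs)
            throws InterruptedException {
        // spout -> bolt queue (bounded, so an emit can block)
        BlockingQueue<Integer> toBolt = new ArrayBlockingQueue<>(capacity);
        // bolt -> acker queue; only the spout/acker thread would drain it
        BlockingQueue<Integer> toAcker = new ArrayBlockingQueue<>(capacity);

        Thread bolt = new Thread(() -> {
            try {
                while (true) {
                    Integer msg = toBolt.take();
                    toAcker.put(msg); // blocks once the acker queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        bolt.setDaemon(true);
        bolt.start();

        int accepted = 0;
        try {
            for (int i = 0; i < maxMessages; i++) {
                // A blocking emit would wait here indefinitely. The calling
                // thread never gets back to draining toAcker, which closes
                // the loop: bolt waits on acker, acker waits on bolt.
                if (!toBolt.offer(i, timeoutMs, TimeUnit.MILLISECONDS)) {
                    break;
                }
                accepted++;
            }
        } finally {
            bolt.interrupt(); // release the blocked bolt thread
        }
        return accepted;
    }

    public static void main(String[] args) throws InterruptedException {
        int accepted = emitUntilStalled(100, 2, 300);
        System.out.println("emit stalled after " + accepted + " messages");
    }
}
```

    With small queue capacities the emitting thread stalls after only a handful 
of messages, because the one thread that could drain the acker queue is the 
thread stuck in the emit.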


> When the execute() or nextTuple() hang on external resources, stop the 
> Worker's heartbeat
> -----------------------------------------------------------------------------------------
>
>                 Key: STORM-956
>                 URL: https://issues.apache.org/jira/browse/STORM-956
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>            Reporter: Chuanlei Ni
>            Assignee: Chuanlei Ni
>            Priority: Minor
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> Sometimes the work threads created by mk-threads in executor.clj hang on 
> external resources or for other unknown reasons. This makes the workers stop 
> processing tuples.  I think it is better to kill such a worker to resolve 
> the "hang". I plan to:
> 1. Like `setup-ticks`, send a system-tick to the receive-queue.
> 2. Have tuple-action-fn handle this system-tick and record in executor-data 
> the time at which it processed the tuple.
> 3. When the worker does its local heartbeat, check the time the executor 
> wrote to executor-data. If that time is too far in the past (for example, 
> 3 minutes), the worker skips the heartbeat, so the supervisor can deal with 
> the problem.
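
The three steps of the quoted plan can be sketched roughly as below. This is 
an illustration only, not the code in the PR: the actual change would live in 
the Clojure executor and worker code, and the class and method names here 
(HeartbeatWatchdogSketch, onSystemTick, shouldHeartbeat) are hypothetical. The 
clock is passed in as a parameter so the behavior is easy to follow.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical names throughout; this is not Storm's executor or worker code.
class HeartbeatWatchdogSketch {
    // Step 2: last time each executor processed a system tick.
    private final Map<String, Long> lastTickMs = new ConcurrentHashMap<>();
    // Longest silence tolerated before the heartbeat is withheld.
    private final long maxSilenceMs;

    HeartbeatWatchdogSketch(long maxSilenceMs) {
        this.maxSilenceMs = maxSilenceMs;
    }

    // Called from an executor thread when it handles a system tick.
    void onSystemTick(String executorId, long nowMs) {
        lastTickMs.put(executorId, nowMs);
    }

    // Step 3: the worker heartbeats only if every executor ticked recently.
    boolean shouldHeartbeat(long nowMs) {
        for (long last : lastTickMs.values()) {
            if (nowMs - last > maxSilenceMs) {
                return false; // at least one executor looks hung
            }
        }
        return true;
    }

    public static void main(String[] args) {
        HeartbeatWatchdogSketch w = new HeartbeatWatchdogSketch(180_000L);
        w.onSystemTick("spout-1", 0L);
        w.onSystemTick("bolt-2", 0L);
        System.out.println(w.shouldHeartbeat(60_000L));  // true
        w.onSystemTick("spout-1", 120_000L);
        // bolt-2 has been silent for 200 s, beyond the 3-minute limit
        System.out.println(w.shouldHeartbeat(200_000L)); // false
    }
}
```

Withholding the heartbeat, rather than killing the worker directly, leaves the 
restart decision to the supervisor, as the description proposes.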



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
