[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204367#comment-15204367
 ] 

ASF GitHub Bot commented on STORM-956:
--------------------------------------

Github user srdo commented on the pull request:

    https://github.com/apache/storm/pull/1209#issuecomment-199321366
  
    The hang checks should now support writing errors to ZooKeeper, extending 
the timeout by interacting with an OutputCollector, setting different time 
limits per component, and disabling the checks entirely by setting the 
time limit or check frequency to null. I took a quick look at the metrics 
system, but I can't see a clean way of logging to it when the worker may be 
shutting down at the moment the check is triggered.
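
As a rough illustration of the per-component limits and the null-disables-the-check behaviour described above, here is a minimal sketch. The class `HangTimeouts`, its method names, and the component ids are hypothetical and are not taken from the pull request; the sketch only assumes that an explicit per-component value overrides a topology-wide default, and that null at either level turns the check off.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Storm: resolves the hang time limit for a component.
public class HangTimeouts {
    // Topology-wide default in seconds; null means hang checks are disabled by default.
    private final Integer defaultSecs;
    // Per-component overrides; a key mapped to null disables the check for that component.
    private final Map<String, Integer> perComponentSecs = new HashMap<>();

    public HangTimeouts(Integer defaultSecs) {
        this.defaultSecs = defaultSecs;
    }

    public void setForComponent(String componentId, Integer secs) {
        perComponentSecs.put(componentId, secs);
    }

    // Returns an empty Optional when the check is disabled for this component.
    public Optional<Integer> timeoutFor(String componentId) {
        if (perComponentSecs.containsKey(componentId)) {
            return Optional.ofNullable(perComponentSecs.get(componentId));
        }
        return Optional.ofNullable(defaultSecs);
    }

    public static void main(String[] args) {
        HangTimeouts timeouts = new HangTimeouts(180);
        timeouts.setForComponent("slow-bolt", 600);        // longer limit for one component
        timeouts.setForComponent("unchecked-spout", null); // check disabled for this one
        System.out.println(timeouts.timeoutFor("slow-bolt"));
        System.out.println(timeouts.timeoutFor("unchecked-spout").isPresent());
        System.out.println(timeouts.timeoutFor("any-other-bolt"));
    }
}
```

Using an explicit "key present but null" entry keeps "disabled for this component" distinct from "no override, fall back to the default".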
    
    I'm not sure the automatic/manual hang timeout resets are really necessary 
on SpoutOutputCollector, since I don't see a case where a user would want to 
hang in nextTuple while still emitting tuples. Let me know if they should be 
removed.
    
    I think this PR is ready for re-review.


> When the execute() or nextTuple() hang on external resources, stop the 
> Worker's heartbeat
> -----------------------------------------------------------------------------------------
>
>                 Key: STORM-956
>                 URL: https://issues.apache.org/jira/browse/STORM-956
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>            Reporter: Chuanlei Ni
>            Assignee: Chuanlei Ni
>            Priority: Minor
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> Sometimes the work threads produced by mk-threads in executor.clj hang on 
> external resources or for other unknown reasons. This stops the workers from 
> processing tuples. I think it is better to kill the worker to resolve 
> the "hang". I plan to:
> 1. Like `setup-ticks`, send a system-tick to the receive-queue.
> 2. Have the tuple-action-fn handle this system-tick and record in 
> executor-data the time at which the tuple was processed.
> 3. When the worker does its local heartbeat, check the time the executor 
> wrote to executor-data. If that time is too far in the past (for example, 
> 3 minutes), the worker skips the heartbeat, so the supervisor can deal with 
> the problem.
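
The three steps in the issue description can be sketched roughly as follows. This is a minimal, self-contained illustration and not Storm code: the class `HangCheck`, its methods, and the executor ids are hypothetical, and timestamps are passed in explicitly so the logic is easy to follow. It assumes each executor records the time it last handled a system tick, and that the worker skips its heartbeat when any recorded time is older than the limit.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the proposed check; names are illustrative only.
public class HangCheck {
    // Last time (in ms) each executor processed a system tick.
    private final Map<String, Long> lastTickMillis = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    public HangCheck(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Step 2: the executor calls this when it processes a system tick.
    public void recordTick(String executorId, long nowMillis) {
        lastTickMillis.put(executorId, nowMillis);
    }

    // Step 3: the worker calls this before heartbeating; false means
    // at least one executor has not processed a tick within the limit.
    public boolean shouldHeartbeat(long nowMillis) {
        for (long last : lastTickMillis.values()) {
            if (nowMillis - last > timeoutMillis) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        HangCheck check = new HangCheck(3 * 60 * 1000L); // 3 minutes, as in the example
        check.recordTick("bolt-1", 0L);
        System.out.println(check.shouldHeartbeat(60_000L));  // 1 minute later: true
        System.out.println(check.shouldHeartbeat(200_000L)); // over 3 minutes later: false
    }
}
```

When `shouldHeartbeat` returns false the worker simply stops heartbeating, which leaves the decision to restart it with the supervisor, as the description proposes.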



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
