[
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15009935#comment-15009935
]
ASF GitHub Bot commented on STORM-956:
--------------------------------------
Github user kishorvpatil commented on the pull request:
https://github.com/apache/storm/pull/647#issuecomment-157558858
I think the spout and bolt should handle hangs themselves (or use timeouts
instead of making blocking calls). Also, the spout/bolt code should guard
against creating threads that can cause unhandled exceptions or hang-ups.
Forcing the worker to stop sending heartbeats would kill the other components
running on that worker, which is not desired.
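As a rough illustration of using a timeout instead of a blocking call, here
is a minimal Java sketch. `BoundedCall` and `callWithTimeout` are hypothetical
names, not part of Storm, and the sketch assumes the external call can be
wrapped in a `Callable`. The waiting thread gets control back after the
timeout even if the call itself never returns.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Storm: bounds how long the caller waits
// for a potentially blocking call to an external service.
final class BoundedCall {
    // Daemon threads, so a call that never returns cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "bounded-call");
        t.setDaemon(true);
        return t;
    });

    // Returns the call's result, or `fallback` if the call fails or does not
    // finish within `timeoutMs` milliseconds.
    static <T> T callWithTimeout(Callable<T> call, long timeoutMs, T fallback) {
        Future<T> future = POOL.submit(call);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the stuck call; this only helps if the call honors interrupts.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The call itself threw; the caller can retry on a later invocation.
            return fallback;
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return fallback;
        }
    }

    public static void main(String[] args) {
        String fast = callWithTimeout(() -> "tuple", 500, null);
        String slow = callWithTimeout(() -> { Thread.sleep(5_000); return "late"; }, 100, null);
        System.out.println(fast + " " + slow);
    }
}
```

A spout could call something like this from nextTuple() and simply emit
nothing when the fallback comes back, so the executor thread keeps running
while the external service is unavailable.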
Secondly, the worker should not be killed unless it is certain that the
problem lies in the process and not in an external service. E.g. if the Kafka
spout hangs, killing the worker forces it to be relaunched or rescheduled,
which may not solve the problem: the new worker process would still make
another blocking call and hang.
Thirdly, killing the worker forces a relaunch/reschedule, which makes the
topology unstable because all other workers in the loop have to reconnect to
the new worker. In large topologies that might become a bigger problem, lead
to domino effects, and make the topology take longer to settle.
-1
> When the execute() or nextTuple() hang on external resources, stop the
> Worker's heartbeat
> -----------------------------------------------------------------------------------------
>
> Key: STORM-956
> URL: https://issues.apache.org/jira/browse/STORM-956
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-core
> Reporter: Chuanlei Ni
> Assignee: Chuanlei Ni
> Priority: Minor
> Original Estimate: 6h
> Remaining Estimate: 6h
>
> Sometimes the work threads produced by mk-threads in executor.clj hang on
> external resources or for other unknown reasons. This makes the workers stop
> processing tuples. I think it is better to kill such a worker to resolve
> the "hang". I plan to:
> 1. like `setup-ticks`, send a system-tick to the receive-queue
> 2. have the tuple-action-fn handle this system-tick and record in
> executor-data the time at which it processed the tuple
> 3. when the worker does its local heartbeat, check the time the executor
> wrote to executor-data. If that time is too far in the past (for example, 3
> minutes), the worker does not do the heartbeat, so the supervisor can deal
> with the problem.
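
The three steps proposed above can be sketched roughly as follows. This is
only an illustration in Java: `ExecutorLiveness`, `recordTick` and
`shouldHeartbeat` are hypothetical names, and the code is not taken from
executor.clj or from the pull request. Timestamps are passed in explicitly so
the logic is easy to follow.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration of the proposal; not Storm's executor or worker code.
final class ExecutorLiveness {
    // Last time (in ms) each executor processed a system tick.
    private final Map<String, Long> lastTickMs = new ConcurrentHashMap<>();
    private final long maxSilenceMs;

    ExecutorLiveness(long maxSilenceMs) {
        this.maxSilenceMs = maxSilenceMs;
    }

    // Step 2: an executor records the time at which it handled a system tick.
    void recordTick(String executorId, long nowMs) {
        lastTickMs.put(executorId, nowMs);
    }

    // Step 3: the worker heartbeats only if every known executor has handled
    // a tick within the allowed window.
    boolean shouldHeartbeat(long nowMs) {
        for (long last : lastTickMs.values()) {
            if (nowMs - last > maxSilenceMs) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ExecutorLiveness liveness = new ExecutorLiveness(180_000L); // 3 minutes
        liveness.recordTick("spout-1", 0L);
        liveness.recordTick("bolt-1", 0L);
        liveness.recordTick("bolt-1", 200_000L); // bolt-1 keeps ticking, spout-1 does not
        System.out.println(liveness.shouldHeartbeat(200_000L));
    }
}
```

In this sketch a single silent executor withholds the heartbeat for the whole
worker, which is the behavior the review comment objects to.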
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)