[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2016-04-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232170#comment-15232170
 ] 

ASF GitHub Bot commented on STORM-956:
--

Github user srdo commented on the pull request:

https://github.com/apache/storm/pull/1209#issuecomment-207432109
  
@hustfxj Please take another look when you have time, I think this feature 
is done :)

The test failure seems to occur intermittently in storm-kafka and looks 
unrelated. It ran for me locally with all-tests.


> When the execute() or nextTuple() hang on external resources, stop the 
> Worker's heartbeat
> -
>
> Key: STORM-956
> URL: https://issues.apache.org/jira/browse/STORM-956
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-core
> Reporter: Chuanlei Ni
> Assignee: Chuanlei Ni
> Priority: Minor
> Original Estimate: 6h
> Remaining Estimate: 6h
>
> Sometimes the work threads produced by mk-threads in executor.clj hang on 
> external resources or for other unknown reasons. This makes the workers stop 
> processing tuples. I think it is better to kill the worker to resolve 
> the "hang". I plan to:
> 1. Like `setup-ticks`, send a system-tick to the receive-queue.
> 2. Have the tuple-action-fn handle this system-tick and record the time at 
> which it processed the tuple in the executor-data.
> 3. When the worker does its local heartbeat, check the time the executor wrote 
> to executor-data. If that time is too far in the past (for example, 3 minutes), 
> the worker does not send the heartbeat, so the supervisor can deal with the 
> problem.
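
The three steps in the description above can be sketched in Java as follows. This is only an illustration, not the code from the pull request: the class `HangDetector`, its methods, and the use of string executor IDs are all hypothetical, and timestamps are passed in explicitly so the logic is easy to check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of the proposed hang check; none of these names come from Storm. */
class HangDetector {
    private final long timeoutMs;
    // Last time each executor was seen processing the system tick (step 2).
    private final Map<String, Long> lastProgressMs = new ConcurrentHashMap<>();

    HangDetector(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Step 2: called from the executor thread when it handles the system tick. */
    void recordProgress(String executorId, long nowMs) {
        lastProgressMs.put(executorId, nowMs);
    }

    /** Step 3: executors whose last recorded progress is older than the timeout. */
    List<String> hangingExecutors(long nowMs) {
        List<String> hanging = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastProgressMs.entrySet()) {
            if (nowMs - entry.getValue() > timeoutMs) {
                hanging.add(entry.getKey());
            }
        }
        return hanging;
    }

    /** The worker sends its heartbeat only while no executor looks hung. */
    boolean shouldHeartbeat(long nowMs) {
        return hangingExecutors(nowMs).isEmpty();
    }
}
```

With a timeout of 180000 ms (the 3 minutes from the example), an executor that last recorded progress at time 0 is reported as hanging at time 200000 and `shouldHeartbeat` returns false. In this sketch the worker would then skip its heartbeat and leave recovery to the supervisor.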



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2016-04-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232147#comment-15232147
 ] 

ASF GitHub Bot commented on STORM-956:
--

Github user srdo closed the pull request at:

https://github.com/apache/storm/pull/1209




[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2016-04-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232148#comment-15232148
 ] 

ASF GitHub Bot commented on STORM-956:
--

GitHub user srdo reopened a pull request:

https://github.com/apache/storm/pull/1209

STORM-956: When the execute() or nextTuple() hang on external resources, 
stop the Worker's heartbeat

The previous PR at https://github.com/apache/storm/pull/647 doesn't look 
active anymore. Having Storm tell you which components are backing up would 
still be a nice feature to have.

I've taken a look at implementing the suggestions from the previous PR, but 
I have a few questions.

The previous discussion seemed to point toward shutting down the worker 
when an executor is hanging. I'm guessing there's no nice way to just restart 
the hanging executors? Is it sufficient to call shutdown on the worker object 
from do-executor-heartbeats?

I'm not really sure what Constants/SYSTEM_EXECUTOR_ID is for? Should it be 
ignored when checking for hanging executors?

I'm hoping to add the zookeeper/metrics logging and shutdown functionality 
soon if the idea of this PR is sound.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srdo/storm STORM-956

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/storm/pull/1209.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1209


commit c0d1c4ef6ae0d1e144f5af85174d68d5a93eb06a
Author: chuanlei 
Date:   2015-07-22T07:37:28Z

stop worker heartbeat, when the executor threads hang-on

commit 16980a3e4e015865348afee7661157cc9a21525a
Author: chuanlei 
Date:   2015-07-22T08:55:39Z

add the setup-check! to mk-threads

commit 9884c578fe8fa85197b1e5d4118598425160bb3f
Author: Stig Døssing 
Date:   2016-03-13T14:57:27Z

Merge branch 'master' of https://github.com/apache/storm into STORM-956

commit 9dd030396b0d921f25c5269e17c58b649387211d
Author: Stig Døssing 
Date:   2016-03-13T18:58:29Z

STORM-956: Add support for warning about hanging executors

commit af0a56df27f6d4765dd868cf85c0633832cd8a72
Author: Stig Døssing 
Date:   2016-03-14T20:45:12Z

STORM-956: Put hang check in its own function, added worker shutdown call, 
scheduled hang check interval to match lowest configured timeout.

commit 9bb475213b18d1bbac9277f04dba8381e7a2fa2a
Author: Stig Døssing 
Date:   2016-03-15T19:29:00Z

STORM-956: Log error in Zookeeper when executor is hanging

commit 1a7fd227eade0c205acc1a23aa80ce3e3b845818
Author: Stig Døssing 
Date:   2016-03-18T12:03:21Z

Merge branch 'master' of https://github.com/apache/storm into STORM-956

commit 159a169e2cdf475fb69a3895fc354b9729d0bb6f
Author: Stig Døssing 
Date:   2016-03-20T09:14:46Z

STORM-956: Added support for extending hang timeout via outputcollectors. 
Added tests for zk error logging, per-component configuration, disabling hang 
checks and hang checks warning and shutting down worker properly.

commit 5000a78fa5b7a43f49e4dbdec7ccf7f87714cb70
Author: Stig Døssing 
Date:   2016-03-21T14:55:49Z

Add comment to Config about disabling hang checking

commit 3396fdc48647a49b1c44d59c3cbe09c098376c4a
Author: Stig Rohde Døssing 
Date:   2016-03-24T21:04:19Z

Merge branch 'master' of https://github.com/apache/storm into STORM-956

commit 76b090746baca87719de5fdbeedbb4a8a7f75aed
Author: Stig Rohde Døssing 
Date:   2016-04-04T13:14:03Z

Merge branch 'master' of https://github.com/apache/storm into STORM-956

commit b6f963387b6ef4d9153481dbd1faf502bbabecf5
Author: Stig Rohde Døssing 
Date:   2016-04-07T07:32:34Z

Merge branch 'master' of https://github.com/apache/storm into STORM-956

commit bb2585f1db71e7523b5a62572b540e4773805a53
Author: Stig Rohde Døssing 
Date:   2016-04-08T10:49:38Z

Merge branch 'master' of https://github.com/apache/storm into STORM-956

commit f717555bf4e3c4c8bf2ba0639694c4486ceb4e73
Author: Stig Rohde Døssing 
Date:   2016-04-08T12:37:46Z

STORM-956: Remove automatic notifyNotHanging from outputcollector methods





[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2016-03-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15204480#comment-15204480
 ] 

ASF GitHub Bot commented on STORM-956:
--

Github user srdo commented on the pull request:

https://github.com/apache/storm/pull/1209#issuecomment-199348649
  
Sorry, pressed the close button by accident.




[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2016-03-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15204478#comment-15204478
 ] 

ASF GitHub Bot commented on STORM-956:
--

Github user srdo closed the pull request at:

https://github.com/apache/storm/pull/1209




[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2016-03-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15204477#comment-15204477
 ] 

ASF GitHub Bot commented on STORM-956:
--

Github user srdo commented on the pull request:

https://github.com/apache/storm/pull/1209#issuecomment-199348274
  
The test failure looks to be the same as on master 
https://travis-ci.org/apache/storm/jobs/117200843. It ran locally for me with 
all-tests.




[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2016-03-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15204368#comment-15204368
 ] 

ASF GitHub Bot commented on STORM-956:
--

GitHub user srdo reopened a pull request:

https://github.com/apache/storm/pull/1209


commit c0d1c4ef6ae0d1e144f5af85174d68d5a93eb06a
Author: chuanlei 
Date:   2015-07-22T07:37:28Z

stop worker heartbeat, when the executor threads hang-on

commit 16980a3e4e015865348afee7661157cc9a21525a
Author: chuanlei 
Date:   2015-07-22T08:55:39Z

add the setup-check! to mk-threads

commit 9884c578fe8fa85197b1e5d4118598425160bb3f
Author: Stig Døssing 
Date:   2016-03-13T14:57:27Z

Merge branch 'master' of https://github.com/apache/storm into STORM-956

commit 9dd030396b0d921f25c5269e17c58b649387211d
Author: Stig Døssing 
Date:   2016-03-13T18:58:29Z

STORM-956: Add support for warning about hanging executors

commit af0a56df27f6d4765dd868cf85c0633832cd8a72
Author: Stig Døssing 
Date:   2016-03-14T20:45:12Z

STORM-956: Put hang check in its own function, added worker shutdown call, 
scheduled hang check interval to match lowest configured timeout.

commit 9bb475213b18d1bbac9277f04dba8381e7a2fa2a
Author: Stig Døssing 
Date:   2016-03-15T19:29:00Z

STORM-956: Log error in Zookeeper when executor is hanging

commit 1a7fd227eade0c205acc1a23aa80ce3e3b845818
Author: Stig Døssing 
Date:   2016-03-18T12:03:21Z

Merge branch 'master' of https://github.com/apache/storm into STORM-956

commit 159a169e2cdf475fb69a3895fc354b9729d0bb6f
Author: Stig Døssing 
Date:   2016-03-20T09:14:46Z

STORM-956: Added support for extending hang timeout via outputcollectors. 
Added tests for zk error logging, per-component configuration, disabling hang 
checks and hang checks warning and shutting down worker properly.






[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2016-03-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15204367#comment-15204367
 ] 

ASF GitHub Bot commented on STORM-956:
--

Github user srdo commented on the pull request:

https://github.com/apache/storm/pull/1209#issuecomment-199321366
  
The hang checks should now support writing errors to Zookeeper, extending 
the timeout by interacting with an OutputCollector, setting different time 
limits per component, and disabling the checks entirely by setting the 
timelimit/check frequency to null. I took a quick look at the metrics system, 
but can't really see a nice way of logging to it if we're potentially shutting 
down the worker when this system is triggered.

I'm not sure the automatic/manual hang timeout resets are really necessary 
on SpoutOutputCollector, since I don't see a case where a user would want to 
hang in nextTuple while still emitting tuples. Let me know if they should be 
removed.

I think this PR is ready for re-review.
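
As a rough illustration of the configuration behaviour described in this comment (a different time limit per component, and a null limit disabling the check), here is a small Java sketch. The class `HangCheckConfig` and its methods are hypothetical and are not taken from the pull request or from Storm's `Config` class.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-component hang timeout lookup; a null timeout disables the check. */
class HangCheckConfig {
    private final Long defaultTimeoutSecs;
    // Per-component overrides; a null value means "never treat this component as hung".
    private final Map<String, Long> perComponent = new HashMap<>();

    HangCheckConfig(Long defaultTimeoutSecs) {
        this.defaultTimeoutSecs = defaultTimeoutSecs;
    }

    void setComponentTimeout(String componentId, Long timeoutSecs) {
        perComponent.put(componentId, timeoutSecs);
    }

    /** Returns the override if one was set (even if null), otherwise the default. */
    Long timeoutFor(String componentId) {
        return perComponent.containsKey(componentId)
                ? perComponent.get(componentId)
                : defaultTimeoutSecs;
    }

    /** A component is hanging only if checks are enabled and the timeout has elapsed. */
    boolean isHanging(String componentId, long lastProgressSecs, long nowSecs) {
        Long timeout = timeoutFor(componentId);
        return timeout != null && nowSecs - lastProgressSecs > timeout;
    }
}
```

Extending the timeout from user code would, in this sketch, amount to the caller passing a fresher `lastProgressSecs` value; how the pull request wires that into the output collectors is not shown here.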




[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2016-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15203177#comment-15203177
 ] 

ASF GitHub Bot commented on STORM-956:
--

Github user srdo commented on the pull request:

https://github.com/apache/storm/pull/1209#issuecomment-198878063
  
Will reopen this when the PR is closer to done




[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2016-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15203178#comment-15203178
 ] 

ASF GitHub Bot commented on STORM-956:
--

Github user srdo closed the pull request at:

https://github.com/apache/storm/pull/1209




[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2016-03-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196773#comment-15196773
 ] 

ASF GitHub Bot commented on STORM-956:
--

Github user hustfxj commented on the pull request:

https://github.com/apache/storm/pull/1209#issuecomment-197155218
  
@srdo I mean that we can report errors to Zookeeper whether or not the option 
is enabled. And the metrics should record not only the counter, but also the 
timeout time. 
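
One possible reading of this request is a metric that tracks both how many hangs were detected and how long the executor had been stalled. The Java sketch below illustrates that reading only; `HangMetrics` is a hypothetical class and does not use Storm's metrics API.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical hang metric: a counter plus the longest observed stall. */
class HangMetrics {
    private final AtomicLong hangCount = new AtomicLong();
    private final AtomicLong longestStallMs = new AtomicLong();

    /** Record one detected hang and how long the executor had made no progress. */
    void recordHang(long stalledForMs) {
        hangCount.incrementAndGet();
        // Keep the maximum of the stored value and the new observation.
        longestStallMs.accumulateAndGet(stalledForMs, Math::max);
    }

    long hangCount() {
        return hangCount.get();
    }

    long longestStallMs() {
        return longestStallMs.get();
    }
}
```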




[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2016-03-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196041#comment-15196041
 ] 

ASF GitHub Bot commented on STORM-956:
--

Github user srdo commented on the pull request:

https://github.com/apache/storm/pull/1209#issuecomment-196988005
  
I'll soon add some tests verifying that errors go into Zookeeper when 
executors hang and that workers are killed only if the option is enabled, and 
will take a look at adding a function to OutputCollector for manually updating 
the last hang check timestamp. Could you elaborate on what you'd like this code 
to do regarding metrics? Is it just a counter of how many times a hanging 
executor has been found?




[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2016-03-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15192961#comment-15192961
 ] 

ASF GitHub Bot commented on STORM-956:
--

Github user hustfxj commented on the pull request:

https://github.com/apache/storm/pull/1209#issuecomment-196220384
  
A spout emits messages through SpoutOutputCollector's emit(). If lots of 
messages fail, the acker will trigger the SpoutOutputCollector to emit those 
failed messages again. This can lead to a deadlock: downstream bolts may be 
slow to handle messages, which blocks emit(), so the spout/acker thread blocks. 
Other messages sent by them then can't be handled by the acker, so the bolts 
block as well. This scenario could be called a "loop deadlock". I want to say 
that this PR is sound for this scenario, because it lets us find the deadlock 
in time.
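
The cyclic wait described above can be modelled with two bounded queues. The Java sketch below is a simplified assumption, not Storm's actual queueing: it supposes the spout/acker thread is the only consumer of `boltToAcker` and blocks when `spoutToBolt` is full, while the bolt is the only consumer of `spoutToBolt` and blocks when `boltToAcker` is full. `LoopDeadlockSketch` and both queue names are hypothetical.

```java
import java.util.concurrent.BlockingQueue;

/** Hypothetical model of the "loop deadlock": two threads each blocked on the other's full queue. */
class LoopDeadlockSketch {
    /**
     * Returns true when both bounded queues are full. Under the assumptions above,
     * a put() on either queue would then block, and neither thread could drain
     * the queue the other one is waiting on.
     */
    static boolean wouldDeadlock(BlockingQueue<String> spoutToBolt,
                                 BlockingQueue<String> boltToAcker) {
        return spoutToBolt.remainingCapacity() == 0
                && boltToAcker.remainingCapacity() == 0;
    }
}
```

A hang check like the one in this PR does not prevent this state, but it would surface it: neither thread records progress, so the timeout eventually fires.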




[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2016-03-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15192943#comment-15192943
 ] 

ASF GitHub Bot commented on STORM-956:
--

Github user hustfxj commented on the pull request:

https://github.com/apache/storm/pull/1209#issuecomment-196213692
  
It looks good. I also hope this can be done through both the metrics system 
and by writing an error into Zookeeper that would show up on the UI for the 
component that is stuck, as @revans2 and @bastiliu said. Then users can see 
for themselves what is happening.




[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2016-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15192494#comment-15192494
 ] 

ASF GitHub Bot commented on STORM-956:
--

Github user srdo commented on the pull request:

https://github.com/apache/storm/pull/1209#issuecomment-196028832
  
I don't want this merged yet, just posted to get feedback :)




[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2016-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15192484#comment-15192484
 ] 

ASF GitHub Bot commented on STORM-956:
--

GitHub user srdo opened a pull request:

https://github.com/apache/storm/pull/1209

STORM-956: When the execute() or nextTuple() hang on external resources, 
stop the Worker's heartbeat

The previous PR at https://github.com/apache/storm/pull/647 doesn't look 
active anymore. Having Storm tell you which components are backing up would 
still be a nice feature to have.

I've taken a look at implementing the suggestions from the previous PR, but 
I have a few questions.

The previous discussion seemed to point toward shutting down the worker 
when an executor is hanging. I'm guessing there's no nice way to just restart 
the hanging executors? Is it sufficient to call shutdown on the worker object 
from do-executor-heartbeats?

I'm not really sure what Constants/SYSTEM_EXECUTOR_ID is for? Should it be 
ignored when checking for hanging executors?

I'm hoping to add the zookeeper/metrics logging and shutdown functionality 
soon if this PR looks like it's going in the right direction.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srdo/storm STORM-956

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/storm/pull/1209.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1209


commit c0d1c4ef6ae0d1e144f5af85174d68d5a93eb06a
Author: chuanlei 
Date:   2015-07-22T07:37:28Z

stop worker heartbeat, when the executor threads hang-on

commit 16980a3e4e015865348afee7661157cc9a21525a
Author: chuanlei 
Date:   2015-07-22T08:55:39Z

add the setup-check! to mk-threads

commit 9884c578fe8fa85197b1e5d4118598425160bb3f
Author: Stig Døssing 
Date:   2016-03-13T14:57:27Z

Merge branch 'master' of https://github.com/apache/storm into STORM-956

commit 9dd030396b0d921f25c5269e17c58b649387211d
Author: Stig Døssing 
Date:   2016-03-13T18:58:29Z

STORM-956: Add support for warning about hanging executors






[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2015-11-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15013674#comment-15013674
 ] 

ASF GitHub Bot commented on STORM-956:
--

Github user revans2 commented on the pull request:

https://github.com/apache/storm/pull/647#issuecomment-158085350
  
To make a feature like this work for Storm, we need to give people a lot 
of control over it.  I personally would prefer to see something where the 
Bolt/Spout can indicate life without completely processing a tuple.  Probably 
an API in OutputCollector that would allow it to heartbeat, in effect saying 
"don't shoot me".  Any other interaction with an OutputCollector should also 
prevent the component from being shot.

The timeout should be configurable on a per-component basis, including 
turning it off.  Having all of them set to the same cluster-wide setting is 
not good.  Also, if we detect that we are in a bad state there is no reason to 
stop heart-beating and wait for a supervisor to shoot us.  If we think we are 
bad, just exit.  It will let recovery happen much faster.

I agree with @bastiliu too that one of the options should be not to shoot 
ourselves, but to alert that we are seeing problems.  I would like to see this 
done through both the metrics system and through writing an error into 
zookeeper that would show up on the UI for the component that is stuck.  The 
metrics would allow for automated alerting and the UI would let users manually 
see what is happening.
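
A minimal Java sketch of the kind of control described in this comment, assuming a hypothetical `ComponentLiveness` helper. None of these names exist in Storm's API; the "report alive" call stands in for the proposed OutputCollector heartbeat, and the per-component policy covers both disabling the timeout and choosing between warning and exiting.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch; none of these names are part of Storm's API. */
public class ComponentLiveness {
    public enum Action { NONE, WARN, EXIT }

    private static final class Policy {
        final long timeoutMillis; // a value <= 0 disables the check
        final Action onTimeout;

        Policy(long timeoutMillis, Action onTimeout) {
            this.timeoutMillis = timeoutMillis;
            this.onTimeout = onTimeout;
        }
    }

    private final Map<String, Policy> policies = new ConcurrentHashMap<>();
    private final Map<String, Long> lastAlive = new ConcurrentHashMap<>();

    /** Per-component configuration instead of one cluster-wide setting. */
    public void configure(String componentId, long timeoutMillis, Action onTimeout) {
        policies.put(componentId, new Policy(timeoutMillis, onTimeout));
    }

    /** Would be called on emit/ack, or explicitly by a component that is still busy. */
    public void reportAlive(String componentId, long nowMillis) {
        lastAlive.put(componentId, nowMillis);
    }

    /** Returns what the worker should do about this component right now. */
    public Action check(String componentId, long nowMillis) {
        Policy policy = policies.get(componentId);
        Long last = lastAlive.get(componentId);
        if (policy == null || policy.timeoutMillis <= 0 || last == null) {
            return Action.NONE;
        }
        return nowMillis - last > policy.timeoutMillis ? policy.onTimeout : Action.NONE;
    }

    public static void main(String[] args) {
        ComponentLiveness liveness = new ComponentLiveness();
        liveness.configure("db-bolt", 1_000L, Action.WARN);
        liveness.configure("slow-bolt", 0L, Action.EXIT); // timeout turned off
        liveness.reportAlive("db-bolt", 0L);
        liveness.reportAlive("slow-bolt", 0L);
        System.out.println(liveness.check("db-bolt", 1_500L));    // WARN
        System.out.println(liveness.check("slow-bolt", 999_999L)); // NONE
    }
}
```

In this sketch a WARN result would map to the metrics and zookeeper error reporting mentioned above, and EXIT to the worker exiting immediately rather than waiting for the supervisor.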




[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2015-11-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15011248#comment-15011248
 ] 

ASF GitHub Bot commented on STORM-956:
--

Github user revans2 commented on the pull request:

https://github.com/apache/storm/pull/647#issuecomment-157757168
  
I personally can see both sides of this.  There are situations where a 
bolt/spout may be hung because of a bug, like a thread deadlock in some 
external client, and restarting the worker will fix the issue.  But I agree 
that this should be a very rare situation.  

I don't get the argument that this is expensive if we get it wrong.  We 
already fail fast if some bolt or spout throws an unexpected exception.  I 
don't see this being that different.

A bolt or spout not able to process anything for 5 mins seems like an OK 
time to see if we can restart things, especially for a low latency framework.  
I personally am +1 on the concept of having timeouts.  I would like to see some 
changes to the implementation of this patch, but there is no reason to go into 
that if @kishorvpatil has a -1 on even the idea of it.  @kishorvpatil and 
@bastiliu have I swayed you at all with my argument?




[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2015-11-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15011911#comment-15011911
 ] 

ASF GitHub Bot commented on STORM-956:
--

Github user kishorvpatil commented on the pull request:

https://github.com/apache/storm/pull/647#issuecomment-157856771
  
@revans2 Yes, you have convinced me conceptually that this is an option where 
an executor could be truly deadlocked. But implementation-wise, I think we need 
better control around how/when we choose to shoot the worker itself. 




[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2015-11-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012552#comment-15012552
 ] 

ASF GitHub Bot commented on STORM-956:
--

Github user hustfxj commented on the pull request:

https://github.com/apache/storm/pull/647#issuecomment-157918575
  
@revans2 For now I am personally undecided on the concept of having timeouts. 
Sometimes execute() may need a long time because of data arrival.  
Of course, having timeouts may be a good choice. But the restart or other 
operations should depend on the users themselves. So can we do it with 
Topology Hooks? 




[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2015-11-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012690#comment-15012690
 ] 

ASF GitHub Bot commented on STORM-956:
--

Github user bastiliu commented on the pull request:

https://github.com/apache/storm/pull/647#issuecomment-157935012
  
@revans2 Yes, I agree that this check is helpful for finding the problem 
spout/bolt. My point here is that the solution could be improved.
1. On timeout, it is better to raise a warning (e.g. show a warning on the web 
UI), because we have seen some topologies that need to block in 
execute()/nextTuple() to wait for some essential initialization. E.g. the 
connection to a database in a bolt is down, and the user would like to wait 
until the reconnection is done.  
2. The triggering mechanism of the "last-active-time" timeout should be 
updated. The current implementation puts a "last-active-time" tuple into the 
receive queue, and the spout/bolt updates the "last-active-time" when it 
retrieves the trigger tuple from the receive queue. But there may already be 
many tuples in the receive queue before the "last-active-time" trigger tuple 
is put in, so the spout/bolt must first process all the tuples that were 
queued ahead of the trigger tuple. Processing all of those tuples might take 
long enough to cause the timeout, even if the processing time of a single 
tuple is short. From the user's point of view, that is unexpected.
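
The second point can be illustrated with a small simulation. This is a sketch under simplifying assumptions (a single FIFO receive queue, a fixed per-tuple processing time, and hypothetical names throughout), not a model of Storm's actual queues.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical simulation of a trigger tuple queued behind a backlog. */
public class TriggerBehindBacklog {
    /** Returns the simulated time (ms) at which the trigger tuple is dequeued. */
    public static long triggerProcessedAt(int backlog, long perTupleMillis) {
        Deque<String> receiveQueue = new ArrayDeque<>();
        for (int i = 0; i < backlog; i++) {
            receiveQueue.addLast("tuple");
        }
        receiveQueue.addLast("trigger"); // enqueued after the existing backlog

        long clock = 0L;
        while (!receiveQueue.isEmpty()) {
            String item = receiveQueue.pollFirst();
            if (item.equals("trigger")) {
                return clock; // "last-active-time" would be updated here
            }
            clock += perTupleMillis; // every earlier tuple is processed first
        }
        throw new IllegalStateException("trigger tuple was never enqueued");
    }

    public static void main(String[] args) {
        long timeoutMillis = 180_000L; // 3 minutes
        long processedAt = triggerProcessedAt(100_000, 2L);
        // 100,000 tuples at 2 ms each delay the trigger by 200,000 ms.
        System.out.println(processedAt + " ms, timed out: " + (processedAt > timeoutMillis));
    }
}
```

With these assumed numbers the executor never spends more than 2 ms on a tuple, yet the trigger is handled after 200 seconds and a 3 minute timeout would fire.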




[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2015-11-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15009935#comment-15009935
 ] 

ASF GitHub Bot commented on STORM-956:
--

Github user kishorvpatil commented on the pull request:

https://github.com/apache/storm/pull/647#issuecomment-157558858
  
I think the spout and bolt should take care of handling hangs (or use 
timeouts instead of making blocking calls). Also, the spout/bolt code should 
guard against creating threads that can cause unhandled exceptions/hang-ups. 
Forcing the worker to not send heartbeats would kill the other components 
running on that worker, which is not desired.
Secondly, the worker should not be killed unless it is certain that the issue 
is in the process and not in an external service. E.g. if a kafka spout hangs, 
killing the worker forces it to be relaunched or rescheduled, which may not 
solve the problem: the new worker process would still make another blocking 
call and hang.

Thirdly, killing the worker will force a relaunch/reschedule, making the 
topology unstable as all other workers in the loop have to reconnect to this 
new worker. In large topologies that might become a bigger problem, lead to 
domino effects, and make the topology take longer to settle.

-1
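
As an illustration of the first point (using timeouts instead of blocking calls inside a spout or bolt), here is a minimal sketch using only `java.util.concurrent`. The `BoundedCall` class and `callWithTimeout` helper are hypothetical names, and how a component would react to an empty result is left open.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper that bounds a blocking call to an external resource. */
public class BoundedCall {
    /** Runs the call on the pool and gives up after timeoutMillis. */
    public static <T> Optional<T> callWithTimeout(
            ExecutorService pool, Callable<T> call, long timeoutMillis)
            throws InterruptedException {
        Future<T> future = pool.submit(call);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck call
            return Optional.empty();
        } catch (ExecutionException e) {
            throw new IllegalStateException("external call failed", e.getCause());
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Optional<String> fast = callWithTimeout(pool, () -> "row", 500L);
            Optional<String> slow = callWithTimeout(pool, () -> {
                Thread.sleep(5_000L); // stands in for a hung external service
                return "late";
            }, 100L);
            System.out.println(fast.isPresent() + " " + slow.isPresent());
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Note that `cancel(true)` only helps when the underlying call responds to interruption; a client library that ignores interrupts would still leave its thread stuck.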




[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2015-11-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006177#comment-15006177
 ] 

ASF GitHub Bot commented on STORM-956:
--

Github user bastiliu commented on the pull request:

https://github.com/apache/storm/pull/647#issuecomment-156900352
  
I think it is better to raise a warning instead of stopping the worker 
heartbeat updates. There are some topologies that might need to block in 
execute()/nextTuple() to wait for some essential initialization (e.g. 
establishing a connection to a database).
As for the mechanism that triggers the hang detection, there is a potential 
risk that the worker might be restarted unexpectedly due to heartbeat timeout, 
because we consume a batch of tuples each time. Even though the time to 
process one tuple is short, the time to process the batch could cause the 
timeout.




[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2015-11-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004164#comment-15004164
 ] 

ASF GitHub Bot commented on STORM-956:
--

Github user hustfxj commented on the pull request:

https://github.com/apache/storm/pull/647#issuecomment-156471890
  
Maybe this is not good; the time-out check on execute() should depend on 
the user.




[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

2015-07-27 Thread Chuanlei Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14642547#comment-14642547
 ] 

Chuanlei Ni commented on STORM-956:
---

Can any committer help me review this PR?

thanks!

 When the execute() or nextTuple() hang on external resources, stop the 
 Worker's heartbeat
 -

 Key: STORM-956
 URL: https://issues.apache.org/jira/browse/STORM-956
 Project: Apache Storm
  Issue Type: Improvement
Reporter: Chuanlei Ni
Assignee: Chuanlei Ni
Priority: Minor
   Original Estimate: 6h
  Remaining Estimate: 6h

 Sometimes the work threads produced by mk-threads in executor.clj hang on 
 external resources or other unknown reasons. This makes the workers stop 
 processing the tuples.  I think it is better to kill this worker to resolve 
 the hang. I plan to :
 1. like `setup-ticks`, send a system-tick to receive-queue
 2. the tuple-action-fn deal with this system-tick and remember the time that 
 processes this tuple in the executor-data
 3. when worker do local heartbeat, check the time the executor writes to 
 executor-data. If the time is long from current (for example, 3 minutes), the 
 worker does not do the heartbeat.  So the supervisor could deal with this 
 problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)