Re: Supervisor kills *all* workers for topology due to heartbeat :timed-out state
:timed-out means that the worker did not heartbeat to the supervisor in time. (This heartbeat is written to local disk.) Check that your workers have enough JVM heap space. If they don't, JVM garbage collection will cause progressively slower heartbeats until the supervisor thinks the workers are dead and kills them. Set:

topology.worker.childopts="-Xmx{{VALUE}}"

where {{VALUE}} is e.g. 2048m or 2g.
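If you submit your topology from Java, that can look something like the sketch below (class and topology names are placeholders; size the heap to your actual load):

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;

    public class SubmitWithBiggerHeap {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // ... register your spouts and bolts on the builder here ...

            Config conf = new Config();
            // Give each worker JVM enough heap that GC pauses stay short;
            // long pauses delay heartbeats until the supervisor gives up.
            conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx2g");

            StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
        }
    }

-- Derek

On 6/14/14, 22:39, Justin Workman wrote:
> [...]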
Re: Supervisor kills *all* workers for topology due to heartbeat :timed-out state
From what I have seen, if Nimbus kills and reassigns the worker process, the supervisor logs will report that the worker is in a "disallowed" state. I have seen the supervisor report a worker in a timed-out state and restart the worker processes, generally when the system is under heavy CPU load. We recently ran into this issue while running a topology on virtual machines; increasing the number of virtual cores assigned to the VMs resolved the restart issues.

Thanks,
Justin

Sent from my iPhone

On Jun 14, 2014, at 11:32 AM, Andrew Montalenti wrote:
> [...]
Supervisor kills *all* workers for topology due to heartbeat :timed-out state
I am trying to understand why, for a topology I am trying to run on 0.9.1-incubating, the supervisor on the machine periodically kills *all* of the topology's Storm workers. Whether I use topology.workers=1, 2, 4, or 8, I always get logs like this:

https://gist.github.com/amontalenti/cd7f380f716f1fd17e1b

These logs basically indicate that the supervisor thinks all the workers timed out at exactly the same time, and then it kills them all. I've tried tweaking the worker timeout seconds, bumping it up to e.g. 120 secs, but that hasn't helped at all. No matter what, the workers periodically get whacked by the supervisor and the whole topology has to restart.

I notice that this happens less frequently if the machine is under less load; e.g., if I drop topology.max.spout.pending *way* down, to 100 or 200, it runs for a while without crashing. But I've even seen it crash in that state.

I saw on some other threads people indicating that the supervisor will kill all workers if "the nimbus fails to see a heartbeat from zookeeper". Could someone walk me through how I could figure out whether this is the case? Nothing in the logs seems to point me in that direction.

Thanks!
Andrew
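P.S. For reference, here is roughly how I have been setting the knobs mentioned above when building the topology config. This is just a sketch with the values from my experiments; in particular, supervisor.worker.timeout.secs is normally a cluster-side setting in storm.yaml, so I am not certain a per-topology override even takes effect.

    import backtype.storm.Config;

    public class TimeoutTuningConf {
        // Conf with the settings discussed above; the values are the ones
        // I have been experimenting with, not recommendations.
        static Config buildConf() {
            Config conf = new Config();
            conf.setNumWorkers(4);        // tried 1, 2, 4, and 8
            conf.setMaxSpoutPending(200); // dropping this way down helps somewhat
            // supervisor.worker.timeout.secs is read by the supervisor from
            // storm.yaml; setting it per-topology here may be a no-op.
            conf.put(Config.SUPERVISOR_WORKER_TIMEOUT_SECS, 120);
            return conf;
        }
    }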