Re: Supervisor kills *all* workers for topology due to heartbeat :timed-out state
:timed-out means that the worker did not heartbeat to the supervisor in time. (This heartbeat is written to local disk.) Check that your workers have enough JVM heap space. If they don't, JVM garbage collection will cause progressively slower heartbeats until the supervisor thinks the workers are dead and kills them. Set:

topology.worker.childopts="-Xmx{{VALUE}}"

where {{VALUE}} is e.g. 2048m or 2g.
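If you submit your topology from Java, that can look something like the sketch below (class and topology names are placeholders; size the heap to your actual load):

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;

    public class SubmitWithBiggerHeap {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // ... register your spouts and bolts on the builder here ...

            Config conf = new Config();
            // Give each worker JVM enough heap that GC pauses stay short;
            // long pauses delay heartbeats until the supervisor gives up.
            conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx2g");

            StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
        }
    }

-- Derek

On 6/14/14, 22:39, Justin Workman wrote:
> [...]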
Re: Supervisor kills *all* workers for topology due to heartbeat :timed-out state
From what I have seen, if Nimbus kills and reassigns the worker process, the supervisor logs will report that the worker is in a "disallowed" state. I have seen the supervisor report a worker in a timed-out state and restart the worker processes, generally when the system is under heavy CPU load. We recently ran into this issue while running a topology on virtual machines; increasing the number of virtual cores assigned to the VMs resolved the restart issues.

Thanks,
Justin

Sent from my iPhone

On Jun 14, 2014, at 11:32 AM, Andrew Montalenti wrote:
> [...]
Supervisor kills *all* workers for topology due to heartbeat :timed-out state
I am trying to understand why, for a topology I am trying to run on 0.9.1-incubating, the supervisor on the machine periodically kills *all* of the topology's Storm workers. Whether I use topology.workers=1, 2, 4, or 8, I always get logs like this:

https://gist.github.com/amontalenti/cd7f380f716f1fd17e1b

These logs basically indicate that the supervisor thinks all the workers timed out at exactly the same time, and then it kills them all. I've tried tweaking the worker timeout seconds, bumping it up to e.g. 120 secs, but that hasn't helped at all. No matter what, the workers periodically get whacked by the supervisor and the whole topology has to restart.

I notice that this happens less frequently if the machine is under less load; e.g., if I drop topology.max.spout.pending *way* down, to 100 or 200, it runs for a while without crashing. But I've even seen it crash in that state.

I saw on some other threads people indicating that the supervisor will kill all workers if "the nimbus fails to see a heartbeat from zookeeper". Could someone walk me through how I could figure out whether this is the case? Nothing in the logs seems to point me in that direction.

Thanks!
Andrew
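P.S. For reference, here is roughly how I have been setting the knobs mentioned above when building the topology config. This is just a sketch with the values from my experiments; in particular, supervisor.worker.timeout.secs is normally a cluster-side setting in storm.yaml, so I am not certain a per-topology override even takes effect.

    import backtype.storm.Config;

    public class TimeoutTuningConf {
        // Conf with the settings discussed above; the values are the ones
        // I have been experimenting with, not recommendations.
        static Config buildConf() {
            Config conf = new Config();
            conf.setNumWorkers(4);        // tried 1, 2, 4, and 8
            conf.setMaxSpoutPending(200); // dropping this way down helps somewhat
            // supervisor.worker.timeout.secs is read by the supervisor from
            // storm.yaml; setting it per-topology here may be a no-op.
            conf.put(Config.SUPERVISOR_WORKER_TIMEOUT_SECS, 120);
            return conf;
        }
    }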