Set logging to INFO level. The reason for the kill is logged every time it happens. Sorry, I don't have any example log lines to share....
You have to look at 3 logs:

* nimbus - will say that it is killing a task/executor. As I recall, you then have to work out which supervisor that task/executor maps to.
* supervisor - will say that it is killing a worker and will list the worker's status. As I recall there are 2 possible statuses; I can't remember what they are called.. :(
** One status means that the supervisor hasn't heard from the worker. It lists 2 times, in milliseconds; when you subtract one from the other, the difference equals the setting supervisor.worker.timeout.secs (e.g. a difference of 30,000 ms matches the default of 30 seconds).
** The other status means that nimbus told the supervisor to kill the worker holding the task, and the supervisor then killed the worker process.
* worker - if it is killed by a supervisor, there will be no logging about it. The supervisor simply issues a kill, so the worker log just ends. If the worker decided to shut itself down, the log will say why; a worker might shut itself down if, say, there was an unhandled exception somewhere.

From https://github.com/apache/storm/blob/master/conf/defaults.yaml :

nimbus.task.timeout.secs: 30
nimbus.supervisor.timeout.secs: 60
#how long supervisor will wait to ensure that a worker process is started
supervisor.worker.start.timeout.secs: 120
#how long between heartbeats until supervisor considers that worker dead and tries to restart it
supervisor.worker.timeout.secs: 30

(A sketch of overriding these in storm.yaml follows below, after the quoted thread.)

Thank you for your time!

+++++++++++++++++++++
Jeff Maass <maas...@gmail.com>
linkedin.com/in/jeffmaass
stackoverflow.com/users/373418/maassql
+++++++++++++++++++++

On Thu, May 28, 2015 at 8:47 PM, Fang Chen <fc2...@gmail.com> wrote:

> Did you find any solutions to this?
>
> I ran into exactly the same situation with 0.9.4. I have a testing Kafka
> topic with around 10m tuples, and the supervisor started to kill its worker
> (the first time around 15 minutes in, then again after 2 minutes or
> 6 minutes, although I have set supervisor.worker.start.timeout.secs and
> supervisor.worker.timeout.secs to 40 minutes each).
>
> I then did another experiment: right after the topology was running, I
> killed all supervisors, and the workers could finish all tuples without
> issues.
>
> On Thu, Apr 16, 2015 at 11:56 AM, Grant Overby (groverby) <
> grove...@cisco.com> wrote:
>
>> I'm not, and if I had to guess I'd say it's likely something is going
>> wrong with the heartbeats, but how can I go about finding out?
>>
>> From: Paul Poulosky <ppoul...@yahoo-inc.com>
>> Reply-To: "user@storm.apache.org" <user@storm.apache.org>, Paul Poulosky
>> <ppoul...@yahoo-inc.com>
>> Date: Thursday, April 16, 2015 at 2:38 PM
>> To: "user@storm.apache.org" <user@storm.apache.org>
>> Subject: Re: Supervisor repeatedly killing worker
>>
>> 10.0.1.5
>>
>
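As promised above, a minimal sketch of how those timeouts could be raised in conf/storm.yaml. The keys are the ones from defaults.yaml; the values and comments below are only illustrative, not a recommendation:

# conf/storm.yaml on the nimbus and supervisor machines (illustrative values only)
# give a worker more time to start before the supervisor gives up on it
supervisor.worker.start.timeout.secs: 300
# allow a longer gap between worker heartbeats before the supervisor restarts the worker
supervisor.worker.timeout.secs: 60
# allow a longer gap between task heartbeats before nimbus reassigns the task
nimbus.task.timeout.secs: 60
# allow a longer gap between supervisor heartbeats before nimbus gives up on the supervisor
nimbus.supervisor.timeout.secs: 120

As far as I know these are daemon-side settings, so they belong in the storm.yaml read by nimbus and the supervisors (not in the topology config), and the daemons need a restart to pick them up. Raising them only buys time, though; it doesn't explain why the heartbeats stopped, which is what the logs above should tell you.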