To be clear, the supervisor wasn't shutting itself down; it was killing the worker process.

Also for clarity, the :disallowed state simply means that a new task->worker assignment has been computed by the nimbus, so the supervisor has noticed from the ZK state that the currently running worker is no longer supposed to be running. The :timed-out state, on the other hand, *definitely* means the worker was hung and couldn't write to its local heartbeat file within the timeout (30 secs by default). As for *why*... that is hard to determine.
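The timeouts in play here are all configurable. A rough sketch of where they live (constant names from backtype.storm.Config in 0.10; the values shown are only the usual defaults, so check them against your defaults.yaml / storm.yaml):

import backtype.storm.Config;

// Sketch only: the settings that govern the behaviour described above,
// shown with their usual 0.10 defaults. Real values come from defaults.yaml
// plus whatever your storm.yaml overrides.
public class HeartbeatTimeouts {
    public static Config usualDefaults() {
        Config conf = new Config();
        // ZK session timeout -- the "timeout of 20000ms exceeded" the ZK
        // server logs when it expires a session.
        conf.put(Config.STORM_ZOOKEEPER_SESSION_TIMEOUT, 20000);
        // Local heartbeat timeout: if the worker doesn't refresh its local
        // heartbeat within this window, the supervisor kills it (:timed-out).
        conf.put(Config.SUPERVISOR_WORKER_TIMEOUT_SECS, 30);
        // How long nimbus tolerates missing executor (ZK) heartbeats before
        // it declares an executor "not alive" and recomputes assignments.
        conf.put(Config.NIMBUS_TASK_TIMEOUT_SECS, 30);
        return conf;
    }
}

The 20000ms matches the "timeout of 20000ms exceeded" lines in the ZK logs quoted below, and the 30 seconds is the default mentioned above.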
- Erik

On Sunday, May 1, 2016, Kevin Conaway <kevin.a.cona...@gmail.com> wrote:

> Thanks Erik. We're using Storm 0.10, so Pacemaker doesn't come into play here.
>
> 3. Worker process died due to exception (this is almost always what we see)
> 4. Worker process hung (eg doing GC).
>
> I don't think it's either of these, TBH. There were no abnormal terminations in the logs (or in the Storm error logs). I have the GC metrics that were reported to Graphite, and the GC time was fairly low (~50ms).
>
> My feeling is that it is ZooKeeper related and not worker related, because it appears that ZK also expired sessions from the supervisor nodes:
>
> I saw the following messages in the log of one of the ZK nodes:
>
> 2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e80076, timeout of 20000ms exceeded
> 2016-04-30 01:33:46,005 [myid:4] - INFO [ProcessThread(sid:4 cport:-1)::PrepRequestProcessor@476] - Processed session termination for sessionid: 0x45453e198e80076
> 2016-04-30 01:33:46,009 [myid:4] - INFO [CommitProcessor:4:NIOServerCnxn@1001] - Closed socket connection for client /<ip>:44500 which had sessionid 0x45453e198e80076
>
> And some corresponding log messages in one of the supervisor log files (for the same ZK session ID). It looks like this occurred right as the supervisor was shutting down due to Nimbus restarting it:
>
> 2016-04-30 01:34:24.531 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to read additional data from server sessionid 0x45453e198e80076, likely server has closed socket, closing socket connection and attempting reconnect
> 2016-04-30 01:34:24.594 b.s.d.supervisor [INFO] Shutting down and clearing state for id 4e6934f9-88e4-4506-9d6e-25db72be941b. Current supervisor time: 1461980064. State: :timed-out, Heartbeat: {:time-secs 1461980008, :storm-id "<topology>", :executors [[62 62] [2 2] [38 38] [98 98] [50 50] [-1 -1] [170 170] [86 86] [194 194] [158 158] [122 122] [74 74] [26 26] [146 146] [110 110] [14 14] [182 182] [134 134]], :port 6700}
> 2016-04-30 01:34:24.596 b.s.d.supervisor [INFO] Shutting down 30a7e66c-ff0c-419a-9f50-eb44dc6b1ceb:4e6934f9-88e4-4506-9d6e-25db72be941b
> 2016-04-30 01:34:24.599 b.s.config [INFO] GET worker-user 4e6934f9-88e4-4506-9d6e-25db72be941b
> 2016-04-30 01:34:24.632 b.s.d.supervisor [INFO] Sleep 1 seconds for execution of cleanup threads on worker.
> 2016-04-30 01:34:24.637 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: SUSPENDED
> 2016-04-30 01:34:24.652 b.s.cluster [WARN] Received event :disconnected::none: with disconnected Zookeeper.
> 2016-04-30 01:34:31.121 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server <zk host>:2181. Will not attempt to authenticate using SASL (unknown error)
> 2016-04-30 01:34:31.150 o.a.s.s.o.a.z.ClientCnxn [INFO] Socket connection established to <zk host>:2181, initiating session
> 2016-04-30 01:34:31.198 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to reconnect to ZooKeeper service, session 0x45453e198e80076 has expired, closing socket connection
> 2016-04-30 01:34:31.199 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: LOST
> 2016-04-30 01:34:31.203 b.s.cluster [WARN] Received event :expired::none: with disconnected Zookeeper.
> 2016-04-30 01:34:31.203 o.a.s.s.o.a.c.ConnectionState [WARN] Session expired event received
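The o.a.s.s.o.a.c.f.s and o.a.s.s.o.a.z prefixes in those supervisor log lines are Storm's shaded copies of Curator and ZooKeeper, so the SUSPENDED and LOST entries are Curator connection-state events. The :timed-out shutdown above is also consistent with the 30-second default: the heartbeat the supervisor last saw was at 1461980008 against a supervisor time of 1461980064, i.e. 56 seconds stale. For what it's worth, a bare-bones Curator client that surfaces the same transitions looks roughly like this (plain Curator 2.x outside of Storm; the host and timeouts are placeholders chosen to match the thread):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Sketch only: a bare Curator client that logs the same SUSPENDED/LOST
// transitions the supervisor printed. "zkhost" is a placeholder.
public class ZkStateWatcher {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zkhost:2181",
                20000,  // session timeout -- the 20000ms the ZK server enforced
                15000,  // connection timeout
                new ExponentialBackoffRetry(1000, 5));
        client.getConnectionStateListenable().addListener(new ConnectionStateListener() {
            @Override
            public void stateChanged(CuratorFramework c, ConnectionState newState) {
                // SUSPENDED: the TCP connection dropped, but the session may still be alive.
                // LOST: the session is gone as far as Curator is concerned
                // (here it followed the server-side "session has expired").
                System.out.println("State change: " + newState);
            }
        });
        client.start();
        Thread.sleep(Long.MAX_VALUE);  // keep running to observe transitions
    }
}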
> On Sat, Apr 30, 2016 at 9:22 PM, Erik Weathers <eweath...@groupon.com> wrote:
>
>> (I don't know anything about the Pacemaker service introduced in Storm 1.0, so this statement is pre-1.0.)
>>
>> The executor threads within the worker processes write heartbeats to ZooKeeper. If they aren't successfully heartbeating then it could be many things:
>>
>> 1. ZK too busy? (Seems unlikely)
>> 2. Network too busy? (Seems unlikely)
>> 3. Worker process died due to exception (this is almost always what we see)
>> 4. Worker process hung (eg doing GC). (This would usually first be caught by the supervisor on that host, since it checks a local heartbeat file that the worker normally writes to every second -- if the heartbeat doesn't get refreshed before the timeout, then the supervisor kills the worker process with State being :timed-out (or :time-out, something like that). This of course depends on the various timeout config values you have on the worker and nimbus hosts.)
>>
>> - Erik
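To make the check in (4) a bit more concrete -- purely a conceptual sketch, not the actual supervisor code (which is Clojure in 0.10 and tracks a serialized heartbeat under the supervisor's local state rather than a raw file timestamp; the field and file names here are made up):

import java.io.File;
import java.util.concurrent.TimeUnit;

// Conceptual illustration only -- not the actual supervisor code.
// The real supervisor reads a serialized heartbeat from local state;
// the path and logic here are simplified and hypothetical.
public class WorkerHeartbeatCheck {
    static final long WORKER_TIMEOUT_SECS = 30;  // supervisor.worker.timeout.secs

    static boolean timedOut(File heartbeatFile) {
        long ageSecs = TimeUnit.MILLISECONDS.toSeconds(
                System.currentTimeMillis() - heartbeatFile.lastModified());
        // The worker refreshes its local heartbeat roughly every second; if it
        // goes stale past the timeout, the supervisor marks the worker
        // :timed-out and kills the process.
        return ageSecs > WORKER_TIMEOUT_SECS;
    }
}

Note that this check is entirely local to the supervisor host; it doesn't involve ZooKeeper at all.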
>> On Saturday, April 30, 2016, Kevin Conaway <kevin.a.cona...@gmail.com> wrote:
>>
>>> We are using Storm 0.10 and we noticed that Nimbus decided to restart our topology. From researching past threads it seems like this is related to not receiving heartbeats from the supervisors but I'm unsure if this was the case. Our topology was mostly idle at the time that the restart was triggered.
>>>
>>> We have a 5 node Zookeeper (3.4.5) ensemble. On one of the ZK nodes, I saw the following messages at the time of the restart:
>>>
>>> 2016-04-30 01:33:46,001 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e8007f, timeout of 20000ms exceeded
>>> 2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x25453e1c2640085, timeout of 20000ms exceeded
>>> 2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e80076, timeout of 20000ms exceeded
>>> 2016-04-30 01:33:48,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x35453e1a529008b, timeout of 20000ms exceeded
>>> 2016-04-30 01:33:50,001 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x15453e198d10084, timeout of 20000ms exceeded
>>> 2016-04-30 01:33:50,002 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x35453e1a5290090, timeout of 20000ms exceeded
>>> 2016-04-30 01:33:50,002 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x15453e198d1008e, timeout of 20000ms exceeded
>>>
>>> In the nimbus log, there was the following log message:
>>>
>>> 2016-04-30 01:34:00.734 b.s.d.nimbus [INFO] Executor <topology>:[8 8] not alive
>>>
>>> Shortly thereafter, the supervisors started restarting the workers. The following log message was in the supervisor log:
>>>
>>> 2016-04-30 01:34:00.855 b.s.d.supervisor [INFO] Shutting down and clearing state for id 10ed4848-05f7-48e5-bf2a-736d12f208ed. Current supervisor time: 1461980040. State: :disallowed, Heartbeat: {:time-secs 1461980040, :storm-id "<topology>", :executors [[111 111] [75 75] [51 51] [3 3] [39 39] [159 159] [123 123] [63 63] [-1 -1] [147 147] [27 27] [87 87] [171 171] [195 195] [135 135] [15 15] [99 99] [183 183]], :port 6700}
>>>
>>> Previous threads have suggested that this was due to heavy GC causing the heartbeats not to reach Zookeeper but the topology was idle at this time so I don't think GC was the culprit. The GC par new time was about 50ms on each node (as reported to Graphite).
>>>
>>> Thoughts on what could have happened here and how to debug further?
>>>
>>> --
>>> Kevin Conaway
>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>> https://github.com/kevinconaway
>
> --
> Kevin Conaway
> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
> https://github.com/kevinconaway