Looking at the Storm worker thread dumps, the bolt and spout components all appear to be waiting for the topology to become active (including the __system and __acker components):
"Thread-37-__acker" #70 prio=5 os_prio=0 tid=0x00007f6feca57000 nid=0x3a8d waiting on condition [0x00007f6f54ef2000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at backtype.storm.daemon.executor$fn__5694$fn__5707.invoke(executor.clj:713) at backtype.storm.util$async_loop$fn__545.invoke(util.clj:477) at clojure.lang.AFn.run(AFn.java:22) at java.lang.Thread.run(Thread.java:745) ... "Thread-35-my-bolt" #68 prio=5 os_prio=0 tid=0x00007f6feca53000 nid=0x3a8b waiting on condition [0x00007f6f550f4000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at backtype.storm.daemon.executor$fn__5694$fn__5707.invoke(executor.clj:713) at backtype.storm.util$async_loop$fn__545.invoke(util.clj:477) at clojure.lang.AFn.run(AFn.java:22) at java.lang.Thread.run(Thread.java:745) ... "Thread-17-__system" #50 prio=5 os_prio=0 tid=0x00007f6feca31800 nid=0x3a79 waiting on condition [0x00007f6f563a7000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at backtype.storm.daemon.executor$fn__5694$fn__5707.invoke(executor.clj:713) at backtype.storm.util$async_loop$fn__545.invoke(util.clj:477) at clojure.lang.AFn.run(AFn.java:22) at java.lang.Thread.run(Thread.java:745) Looking in the store code, I see that they are sleeping here: (while (not @(:storm-active-atom executor-data)) (Thread/sleep 100)) At what point does the topology become "active"? On Mon, May 2, 2016 at 11:38 AM, Kevin Conaway <kevin.a.cona...@gmail.com> wrote: > Some more interesting info. > > As I mentioned in the last post, we have 6 workers. Each worker has 2 > slots. > > Only one of the supervisors received a state :timed-out message. After it > received the message, it restarted both workers successfully. > > The other 5 supervisors all experienced the same issue as I mentioned in > the last post. The workers were launched twice. Whats interesting is that > both workers on those 5 nodes came up the second time in a hung state, they > all had a lot of ERROR messages from the netty client about not being able > to connect to the other nodes: > > 2016-04-30 01:34:59.758 b.s.m.n.Client [ERROR] connection attempt 26 to > Netty-Client-StormWorker-111549876-5-117071490.<fqdn>/10.226.77.191:6700 > failed: java.net.ConnectException: Connection refused: > StormWorker-111549876-5-117071490.<fqdn>/10.226.77.191:6700 > It looks like the bolts and spouts never got fully prepared (as evidenced > by lack of log messages indicating so) > > On the one supervisor which _did_ receive the timeout message, the 2 > workers came up just fine. > > On Mon, May 2, 2016 at 11:21 AM, Kevin Conaway <kevin.a.cona...@gmail.com> > wrote: > >> Unfortunately we're not capturing disk i/o in our metrics, I can look in >> to doing that for next time. >> >> We're not capturing GC logs, we are using the graphite storm metric >> consumer to push metrics to graphite, one of which is the GC time from the >> default GC mxbean. >> >> > I'm assuming you're saying that multiple workers had state :timed-out >> at once? >> >> We have 6 workers. Only one had state :timed-out, the others had state >> :disallowed. Looking at one of the supervisors as an example, it looks >> like it received multiple requests to reschedule the worker which caused >> the worker to be launched multiple times. Is this normal? >> >> 2016-04-30 01:34:00.872 b.s.d.supervisor [INFO] Shutting down and >> clearing state for id 589b0ed5-c4e9-422f-a6c5-5d65145915f7. 
>>
>> 2016-04-30 01:34:00.874 b.s.d.supervisor [INFO] Shutting down 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:589b0ed5-c4e9-422f-a6c5-5d65145915f7
>>
>> 2016-04-30 01:34:02.013 b.s.d.supervisor [INFO] Shutting down and clearing state for id 460a7d66-06de-4ca5-9140-7d46dcdea841. Current supervisor time: 1461980040. State: :disallowed, Heartbeat: {:time-secs 1461980040, :storm-id "<topology>", :executors [[178 178] [58 58] [190 190] [118 118] [22 22] [142 142] [-1 -1] [166 166] [106 106] [70 70] [10 10] [46 46] [82 82] [154 154] [94 94] [34 34] [130 130]], :port 6701}
>>
>> 2016-04-30 01:34:02.014 b.s.d.supervisor [INFO] Shutting down 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:460a7d66-06de-4ca5-9140-7d46dcdea841
>>
>> 2016-04-30 01:34:03.095 b.s.d.supervisor [INFO] Launching worker with assignment {:storm-id "<topology>", :executors [[3 3] [33 33] [103 103] [163 163] [53 53] [73 73] [123 123] [43 43] [63 63] [23 23] [93 93] [153 153] [13 13] [193 193] [143 143] [83 83] [173 173] [133 133] [183 183] [113 113]]} for this supervisor 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86 on port 6700 with id 567a18fd-33d1-49b6-a3f4-ace65641bd67
>>
>> 2016-04-30 01:34:03.122 b.s.d.supervisor [INFO] Launching worker with assignment {:storm-id "<topology>", :executors [[8 8] [188 188] [68 68] [198 198] [178 178] [58 58] [118 118] [18 18] [28 28] [38 38] [98 98] [48 48] [148 148] [158 158] [128 128] [88 88] [138 138] [108 108] [168 168] [78 78]]} for this supervisor 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86 on port 6701 with id 9fcc869d-08d7-44ec-bde2-bf9ed86403e6
>>
>> 2016-04-30 01:34:41.322 b.s.d.supervisor [INFO] Shutting down and clearing state for id 567a18fd-33d1-49b6-a3f4-ace65641bd67. Current supervisor time: 1461980081. State: :disallowed, Heartbeat: {:time-secs 1461980080, :storm-id "<topology>", :executors [[3 3] [33 33] [103 103] [163 163] [53 53] [73 73] [123 123] [43 43] [63 63] [23 23] [93 93] [-1 -1] [153 153] [13 13] [193 193] [143 143] [83 83] [173 173] [133 133] [183 183] [113 113]], :port 6700}
>>
>> 2016-04-30 01:34:41.323 b.s.d.supervisor [INFO] Shutting down 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:567a18fd-33d1-49b6-a3f4-ace65641bd67
>>
>> 2016-04-30 01:34:42.353 b.s.d.supervisor [INFO] Shutting down and clearing state for id 9fcc869d-08d7-44ec-bde2-bf9ed86403e6. Current supervisor time: 1461980081. State: :disallowed, Heartbeat: {:time-secs 1461980080, :storm-id "<topology>", :executors [[8 8] [188 188] [68 68] [198 198] [178 178] [58 58] [118 118] [18 18] [28 28] [38 38] [98 98] [48 48] [-1 -1] [148 148] [158 158] [128 128] [88 88] [138 138] [108 108] [168 168] [78 78]], :port 6701}
>>
>> 2016-04-30 01:34:42.354 b.s.d.supervisor [INFO] Shutting down 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:9fcc869d-08d7-44ec-bde2-bf9ed86403e6
>>
>> On Sun, May 1, 2016 at 5:52 PM, Erik Weathers <eweath...@groupon.com> wrote:
>>
>>> Maybe disk I/O was high? Are you capturing GC logs to disk in unique files (you can sub in the PID and timestamp into the GC log filename)? I know you believe it's not responsible, but it's the only thing I've ever found to be responsible thus far. (Except for a problem in storm 0.9.3 with netty that has since been fixed -- we worked around that by downgrading to zero-MQ.) You might try monitoring the heartbeat files written by the workers to watch for the file creation to be happening less frequently than once per second.
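A rough sketch of that kind of heartbeat-file watcher, run on a supervisor host (it assumes worker heartbeats are written under <storm.local.dir>/workers/<worker-id>/heartbeats and that the files' mtimes track the once-per-second writes; both assumptions are worth verifying against your install):

import java.io.File;

// Polls each worker's heartbeats/ directory and warns when the newest
// heartbeat file has not been touched recently.
// Pass storm.local.dir as the only argument (hypothetical class name).
public class HeartbeatWatcher {
    private static final long STALE_MS = 5000;  // well above the ~1s write cadence

    public static void main(String[] args) throws InterruptedException {
        File workersDir = new File(args[0], "workers");
        while (true) {
            File[] workers = workersDir.listFiles();
            if (workers != null) {
                for (File worker : workers) {
                    if (!worker.isDirectory()) {
                        continue;
                    }
                    File[] beats = new File(worker, "heartbeats").listFiles();
                    long newest = 0;
                    if (beats != null) {
                        for (File beat : beats) {
                            newest = Math.max(newest, beat.lastModified());
                        }
                    }
                    long ageMs = System.currentTimeMillis() - newest;
                    if (newest == 0 || ageMs > STALE_MS) {
                        System.out.println(worker.getName() + ": heartbeat stale, last write " + ageMs + " ms ago");
                    }
                }
            }
            Thread.sleep(1000);
        }
    }
}

Run with the same storm.local.dir the supervisor uses (e.g. java HeartbeatWatcher /mnt/storm, path hypothetical); a healthy worker should stay well under the 5-second threshold.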
>>>
>>> > all of the worker sessions expired at the same time
>>>
>>> I'm assuming you're saying that multiple workers had state :timed-out at once? Was that on the same host? If the state is :disallowed, that is perfectly normal when the reassignment happens, as I described earlier.
>>>
>>> - Erik
>>>
>>> On Sunday, May 1, 2016, Kevin Conaway <kevin.a.cona...@gmail.com> wrote:
>>>
>>>> Any tips on where to continue investigating or other metrics to capture?
>>>>
>>>> As I mentioned before, the topology was mostly idle: low CPU usage, low GC time (CMS ParNew), stable heap, no eth errors. It's hard to see why all of the worker sessions expired at the same time.
>>>>
>>>> On Sunday, May 1, 2016, Erik Weathers <eweath...@groupon.com> wrote:
>>>>
>>>>> To be clear, the supervisor wasn't shutting itself down, it was killing the worker process.
>>>>>
>>>>> Also for clarity, the :disallowed state simply means a new task->worker assignment has been computed by the nimbus, and so the supervisor has noticed from the ZK state that the currently running worker is no longer supposed to be running.
>>>>>
>>>>> Regarding the :timed-out state, it *definitely* means the worker was hung and couldn't write to the local heartbeat file within the timeout (30 secs by default). As for *why*... that is hard to determine.
>>>>>
>>>>> - Erik
>>>>>
>>>>> On Sunday, May 1, 2016, Kevin Conaway <kevin.a.cona...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Erik. We're using Storm 0.10 so Pacemaker doesn't come into play here.
>>>>>>
>>>>>> 3. Worker process died due to exception (this is almost always what we see)
>>>>>> 4. Worker process hung (eg doing GC).
>>>>>>
>>>>>> I don't think it's either of these, TBH. There were no abnormal terminations in the logs (or in the Storm error logs). I have the GC metrics that were reported to Graphite and the GC time was fairly low (~50ms).
>>>>>>
>>>>>> My feeling is that it is ZooKeeper related and not worker related, because it appears that ZK also expired sessions from the supervisor nodes:
>>>>>>
>>>>>> I saw the following messages in the log of one of the ZK nodes:
>>>>>>
>>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e80076, timeout of 20000ms exceeded
>>>>>>
>>>>>> 2016-04-30 01:33:46,005 [myid:4] - INFO [ProcessThread(sid:4 cport:-1)::PrepRequestProcessor@476] - Processed session termination for sessionid: 0x45453e198e80076
>>>>>>
>>>>>> 2016-04-30 01:33:46,009 [myid:4] - INFO [CommitProcessor:4:NIOServerCnxn@1001] - Closed socket connection for client /<ip>:44500 which had sessionid 0x45453e198e80076
>>>>>>
>>>>>> And some corresponding log messages in one of the supervisor log files (for the same ZK session ID). It looks like this occurred right as the supervisor was shutting down due to Nimbus restarting it:
>>>>>>
>>>>>> 2016-04-30 01:34:24.531 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to read additional data from server sessionid 0x45453e198e80076, likely server has closed socket, closing socket connection and attempting reconnect
>>>>>>
>>>>>> 2016-04-30 01:34:24.594 b.s.d.supervisor [INFO] Shutting down and clearing state for id 4e6934f9-88e4-4506-9d6e-25db72be941b. Current supervisor time: 1461980064. State: :timed-out, Heartbeat: {:time-secs 1461980008, :storm-id "<topology>", :executors [[62 62] [2 2] [38 38] [98 98] [50 50] [-1 -1] [170 170] [86 86] [194 194] [158 158] [122 122] [74 74] [26 26] [146 146] [110 110] [14 14] [182 182] [134 134]], :port 6700}
>>>>>>
>>>>>> 2016-04-30 01:34:24.596 b.s.d.supervisor [INFO] Shutting down 30a7e66c-ff0c-419a-9f50-eb44dc6b1ceb:4e6934f9-88e4-4506-9d6e-25db72be941b
>>>>>>
>>>>>> 2016-04-30 01:34:24.599 b.s.config [INFO] GET worker-user 4e6934f9-88e4-4506-9d6e-25db72be941b
>>>>>>
>>>>>> 2016-04-30 01:34:24.632 b.s.d.supervisor [INFO] Sleep 1 seconds for execution of cleanup threads on worker.
>>>>>>
>>>>>> 2016-04-30 01:34:24.637 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: SUSPENDED
>>>>>>
>>>>>> 2016-04-30 01:34:24.652 b.s.cluster [WARN] Received event :disconnected::none: with disconnected Zookeeper.
>>>>>>
>>>>>> 2016-04-30 01:34:31.121 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server <zk host>:2181. Will not attempt to authenticate using SASL (unknown error)
>>>>>>
>>>>>> 2016-04-30 01:34:31.150 o.a.s.s.o.a.z.ClientCnxn [INFO] Socket connection established to <zk host>:2181, initiating session
>>>>>>
>>>>>> 2016-04-30 01:34:31.198 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to reconnect to ZooKeeper service, session 0x45453e198e80076 has expired, closing socket connection
>>>>>>
>>>>>> 2016-04-30 01:34:31.199 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: LOST
>>>>>>
>>>>>> 2016-04-30 01:34:31.203 b.s.cluster [WARN] Received event :expired::none: with disconnected Zookeeper.
>>>>>>
>>>>>> 2016-04-30 01:34:31.203 o.a.s.s.o.a.c.ConnectionState [WARN] Session expired event received
>>>>>>
>>>>>> On Sat, Apr 30, 2016 at 9:22 PM, Erik Weathers <eweath...@groupon.com> wrote:
>>>>>>
>>>>>>> (I don't know anything about the Pacemaker service introduced in Storm 1.0, so this statement is pre-1.0.)
>>>>>>>
>>>>>>> The executor threads within the worker processes write heartbeats to ZooKeeper. If they aren't successfully heartbeating then it could be many things:
>>>>>>>
>>>>>>> 1. ZK too busy? (Seems unlikely)
>>>>>>> 2. Network too busy? (Seems unlikely)
>>>>>>> 3. Worker process died due to exception (this is almost always what we see)
>>>>>>> 4. Worker process hung (eg doing GC). (This would usually first be caught by the supervisor on that host, since it checks a local heartbeat file that the worker normally writes to every second -- if the heartbeat doesn't get refreshed before the timeout, then the supervisor kills the worker process with State being :timed-out (or :time-out, something like that). This of course depends on the various timeout config values you have on the worker and nimbus hosts.)
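Because several timeouts interact here, it can help to print the values the cluster is actually running with. A minimal sketch using the client-side config loader; the constant names are taken from backtype.storm.Config as of 0.10, and the class name is hypothetical:

import java.util.Map;

import backtype.storm.Config;
import backtype.storm.utils.Utils;

// Prints the heartbeat/timeout settings that govern when a worker or
// executor is declared dead, as resolved from defaults.yaml + storm.yaml
// on the host where this is run.
public class DumpTimeouts {
    public static void main(String[] args) {
        Map conf = Utils.readStormConfig();

        String[] keys = {
            Config.WORKER_HEARTBEAT_FREQUENCY_SECS,   // worker -> local heartbeat file
            Config.TASK_HEARTBEAT_FREQUENCY_SECS,     // executors -> ZooKeeper
            Config.SUPERVISOR_WORKER_TIMEOUT_SECS,    // supervisor kills hung worker (:timed-out)
            Config.NIMBUS_TASK_TIMEOUT_SECS,          // nimbus marks an executor "not alive"
            Config.NIMBUS_SUPERVISOR_TIMEOUT_SECS,    // nimbus forgets a silent supervisor
            Config.STORM_ZOOKEEPER_SESSION_TIMEOUT    // ZK session expiry (ms)
        };
        for (String key : keys) {
            System.out.println(key + " = " + conf.get(key));
        }
    }
}

Comparing supervisor.worker.timeout.secs (default 30s) against storm.zookeeper.session.timeout (default 20000 ms) shows which side should notice a hung worker first.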
>>>>>>>
>>>>>>> - Erik
>>>>>>>
>>>>>>> On Saturday, April 30, 2016, Kevin Conaway <kevin.a.cona...@gmail.com> wrote:
>>>>>>>
>>>>>>>> We are using Storm 0.10 and we noticed that Nimbus decided to restart our topology. From researching past threads it seems like this is related to not receiving heartbeats from the supervisors, but I'm unsure if this was the case. Our topology was mostly idle at the time that the restart was triggered.
>>>>>>>>
>>>>>>>> We have a 5-node ZooKeeper (3.4.5) ensemble. On one of the ZK nodes, I saw the following messages at the time of the restart:
>>>>>>>>
>>>>>>>> 2016-04-30 01:33:46,001 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e8007f, timeout of 20000ms exceeded
>>>>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x25453e1c2640085, timeout of 20000ms exceeded
>>>>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e80076, timeout of 20000ms exceeded
>>>>>>>> 2016-04-30 01:33:48,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x35453e1a529008b, timeout of 20000ms exceeded
>>>>>>>> 2016-04-30 01:33:50,001 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x15453e198d10084, timeout of 20000ms exceeded
>>>>>>>> 2016-04-30 01:33:50,002 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x35453e1a5290090, timeout of 20000ms exceeded
>>>>>>>> 2016-04-30 01:33:50,002 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x15453e198d1008e, timeout of 20000ms exceeded
>>>>>>>>
>>>>>>>> In the nimbus log, there was the following log message:
>>>>>>>>
>>>>>>>> 2016-04-30 01:34:00.734 b.s.d.nimbus [INFO] Executor <topology>:[8 8] not alive
>>>>>>>>
>>>>>>>> Shortly thereafter, the supervisors started restarting the workers. The following log message was in the supervisor log:
>>>>>>>>
>>>>>>>> 2016-04-30 01:34:00.855 b.s.d.supervisor [INFO] Shutting down and clearing state for id 10ed4848-05f7-48e5-bf2a-736d12f208ed. Current supervisor time: 1461980040. State: :disallowed, Heartbeat: {:time-secs 1461980040, :storm-id "<topology>", :executors [[111 111] [75 75] [51 51] [3 3] [39 39] [159 159] [123 123] [63 63] [-1 -1] [147 147] [27 27] [87 87] [171 171] [195 195] [135 135] [15 15] [99 99] [183 183]], :port 6700}
>>>>>>>>
>>>>>>>> Previous threads have suggested that this was due to heavy GC causing the heartbeats not to reach ZooKeeper, but the topology was idle at this time so I don't think GC was the culprit. The GC ParNew time was about 50ms on each node (as reported to Graphite).
>>>>>>>>
>>>>>>>> Thoughts on what could have happened here and how to debug further?
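One thing worth watching while this happens is what Nimbus itself reports over its Thrift API: the topology status and each executor's uptime make it obvious when Nimbus stops considering an executor alive. A rough sketch against the 0.10 backtype.storm client classes; the Thrift-generated getters used here are assumptions worth double-checking against your Storm jar, and the class name is hypothetical:

import java.util.Map;

import backtype.storm.generated.ClusterSummary;
import backtype.storm.generated.ExecutorSummary;
import backtype.storm.generated.Nimbus;
import backtype.storm.generated.TopologyInfo;
import backtype.storm.generated.TopologySummary;
import backtype.storm.utils.NimbusClient;
import backtype.storm.utils.Utils;

// Queries Nimbus once and prints topology status plus per-executor uptime,
// to spot executors that Nimbus no longer considers alive.
public class TopologyHealthPoller {
    public static void main(String[] args) throws Exception {
        Map conf = Utils.readStormConfig();
        Nimbus.Client nimbus = NimbusClient.getConfiguredClient(conf).getClient();

        ClusterSummary cluster = nimbus.getClusterInfo();
        for (TopologySummary t : cluster.get_topologies()) {
            System.out.println(t.get_name() + " status=" + t.get_status());
            TopologyInfo info = nimbus.getTopologyInfo(t.get_id());
            for (ExecutorSummary e : info.get_executors()) {
                System.out.println("  " + e.get_component_id()
                        + " [" + e.get_executor_info().get_task_start()
                        + " " + e.get_executor_info().get_task_end() + "]"
                        + " on " + e.get_host() + ":" + e.get_port()
                        + " uptime=" + e.get_uptime_secs() + "s");
            }
        }
    }
}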
>>>>>>>> >>>>>>>> -- >>>>>>>> Kevin Conaway >>>>>>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/ >>>>>>>> https://github.com/kevinconaway >>>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Kevin Conaway >>>>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/ >>>>>> https://github.com/kevinconaway >>>>>> >>>>> >>>> >>>> -- >>>> Kevin Conaway >>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/ >>>> https://github.com/kevinconaway >>>> >>>> >> >> >> -- >> Kevin Conaway >> http://www.linkedin.com/pub/kevin-conaway/7/107/580/ >> https://github.com/kevinconaway >> > > > > -- > Kevin Conaway > http://www.linkedin.com/pub/kevin-conaway/7/107/580/ > https://github.com/kevinconaway > -- Kevin Conaway http://www.linkedin.com/pub/kevin-conaway/7/107/580/ https://github.com/kevinconaway