Looking at the Storm worker thread dumps, the bolt and spout components all appear to be waiting for the topology to become active (including the __system and __acker components):
"Thread-37-__acker" #70 prio=5 os_prio=0 tid=0x00007f6feca57000 nid=0x3a8d waiting on condition [0x00007f6f54ef2000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at backtype.storm.daemon.executor$fn__5694$fn__5707.invoke(executor.clj:713) at backtype.storm.util$async_loop$fn__545.invoke(util.clj:477) at clojure.lang.AFn.run(AFn.java:22) at java.lang.Thread.run(Thread.java:745) ... "Thread-35-my-bolt" #68 prio=5 os_prio=0 tid=0x00007f6feca53000 nid=0x3a8b waiting on condition [0x00007f6f550f4000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at backtype.storm.daemon.executor$fn__5694$fn__5707.invoke(executor.clj:713) at backtype.storm.util$async_loop$fn__545.invoke(util.clj:477) at clojure.lang.AFn.run(AFn.java:22) at java.lang.Thread.run(Thread.java:745) ... "Thread-17-__system" #50 prio=5 os_prio=0 tid=0x00007f6feca31800 nid=0x3a79 waiting on condition [0x00007f6f563a7000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at backtype.storm.daemon.executor$fn__5694$fn__5707.invoke(executor.clj:713) at backtype.storm.util$async_loop$fn__545.invoke(util.clj:477) at clojure.lang.AFn.run(AFn.java:22) at java.lang.Thread.run(Thread.java:745) Looking in the store code, I see that they are sleeping here: (while (not @(:storm-active-atom executor-data)) (Thread/sleep 100)) At what point does the topology become "active"? On Mon, May 2, 2016 at 11:38 AM, Kevin Conaway <kevin.a.cona...@gmail.com> wrote: > Some more interesting info. > > As I mentioned in the last post, we have 6 workers. Each worker has 2 > slots. > > Only one of the supervisors received a state :timed-out message. After it > received the message, it restarted both workers successfully. > > The other 5 supervisors all experienced the same issue as I mentioned in > the last post. The workers were launched twice. Whats interesting is that > both workers on those 5 nodes came up the second time in a hung state, they > all had a lot of ERROR messages from the netty client about not being able > to connect to the other nodes: > > 2016-04-30 01:34:59.758 b.s.m.n.Client [ERROR] connection attempt 26 to > Netty-Client-StormWorker-111549876-5-117071490.<fqdn>/10.226.77.191:6700 > failed: java.net.ConnectException: Connection refused: > StormWorker-111549876-5-117071490.<fqdn>/10.226.77.191:6700 > It looks like the bolts and spouts never got fully prepared (as evidenced > by lack of log messages indicating so) > > On the one supervisor which _did_ receive the timeout message, the 2 > workers came up just fine. > > On Mon, May 2, 2016 at 11:21 AM, Kevin Conaway <kevin.a.cona...@gmail.com> > wrote: > >> Unfortunately we're not capturing disk i/o in our metrics, I can look in >> to doing that for next time. >> >> We're not capturing GC logs, we are using the graphite storm metric >> consumer to push metrics to graphite, one of which is the GC time from the >> default GC mxbean. >> >> > I'm assuming you're saying that multiple workers had state :timed-out >> at once? >> >> We have 6 workers. Only one had state :timed-out, the others had state >> :disallowed. Looking at one of the supervisors as an example, it looks >> like it received multiple requests to reschedule the worker which caused >> the worker to be launched multiple times. Is this normal? >> >> 2016-04-30 01:34:00.872 b.s.d.supervisor [INFO] Shutting down and >> clearing state for id 589b0ed5-c4e9-422f-a6c5-5d65145915f7. 
>>
>> 2016-04-30 01:34:00.874 b.s.d.supervisor [INFO] Shutting down 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:589b0ed5-c4e9-422f-a6c5-5d65145915f7
>>
>> 2016-04-30 01:34:02.013 b.s.d.supervisor [INFO] Shutting down and clearing state for id 460a7d66-06de-4ca5-9140-7d46dcdea841. Current supervisor time: 1461980040. State: :disallowed, Heartbeat: {:time-secs 1461980040, :storm-id "<topology>", :executors [[178 178] [58 58] [190 190] [118 118] [22 22] [142 142] [-1 -1] [166 166] [106 106] [70 70] [10 10] [46 46] [82 82] [154 154] [94 94] [34 34] [130 130]], :port 6701}
>>
>> 2016-04-30 01:34:02.014 b.s.d.supervisor [INFO] Shutting down 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:460a7d66-06de-4ca5-9140-7d46dcdea841
>>
>> 2016-04-30 01:34:03.095 b.s.d.supervisor [INFO] Launching worker with assignment {:storm-id "<topology>", :executors [[3 3] [33 33] [103 103] [163 163] [53 53] [73 73] [123 123] [43 43] [63 63] [23 23] [93 93] [153 153] [13 13] [193 193] [143 143] [83 83] [173 173] [133 133] [183 183] [113 113]]} for this supervisor 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86 on port 6700 with id 567a18fd-33d1-49b6-a3f4-ace65641bd67
>>
>> 2016-04-30 01:34:03.122 b.s.d.supervisor [INFO] Launching worker with assignment {:storm-id "<topology>", :executors [[8 8] [188 188] [68 68] [198 198] [178 178] [58 58] [118 118] [18 18] [28 28] [38 38] [98 98] [48 48] [148 148] [158 158] [128 128] [88 88] [138 138] [108 108] [168 168] [78 78]]} for this supervisor 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86 on port 6701 with id 9fcc869d-08d7-44ec-bde2-bf9ed86403e6
>>
>> 2016-04-30 01:34:41.322 b.s.d.supervisor [INFO] Shutting down and clearing state for id 567a18fd-33d1-49b6-a3f4-ace65641bd67. Current supervisor time: 1461980081. State: :disallowed, Heartbeat: {:time-secs 1461980080, :storm-id "<topology>", :executors [[3 3] [33 33] [103 103] [163 163] [53 53] [73 73] [123 123] [43 43] [63 63] [23 23] [93 93] [-1 -1] [153 153] [13 13] [193 193] [143 143] [83 83] [173 173] [133 133] [183 183] [113 113]], :port 6700}
>>
>> 2016-04-30 01:34:41.323 b.s.d.supervisor [INFO] Shutting down 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:567a18fd-33d1-49b6-a3f4-ace65641bd67
>>
>> 2016-04-30 01:34:42.353 b.s.d.supervisor [INFO] Shutting down and clearing state for id 9fcc869d-08d7-44ec-bde2-bf9ed86403e6. Current supervisor time: 1461980081. State: :disallowed, Heartbeat: {:time-secs 1461980080, :storm-id "<topology>", :executors [[8 8] [188 188] [68 68] [198 198] [178 178] [58 58] [118 118] [18 18] [28 28] [38 38] [98 98] [48 48] [-1 -1] [148 148] [158 158] [128 128] [88 88] [138 138] [108 108] [168 168] [78 78]], :port 6701}
>>
>> 2016-04-30 01:34:42.354 b.s.d.supervisor [INFO] Shutting down 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:9fcc869d-08d7-44ec-bde2-bf9ed86403e6
>>
>> On Sun, May 1, 2016 at 5:52 PM, Erik Weathers <eweath...@groupon.com> wrote:
>>
>>> Maybe disk I/O was high? Are you capturing GC logs to disk in unique files (you can sub in the PID and timestamp into the GC log filename)? I know you believe it's not responsible, but it's the only thing I've ever found to be responsible thus far. (Except for a problem in storm 0.9.3 with netty that has since been fixed -- we worked around that by downgrading to zero-MQ.) You might try monitoring the heartbeat files written by the workers to watch for the file creation to be happening less frequently than once per second.
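A rough sketch of that kind of heartbeat-file watcher, run on a supervisor host (it assumes worker heartbeats are written under <storm.local.dir>/workers/<worker-id>/heartbeats and that the files' mtimes track the once-per-second writes; both assumptions are worth verifying against your install):

import java.io.File;

// Polls each worker's heartbeats/ directory and warns when the newest
// heartbeat file has not been touched recently.
// Pass storm.local.dir as the only argument (hypothetical class name).
public class HeartbeatWatcher {
    private static final long STALE_MS = 5000;  // well above the ~1s write cadence

    public static void main(String[] args) throws InterruptedException {
        File workersDir = new File(args[0], "workers");
        while (true) {
            File[] workers = workersDir.listFiles();
            if (workers != null) {
                for (File worker : workers) {
                    if (!worker.isDirectory()) {
                        continue;
                    }
                    File[] beats = new File(worker, "heartbeats").listFiles();
                    long newest = 0;
                    if (beats != null) {
                        for (File beat : beats) {
                            newest = Math.max(newest, beat.lastModified());
                        }
                    }
                    long ageMs = System.currentTimeMillis() - newest;
                    if (newest == 0 || ageMs > STALE_MS) {
                        System.out.println(worker.getName() + ": heartbeat stale, last write " + ageMs + " ms ago");
                    }
                }
            }
            Thread.sleep(1000);
        }
    }
}

Run with the same storm.local.dir the supervisor uses (e.g. java HeartbeatWatcher /mnt/storm, path hypothetical); a healthy worker should stay well under the 5-second threshold.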
>>>
>>> > all of the worker sessions expired at the same time
>>>
>>> I'm assuming you're saying that multiple workers had state :timed-out at once? Was that on the same host? If the state is :disallowed, that is perfectly normal when the reassignment happens, as I described earlier.
>>>
>>> - Erik
>>>
>>> On Sunday, May 1, 2016, Kevin Conaway <kevin.a.cona...@gmail.com> wrote:
>>>
>>>> Any tips on where to continue investigating or other metrics to capture?
>>>>
>>>> As I mentioned before, the topology was mostly idle: low CPU usage, low GC time (CMS ParNew), stable heap, no eth errors. It's hard to see why all of the worker sessions expired at the same time.
>>>>
>>>> On Sunday, May 1, 2016, Erik Weathers <eweath...@groupon.com> wrote:
>>>>
>>>>> To be clear, the supervisor wasn't shutting itself down, it was killing the worker process.
>>>>>
>>>>> Also for clarity, the :disallowed state simply means a new task->worker assignment has been computed by the nimbus, and so the supervisor has noticed from the ZK state that the currently running worker is no longer supposed to be running.
>>>>>
>>>>> Regarding the :timed-out state, it *definitely* means the worker was hung and couldn't write to the local heartbeat file within the timeout (30 secs by default). As for *why*... that is hard to determine.
>>>>>
>>>>> - Erik
>>>>>
>>>>> On Sunday, May 1, 2016, Kevin Conaway <kevin.a.cona...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Erik. We're using Storm 0.10 so Pacemaker doesn't come into play here.
>>>>>>
>>>>>> 3. Worker process died due to exception (this is almost always what we see)
>>>>>> 4. Worker process hung (eg doing GC).
>>>>>>
>>>>>> I don't think it's either of these, TBH. There were no abnormal terminations in the logs (or in the Storm error logs). I have the GC metrics that were reported to Graphite and the GC time was fairly low (~50ms).
>>>>>>
>>>>>> My feeling is that it is ZooKeeper related and not worker related, because it appears that ZK also expired sessions from the supervisor nodes:
>>>>>>
>>>>>> I saw the following messages in the log of one of the ZK nodes:
>>>>>>
>>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e80076, timeout of 20000ms exceeded
>>>>>>
>>>>>> 2016-04-30 01:33:46,005 [myid:4] - INFO [ProcessThread(sid:4 cport:-1)::PrepRequestProcessor@476] - Processed session termination for sessionid: 0x45453e198e80076
>>>>>>
>>>>>> 2016-04-30 01:33:46,009 [myid:4] - INFO [CommitProcessor:4:NIOServerCnxn@1001] - Closed socket connection for client /<ip>:44500 which had sessionid 0x45453e198e80076
>>>>>>
>>>>>> And some corresponding log messages in one of the supervisor log files (for the same ZK session ID). It looks like this occurred right as the supervisor was shutting down due to Nimbus restarting it:
>>>>>>
>>>>>> 2016-04-30 01:34:24.531 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to read additional data from server sessionid 0x45453e198e80076, likely server has closed socket, closing socket connection and attempting reconnect
>>>>>>
>>>>>> 2016-04-30 01:34:24.594 b.s.d.supervisor [INFO] Shutting down and clearing state for id 4e6934f9-88e4-4506-9d6e-25db72be941b. Current supervisor time: 1461980064. State: :timed-out, Heartbeat: {:time-secs 1461980008, :storm-id "<topology>", :executors [[62 62] [2 2] [38 38] [98 98] [50 50] [-1 -1] [170 170] [86 86] [194 194] [158 158] [122 122] [74 74] [26 26] [146 146] [110 110] [14 14] [182 182] [134 134]], :port 6700}
>>>>>>
>>>>>> 2016-04-30 01:34:24.596 b.s.d.supervisor [INFO] Shutting down 30a7e66c-ff0c-419a-9f50-eb44dc6b1ceb:4e6934f9-88e4-4506-9d6e-25db72be941b
>>>>>>
>>>>>> 2016-04-30 01:34:24.599 b.s.config [INFO] GET worker-user 4e6934f9-88e4-4506-9d6e-25db72be941b
>>>>>>
>>>>>> 2016-04-30 01:34:24.632 b.s.d.supervisor [INFO] Sleep 1 seconds for execution of cleanup threads on worker.
>>>>>>
>>>>>> 2016-04-30 01:34:24.637 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: SUSPENDED
>>>>>>
>>>>>> 2016-04-30 01:34:24.652 b.s.cluster [WARN] Received event :disconnected::none: with disconnected Zookeeper.
>>>>>>
>>>>>> 2016-04-30 01:34:31.121 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server <zk host>:2181. Will not attempt to authenticate using SASL (unknown error)
>>>>>>
>>>>>> 2016-04-30 01:34:31.150 o.a.s.s.o.a.z.ClientCnxn [INFO] Socket connection established to <zk host>:2181, initiating session
>>>>>>
>>>>>> 2016-04-30 01:34:31.198 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to reconnect to ZooKeeper service, session 0x45453e198e80076 has expired, closing socket connection
>>>>>>
>>>>>> 2016-04-30 01:34:31.199 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: LOST
>>>>>>
>>>>>> 2016-04-30 01:34:31.203 b.s.cluster [WARN] Received event :expired::none: with disconnected Zookeeper.
>>>>>>
>>>>>> 2016-04-30 01:34:31.203 o.a.s.s.o.a.c.ConnectionState [WARN] Session expired event received
>>>>>>
>>>>>> On Sat, Apr 30, 2016 at 9:22 PM, Erik Weathers <eweath...@groupon.com> wrote:
>>>>>>
>>>>>>> (I don't know anything about the Pacemaker service introduced in Storm 1.0, so this statement is pre-1.0.)
>>>>>>>
>>>>>>> The executor threads within the worker processes write heartbeats to ZooKeeper. If they aren't successfully heartbeating then it could be many things:
>>>>>>>
>>>>>>> 1. ZK too busy? (Seems unlikely)
>>>>>>> 2. Network too busy? (Seems unlikely)
>>>>>>> 3. Worker process died due to exception (this is almost always what we see)
>>>>>>> 4. Worker process hung (eg doing GC). (This would usually first be caught by the supervisor on that host, since it checks a local heartbeat file that the worker normally writes to every second -- if the heartbeat doesn't get refreshed before the timeout, then the supervisor kills the worker process with State being :timed-out (or :time-out, something like that). This of course depends on the various timeout config values you have on the worker and nimbus hosts.)
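Because several timeouts interact here, it can help to print the values the cluster is actually running with. A minimal sketch using the client-side config loader; the constant names are taken from backtype.storm.Config as of 0.10, and the class name is hypothetical:

import java.util.Map;

import backtype.storm.Config;
import backtype.storm.utils.Utils;

// Prints the heartbeat/timeout settings that govern when a worker or
// executor is declared dead, as resolved from defaults.yaml + storm.yaml
// on the host where this is run.
public class DumpTimeouts {
    public static void main(String[] args) {
        Map conf = Utils.readStormConfig();

        String[] keys = {
            Config.WORKER_HEARTBEAT_FREQUENCY_SECS,   // worker -> local heartbeat file
            Config.TASK_HEARTBEAT_FREQUENCY_SECS,     // executors -> ZooKeeper
            Config.SUPERVISOR_WORKER_TIMEOUT_SECS,    // supervisor kills hung worker (:timed-out)
            Config.NIMBUS_TASK_TIMEOUT_SECS,          // nimbus marks an executor "not alive"
            Config.NIMBUS_SUPERVISOR_TIMEOUT_SECS,    // nimbus forgets a silent supervisor
            Config.STORM_ZOOKEEPER_SESSION_TIMEOUT    // ZK session expiry (ms)
        };
        for (String key : keys) {
            System.out.println(key + " = " + conf.get(key));
        }
    }
}

Comparing supervisor.worker.timeout.secs (default 30s) against storm.zookeeper.session.timeout (default 20000 ms) shows which side should notice a hung worker first.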
>>>>>>>
>>>>>>> - Erik
>>>>>>>
>>>>>>> On Saturday, April 30, 2016, Kevin Conaway <kevin.a.cona...@gmail.com> wrote:
>>>>>>>
>>>>>>>> We are using Storm 0.10 and we noticed that Nimbus decided to restart our topology. From researching past threads it seems like this is related to not receiving heartbeats from the supervisors, but I'm unsure if this was the case. Our topology was mostly idle at the time that the restart was triggered.
>>>>>>>>
>>>>>>>> We have a 5-node ZooKeeper (3.4.5) ensemble. On one of the ZK nodes, I saw the following messages at the time of the restart:
>>>>>>>>
>>>>>>>> 2016-04-30 01:33:46,001 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e8007f, timeout of 20000ms exceeded
>>>>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x25453e1c2640085, timeout of 20000ms exceeded
>>>>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e80076, timeout of 20000ms exceeded
>>>>>>>> 2016-04-30 01:33:48,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x35453e1a529008b, timeout of 20000ms exceeded
>>>>>>>> 2016-04-30 01:33:50,001 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x15453e198d10084, timeout of 20000ms exceeded
>>>>>>>> 2016-04-30 01:33:50,002 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x35453e1a5290090, timeout of 20000ms exceeded
>>>>>>>> 2016-04-30 01:33:50,002 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x15453e198d1008e, timeout of 20000ms exceeded
>>>>>>>>
>>>>>>>> In the nimbus log, there was the following log message:
>>>>>>>>
>>>>>>>> 2016-04-30 01:34:00.734 b.s.d.nimbus [INFO] Executor <topology>:[8 8] not alive
>>>>>>>>
>>>>>>>> Shortly thereafter, the supervisors started restarting the workers. The following log message was in the supervisor log:
>>>>>>>>
>>>>>>>> 2016-04-30 01:34:00.855 b.s.d.supervisor [INFO] Shutting down and clearing state for id 10ed4848-05f7-48e5-bf2a-736d12f208ed. Current supervisor time: 1461980040. State: :disallowed, Heartbeat: {:time-secs 1461980040, :storm-id "<topology>", :executors [[111 111] [75 75] [51 51] [3 3] [39 39] [159 159] [123 123] [63 63] [-1 -1] [147 147] [27 27] [87 87] [171 171] [195 195] [135 135] [15 15] [99 99] [183 183]], :port 6700}
>>>>>>>>
>>>>>>>> Previous threads have suggested that this was due to heavy GC causing the heartbeats not to reach ZooKeeper, but the topology was idle at this time so I don't think GC was the culprit. The GC ParNew time was about 50ms on each node (as reported to Graphite).
>>>>>>>>
>>>>>>>> Thoughts on what could have happened here and how to debug further?
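One thing worth watching while this happens is what Nimbus itself reports over its Thrift API: the topology status and each executor's uptime make it obvious when Nimbus stops considering an executor alive. A rough sketch against the 0.10 backtype.storm client classes; the Thrift-generated getters used here are assumptions worth double-checking against your Storm jar, and the class name is hypothetical:

import java.util.Map;

import backtype.storm.generated.ClusterSummary;
import backtype.storm.generated.ExecutorSummary;
import backtype.storm.generated.Nimbus;
import backtype.storm.generated.TopologyInfo;
import backtype.storm.generated.TopologySummary;
import backtype.storm.utils.NimbusClient;
import backtype.storm.utils.Utils;

// Queries Nimbus once and prints topology status plus per-executor uptime,
// to spot executors that Nimbus no longer considers alive.
public class TopologyHealthPoller {
    public static void main(String[] args) throws Exception {
        Map conf = Utils.readStormConfig();
        Nimbus.Client nimbus = NimbusClient.getConfiguredClient(conf).getClient();

        ClusterSummary cluster = nimbus.getClusterInfo();
        for (TopologySummary t : cluster.get_topologies()) {
            System.out.println(t.get_name() + " status=" + t.get_status());
            TopologyInfo info = nimbus.getTopologyInfo(t.get_id());
            for (ExecutorSummary e : info.get_executors()) {
                System.out.println("  " + e.get_component_id()
                        + " [" + e.get_executor_info().get_task_start()
                        + " " + e.get_executor_info().get_task_end() + "]"
                        + " on " + e.get_host() + ":" + e.get_port()
                        + " uptime=" + e.get_uptime_secs() + "s");
            }
        }
    }
}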
>>>>>>>> >>>>>>>> -- >>>>>>>> Kevin Conaway >>>>>>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/ >>>>>>>> https://github.com/kevinconaway >>>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Kevin Conaway >>>>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/ >>>>>> https://github.com/kevinconaway >>>>>> >>>>> >>>> >>>> -- >>>> Kevin Conaway >>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/ >>>> https://github.com/kevinconaway >>>> >>>> >> >> >> -- >> Kevin Conaway >> http://www.linkedin.com/pub/kevin-conaway/7/107/580/ >> https://github.com/kevinconaway >> > > > > -- > Kevin Conaway > http://www.linkedin.com/pub/kevin-conaway/7/107/580/ > https://github.com/kevinconaway > -- Kevin Conaway http://www.linkedin.com/pub/kevin-conaway/7/107/580/ https://github.com/kevinconaway