We are using Storm 0.10, and we noticed that Nimbus decided to restart our topology. From researching past threads, it seems this is usually related to Nimbus not receiving heartbeats from the supervisors, but I'm not sure that was the case here: our topology was mostly idle at the time the restart was triggered.
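For reference, these are the timeout settings I believe are in play. The values shown are what I understand to be the stock 0.10 defaults from defaults.yaml (please correct me if other knobs matter here):

    # storm.yaml -- my understanding of the stock 0.10 defaults
    storm.zookeeper.session.timeout: 20000     # ms; matches the "timeout of 20000ms exceeded" in the ZK log below
    storm.zookeeper.connection.timeout: 15000  # ms
    nimbus.task.timeout.secs: 30               # executor heartbeats older than this are reported "not alive"
    supervisor.worker.timeout.secs: 30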
We have a 5 node Zookeeper (3.4.5) ensemble. On one of the ZK nodes, I saw the following messages at the time of the restart:

    2016-04-30 01:33:46,001 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e8007f, timeout of 20000ms exceeded
    2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x25453e1c2640085, timeout of 20000ms exceeded
    2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e80076, timeout of 20000ms exceeded
    2016-04-30 01:33:48,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x35453e1a529008b, timeout of 20000ms exceeded
    2016-04-30 01:33:50,001 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x15453e198d10084, timeout of 20000ms exceeded
    2016-04-30 01:33:50,002 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x35453e1a5290090, timeout of 20000ms exceeded
    2016-04-30 01:33:50,002 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x15453e198d1008e, timeout of 20000ms exceeded

In the nimbus log, there was the following message:

    2016-04-30 01:34:00.734 b.s.d.nimbus [INFO] Executor <topology>:[8 8] not alive

Shortly thereafter, the supervisors started restarting the workers. The following message was in the supervisor log:

    2016-04-30 01:34:00.855 b.s.d.supervisor [INFO] Shutting down and clearing state for id 10ed4848-05f7-48e5-bf2a-736d12f208ed. Current supervisor time: 1461980040. State: :disallowed, Heartbeat: {:time-secs 1461980040, :storm-id "<topology>", :executors [[111 111] [75 75] [51 51] [3 3] [39 39] [159 159] [123 123] [63 63] [-1 -1] [147 147] [27 27] [87 87] [171 171] [195 195] [135 135] [15 15] [99 99] [183 183]], :port 6700}

Previous threads have suggested that this is caused by heavy GC preventing the heartbeats from reaching Zookeeper, but the topology was idle at the time, so I don't think GC was the culprit. The ParNew GC time was about 50ms on each node (as reported to Graphite).

Thoughts on what could have happened here and how to debug further?
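In case it helps, here is what I was planning to try next (a rough sketch, not something I've confirmed is the right approach; "zk-node-4" below is a placeholder for one of our ZK hosts):

    # ZooKeeper four-letter commands (available in 3.4.x) to see whether the
    # ensemble itself stalled around the 01:33 expiry window
    echo mntr | nc zk-node-4 2181 | egrep 'zk_(avg|max)_latency|zk_outstanding_requests'
    echo stat | nc zk-node-4 2181

I was also considering turning on GC logging in the workers to rule pauses in or out definitively, rather than relying on the Graphite numbers:

    # storm.yaml -- my understanding is that %ID% is substituted with the worker port
    worker.childopts: "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/storm/worker-%ID%-gc.log"

--
Kevin Conaway
http://www.linkedin.com/pub/kevin-conaway/7/107/580/
https://github.com/kevinconaway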