We are using Storm 0.10 and we noticed that Nimbus decided to restart our
topology.  From researching past threads, it seems this is related to
Nimbus not receiving heartbeats from the supervisors, but I'm not sure that
was the case here.  Our topology was mostly idle at the time the restart
was triggered.

We have a 5-node ZooKeeper (3.4.5) ensemble.  On one of the ZK nodes, I saw
the following messages at the time of the restart:

2016-04-30 01:33:46,001 [myid:4] - INFO  [SessionTracker:ZooKeeperServer@325]
- Expiring session 0x45453e198e8007f, timeout of 20000ms exceeded
2016-04-30 01:33:46,003 [myid:4] - INFO  [SessionTracker:ZooKeeperServer@325]
- Expiring session 0x25453e1c2640085, timeout of 20000ms exceeded
2016-04-30 01:33:46,003 [myid:4] - INFO  [SessionTracker:ZooKeeperServer@325]
- Expiring session 0x45453e198e80076, timeout of 20000ms exceeded
2016-04-30 01:33:48,003 [myid:4] - INFO  [SessionTracker:ZooKeeperServer@325]
- Expiring session 0x35453e1a529008b, timeout of 20000ms exceeded
2016-04-30 01:33:50,001 [myid:4] - INFO  [SessionTracker:ZooKeeperServer@325]
- Expiring session 0x15453e198d10084, timeout of 20000ms exceeded
2016-04-30 01:33:50,002 [myid:4] - INFO  [SessionTracker:ZooKeeperServer@325]
- Expiring session 0x35453e1a5290090, timeout of 20000ms exceeded
2016-04-30 01:33:50,002 [myid:4] - INFO  [SessionTracker:ZooKeeperServer@325]
- Expiring session 0x15453e198d1008e, timeout of 20000ms exceeded
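
For reference, the 20000ms in those messages matches what I believe is
Storm's default ZooKeeper session timeout.  Assuming defaults, the settings
that govern this in storm.yaml would be:

storm.zookeeper.session.timeout: 20000     # ms before ZK expires a client's session
storm.zookeeper.connection.timeout: 15000  # ms allowed when (re)connecting to ZK

(Values shown are the stock defaults as I understand them, not necessarily
what is deployed on our cluster.)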

In the Nimbus log, there was the following message:

2016-04-30 01:34:00.734 b.s.d.nimbus [INFO] Executor <topology>:[8 8] not
alive
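
As I understand it, Nimbus flags an executor as not alive when the
heartbeat it reads from ZooKeeper is older than nimbus.task.timeout.secs,
checked on each pass of its monitor thread.  Assuming defaults:

nimbus.task.timeout.secs: 30    # heartbeat age at which Nimbus considers an executor dead
nimbus.monitor.freq.secs: 10    # how often Nimbus checks heartbeats and reassigns work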

Shortly thereafter, the supervisors started restarting the workers.  The
following message was in the supervisor log:

2016-04-30 01:34:00.855 b.s.d.supervisor [INFO] Shutting down and clearing
state for id 10ed4848-05f7-48e5-bf2a-736d12f208ed. Current supervisor time:
1461980040. State: :disallowed, Heartbeat: {:time-secs 1461980040,
:storm-id "<topology>", :executors [[111 111] [75 75] [51 51] [3 3] [39 39]
[159 159] [123 123] [63 63] [-1 -1] [147 147] [27 27] [87 87] [171 171]
[195 195] [135 135] [15 15] [99 99] [183 183]], :port 6700}
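
My reading of the :disallowed state is that this worker id is no longer in
the assignment the supervisor pulled from ZooKeeper (i.e. Nimbus reassigned
the executors after declaring them not alive), so the supervisor shuts the
worker down.  To see whether worker heartbeats are actually being written,
something like the following against ZooKeeper should work, assuming the
default storm.zookeeper.root of /storm and the topology id from the Storm UI:

zkCli.sh -server <zk-host>:2181
  ls /storm/workerbeats/<topology-id>                 # one child per worker (node-port)
  stat /storm/workerbeats/<topology-id>/<node-port>   # mtime = last heartbeat write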

Previous threads have suggested that this was due to heavy GC causing the
heartbeats not to reach ZooKeeper, but the topology was idle at this time,
so I don't think GC was the culprit.  The ParNew GC time was about 50ms on
each node (as reported to Graphite).
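
If it helps to rule GC (or other JVM pauses) in or out more definitively,
one option would be to enable GC logging on the workers and correlate pause
times with the expiration timestamps.  A sketch, assuming HotSpot (JDK 7/8)
flags merged into the existing worker.childopts:

worker.childopts: "-Xmx768m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/storm/worker-gc-%ID%.log"

(The -Xmx768m value and the %ID% substitution for the worker port are
assumptions based on the stock defaults; the log path is just an example.)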

Thoughts on what could have happened here and how to debug further?

-- 
Kevin Conaway
http://www.linkedin.com/pub/kevin-conaway/7/107/580/
https://github.com/kevinconaway
