To follow up on this, we saw a similar issue again today. A machine in our ZooKeeper cluster, which happened to be the leader, was brought down for maintenance. Quorum was not lost, and a new leader was elected 5 seconds later. However, after this temporary disconnection, our Nimbus could no longer elect itself leader. Should Storm 1.2.1 handle this situation correctly, or is there a known bug?
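For context, my understanding (an assumption on my part from skimming the sources, not something I've verified) is that Nimbus leadership rides on Curator's LeaderLatch, which holds an ephemeral znode and is supposed to re-join the election on its own after a RECONNECTED event. A minimal sketch of that pattern, with a made-up connect string and lock path:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class LeaderLatchSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical connect string; substitute the real ensemble.
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181",
                    new ExponentialBackoffRetry(1000, 3));
            client.start();

            // Hypothetical lock path; Storm keeps a leader lock node
            // somewhere under its ZooKeeper root.
            LeaderLatch latch = new LeaderLatch(client, "/storm/leader-lock", "nimbus-1");
            latch.addListener(new LeaderLatchListener() {
                @Override
                public void isLeader() {
                    System.out.println("acquired leadership");
                }

                @Override
                public void notLeader() {
                    // Fired when the connection is SUSPENDED/LOST; the latch
                    // is expected to re-contend once RECONNECTED arrives.
                    System.out.println("lost leadership");
                }
            });
            latch.start(); // joins the election via an ephemeral sequential znode
            Thread.sleep(Long.MAX_VALUE);
        }
    }

If that is indeed the mechanism, then after the 14:32:36 reconnect shown in the logs below I would have expected the latch to re-contend and isLeader() to fire again, which is not what we observe.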
Logs (the old leader was lost at 14:32:31; the new leader was elected at 14:32:36). Losing the connection:

2019-09-23 14:32:33,216 INFO ClientCnxn [main-SendThread(dob2-bfs-r5n06:25173)] Unable to read additional data from server sessionid 0x100805579713ea8, likely server has closed socket, closing socket connection and attempting reconnect
2019-09-23 14:32:33,217 INFO ClientCnxn [main-SendThread(dob2-bfs-r5n08:25173)] Unable to read additional data from server sessionid 0x30000223e30094e, likely server has closed socket, closing socket connection and attempting reconnect
2019-09-23 14:32:33,218 INFO ClientCnxn [main-SendThread(dob2-bfs-r5n06:25173)] Unable to read additional data from server sessionid 0x100805579713ea7, likely server has closed socket, closing socket connection and attempting reconnect
2019-09-23 14:32:33,218 INFO ClientCnxn [main-SendThread(dob2-bfs-r5n07:25173)] Unable to read additional data from server sessionid 0x20000287dbd08ab, likely server has closed socket, closing socket connection and attempting reconnect
2019-09-23 14:32:33,321 INFO Zookeeper [Curator-ConnectionStateManager-0] trslnydtraap02 lost leadership.

Reconnecting:

2019-09-23 14:32:36,284 INFO ClientCnxn [main-SendThread(dob2-bfs-r5n07:25173)] Opening socket connection to server dob2-bfs-r5n06/10.122.73.209:25173. Will not attempt to authenticate using SASL (unknown error)
2019-09-23 14:32:36,285 INFO ClientCnxn [main-SendThread(dob2-bfs-r5n07:25173)] Socket connection established to dob2-bfs-r5n06/10.122.73.209:25173, initiating session
2019-09-23 14:32:36,288 INFO ClientCnxn [main-SendThread(dob2-bfs-r5n07:25173)] Session establishment complete on server dob2-bfs-r5n06/10.122.73.209:25173, sessionid = 0x20000287dbd08ab, negotiated timeout = 40000
2019-09-23 14:32:36,288 INFO ConnectionStateManager [main-EventThread] State change: RECONNECTED
2019-09-23 14:32:36,288 INFO zookeeper [main-EventThread] Zookeeper state update: :connected:none
2019-09-23 14:32:36,454 INFO ClientCnxn [main-SendThread(dob2-bfs-r5n08:25173)] Opening socket connection to server dob2-bfs-r5n08/10.122.73.211:25173. Will not attempt to authenticate using SASL (unknown error)
2019-09-23 14:32:36,455 INFO ClientCnxn [main-SendThread(dob2-bfs-r5n08:25173)] Socket connection established to dob2-bfs-r5n08/10.122.73.211:25173, initiating session
2019-09-23 14:32:36,458 INFO ClientCnxn [main-SendThread(dob2-bfs-r5n08:25173)] Session establishment complete on server dob2-bfs-r5n08/10.122.73.211:25173, sessionid = 0x30000223e30094e, negotiated timeout = 40000
2019-09-23 14:32:36,458 INFO ConnectionStateManager [main-EventThread] State change: RECONNECTED
2019-09-23 14:32:40,342 INFO nimbus [timer] not a leader, skipping assignments
2019-09-23 14:32:40,342 INFO nimbus [timer] not a leader, skipping cleanup
2019-09-23 14:32:50,343 INFO nimbus [timer] not a leader, skipping assignments
2019-09-23 14:32:50,343 INFO nimbus [timer] not a leader, skipping cleanup
2019-09-23 14:33:00,343 INFO nimbus [timer] not a leader, skipping assignments
2019-09-23 14:33:00,344 INFO nimbus [timer] not a leader, skipping cleanup
2019-09-23 14:33:10,344 INFO nimbus [timer] not a leader, skipping assignments
2019-09-23 14:33:10,345 INFO nimbus [timer] not a leader, skipping cleanup
2019-09-23 14:33:20,345 INFO nimbus [timer] not a leader, skipping assignments

From: user@storm.apache.org
At: 09/23/19 12:46:31
To: user@storm.apache.org
Subject: Leader Election issues on cluster restart

We are currently running a Storm cluster on one machine, so there is a single nimbus/supervisor instance in a given cluster. We have recently had issues where Nimbus was started and was unable to become leader, even though no other instances were running at the time. The cluster was seemingly brought down successfully:

2019-09-21 22:12:47,518 INFO nimbus [Thread-7] Shutting down master
2019-09-21 22:12:47,520 INFO CuratorFrameworkImpl [Curator-Framework-0] backgroundOperationsLoop exiting
2019-09-21 22:12:47,527 INFO ZooKeeper [Thread-7] Session: 0x30000223e30079a closed
2019-09-21 22:12:47,527 INFO ClientCnxn [main-EventThread] EventThread shut down
2019-09-21 22:12:47,528 INFO CuratorFrameworkImpl [Curator-Framework-0] backgroundOperationsLoop exiting
2019-09-21 22:12:47,533 INFO ClientCnxn [main-EventThread] EventThread shut down
2019-09-21 22:12:47,533 INFO ZooKeeper [Thread-7] Session: 0x30000223e30079b closed
2019-09-21 22:12:47,534 INFO CuratorFrameworkImpl [Curator-Framework-0] backgroundOperationsLoop exiting
2019-09-21 22:12:47,539 INFO ClientCnxn [main-EventThread] EventThread shut down
2019-09-21 22:12:47,539 INFO ZooKeeper [Thread-7] Session: 0x30000223e300798 closed
2019-09-21 22:12:47,539 INFO nimbus [Thread-7] Shut down master

It was then brought back up 20 minutes later. When it came up, we immediately started seeing:

2019-09-21 22:32:47,082 INFO JmxPreparableReporter [main] Preparing...
2019-09-21 22:32:47,098 INFO common [main] Started statistics report plugin...
2019-09-21 22:32:47,140 INFO nimbus [main] Starting nimbus server for storm version '1.2.1'
2019-09-21 22:32:47,219 INFO PlainSaslTransportPlugin [main] SASL PLAIN transport factory will be used
2019-09-21 22:32:47,858 INFO nimbus [timer] not a leader, skipping assignments
2019-09-21 22:32:47,858 INFO nimbus [timer] not a leader, skipping cleanup
2019-09-21 22:32:47,860 INFO nimbus [timer] not a leader, skipping credential renewal.
2019-09-21 22:32:49,134 INFO AbstractSaslServerCallbackHandler [pool-14-thread-1] Successfully authenticated client: authenticationID = op authorizationID = op
2019-09-21 22:32:49,171 INFO AbstractSaslServerCallbackHandler [pool-14-thread-2] Successfully authenticated client: authenticationID = op authorizationID = op
2019-09-21 22:32:57,858 INFO nimbus [timer] not a leader, skipping assignments
2019-09-21 22:32:57,859 INFO nimbus [timer] not a leader, skipping cleanup
2019-09-21 22:33:07,860 INFO nimbus [timer] not a leader, skipping assignments
2019-09-21 22:33:07,860 INFO nimbus [timer] not a leader, skipping cleanup
2019-09-21 22:33:17,862 INFO nimbus [timer] not a leader, skipping assignments

followed shortly by:

2019-09-21 22:33:52,409 WARN nimbus [pool-14-thread-7] Topology submission exception. (topology name='WingmanTopology4159')
#error {
 :cause not a leader, current leader is NimbusInfo{host='trslnydtraap01', port=30553, isLeader=true}
 :via [{:type java.lang.RuntimeException
        :message not a leader, current leader is NimbusInfo{host='trslnydtraap01', port=30553, isLeader=true}
        :at [org.apache.storm.daemon.nimbus$is_leader doInvoke nimbus.clj 150]}]
 :trace

What could cause this election issue? Since no other leader processes were running or known in the cluster, I am assuming that some cluster state was not cleaned up correctly, either in ZooKeeper or on disk. In general, how does Storm mark whether there is a leader in a cluster? What could be the cause of the issue posted above?
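One thing I plan to try, to test the stale-state theory: list the children of the leader lock path directly and check which session owns each of them. A rough sketch (the lock path here is my guess, not something I've confirmed against the Storm sources; the connect string is our ZooKeeper ensemble from the logs above):

    import java.util.List;
    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.RetryOneTime;
    import org.apache.zookeeper.data.Stat;

    public class LeaderLockInspector {
        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "dob2-bfs-r5n06:25173,dob2-bfs-r5n07:25173,dob2-bfs-r5n08:25173",
                    new RetryOneTime(1000));
            client.start();

            // Hypothetical lock path: list latch entries and the ZooKeeper
            // session that owns each ephemeral node.
            String lockPath = "/storm/leader-lock";
            List<String> children = client.getChildren().forPath(lockPath);
            for (String child : children) {
                Stat stat = client.checkExists().forPath(lockPath + "/" + child);
                System.out.printf("%s ephemeralOwner=0x%x%n",
                        child, stat.getEphemeralOwner());
            }
            client.close();
        }
    }

If a latch child is still present with an ephemeralOwner that no longer corresponds to a live session, that would explain Nimbus seeing another "leader" that does not exist; ephemeral nodes should vanish when their owning session expires, so a leftover entry would point at a session that was never cleaned up properly.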