We are currently running a Storm cluster on a single machine, so there is only one Nimbus/Supervisor instance in the cluster. We recently hit an issue where Nimbus was started but was unable to become leader, even though no other Nimbus instances were running at the time. We had seemingly shut the cluster down cleanly:
2019-09-21 22:12:47,518 INFO nimbus [Thread-7] Shutting down master
2019-09-21 22:12:47,520 INFO CuratorFrameworkImpl [Curator-Framework-0] backgroundOperationsLoop exiting
2019-09-21 22:12:47,527 INFO ZooKeeper [Thread-7] Session: 0x30000223e30079a closed
2019-09-21 22:12:47,527 INFO ClientCnxn [main-EventThread] EventThread shut down
2019-09-21 22:12:47,528 INFO CuratorFrameworkImpl [Curator-Framework-0] backgroundOperationsLoop exiting
2019-09-21 22:12:47,533 INFO ClientCnxn [main-EventThread] EventThread shut down
2019-09-21 22:12:47,533 INFO ZooKeeper [Thread-7] Session: 0x30000223e30079b closed
2019-09-21 22:12:47,534 INFO CuratorFrameworkImpl [Curator-Framework-0] backgroundOperationsLoop exiting
2019-09-21 22:12:47,539 INFO ClientCnxn [main-EventThread] EventThread shut down
2019-09-21 22:12:47,539 INFO ZooKeeper [Thread-7] Session: 0x30000223e300798 closed
2019-09-21 22:12:47,539 INFO nimbus [Thread-7] Shut down master
We then brought the cluster back up about 20 minutes later. As soon as it started, we began seeing:
2019-09-21 22:32:47,082 INFO JmxPreparableReporter [main] Preparing...
2019-09-21 22:32:47,098 INFO common [main] Started statistics report plugin...
2019-09-21 22:32:47,140 INFO nimbus [main] Starting nimbus server for storm version '1.2.1'
2019-09-21 22:32:47,219 INFO PlainSaslTransportPlugin [main] SASL PLAIN transport factory will be used
2019-09-21 22:32:47,858 INFO nimbus [timer] not a leader, skipping assignments
2019-09-21 22:32:47,858 INFO nimbus [timer] not a leader, skipping cleanup
2019-09-21 22:32:47,860 INFO nimbus [timer] not a leader, skipping credential renewal.
2019-09-21 22:32:49,134 INFO AbstractSaslServerCallbackHandler [pool-14-thread-1] Successfully authenticated client: authenticationID = op authorizationID = op
2019-09-21 22:32:49,171 INFO AbstractSaslServerCallbackHandler [pool-14-thread-2] Successfully authenticated client: authenticationID = op authorizationID = op
2019-09-21 22:32:57,858 INFO nimbus [timer] not a leader, skipping assignments
2019-09-21 22:32:57,859 INFO nimbus [timer] not a leader, skipping cleanup
2019-09-21 22:33:07,860 INFO nimbus [timer] not a leader, skipping assignments
2019-09-21 22:33:07,860 INFO nimbus [timer] not a leader, skipping cleanup
2019-09-21 22:33:17,862 INFO nimbus [timer] not a leader, skipping assignments
followed shortly by:
2019-09-21 22:33:52,409 WARN nimbus [pool-14-thread-7] Topology submission exception. (topology name='WingmanTopology4159') #error {
:cause not a leader, current leader is NimbusInfo{host='trslnydtraap01', port=30553, isLeader=true}
:via
[{:type java.lang.RuntimeException
  :message not a leader, current leader is NimbusInfo{host='trslnydtraap01', port=30553, isLeader=true}
  :at [org.apache.storm.daemon.nimbus$is_leader doInvoke nimbus.clj 150]}]
:trace
What could cause this election issue? Since no other Nimbus processes were running, or known to the cluster, at the time, I assume some cluster state was not cleaned up correctly, either in ZooKeeper or on disk. In general, how does Storm record which Nimbus instance is the leader of a cluster, and what could be causing the behavior shown above?
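For reference, my understanding is that Nimbus leader election is built on Apache Curator's LeaderLatch recipe over ZooKeeper. Below is a minimal, self-contained sketch of how that recipe acquires and reports leadership; the connect string, the /storm/leader-lock latch path, and the participant id are my own assumptions for illustration, not necessarily the exact values Storm uses:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class NimbusLeaderSketch {
    public static void main(String[] args) throws Exception {
        // Connect string, latch path, and participant id are assumptions for illustration.
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        // LeaderLatch creates an ephemeral sequential znode under the latch path;
        // the participant holding the lowest sequence number is the leader.
        LeaderLatch latch = new LeaderLatch(zk, "/storm/leader-lock", "nimbus-host:6627");
        latch.start();

        // A participant that did not win the latch keeps running but is not leader,
        // which would match the "not a leader, skipping assignments" log lines.
        System.out.println("has leadership: " + latch.hasLeadership());
        System.out.println("current leader: " + latch.getLeader().getId());

        latch.close();   // removes this participant's ephemeral znode
        zk.close();
    }
}

If that is roughly how Storm does it, the latch entry is ephemeral and tied to a ZooKeeper session, so I would expect leadership to be released once the old Nimbus's sessions close (as they appear to in the shutdown log above). Stale leadership information 20 minutes later would then suggest something in ZooKeeper or on disk was not released or cleaned up as expected.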