To follow up on this, we saw a similar issue again today. A machine in our ZooKeeper cluster, which happened to be the leader, was brought down for maintenance. Quorum was not lost, and a new leader was elected 5 seconds later. However, after this temporary disconnection, our Nimbus could no longer elect itself leader. Should Storm 1.2.1 handle this situation correctly, or is there a known bug?
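For context, my understanding (an assumption on my part from skimming the sources, not something I've verified) is that Nimbus leadership rides on Curator's LeaderLatch, which holds an ephemeral znode and is supposed to re-join the election on its own after a RECONNECTED event. A minimal sketch of that pattern, with a made-up connect string and lock path:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class LeaderLatchSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical connect string; substitute the real ensemble.
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181",
                    new ExponentialBackoffRetry(1000, 3));
            client.start();

            // Hypothetical lock path; Storm keeps a leader lock node
            // somewhere under its ZooKeeper root.
            LeaderLatch latch = new LeaderLatch(client, "/storm/leader-lock", "nimbus-1");
            latch.addListener(new LeaderLatchListener() {
                @Override
                public void isLeader() {
                    System.out.println("acquired leadership");
                }

                @Override
                public void notLeader() {
                    // Fired when the connection is SUSPENDED/LOST; the latch
                    // is expected to re-contend once RECONNECTED arrives.
                    System.out.println("lost leadership");
                }
            });
            latch.start(); // joins the election via an ephemeral sequential znode
            Thread.sleep(Long.MAX_VALUE);
        }
    }

If that is indeed the mechanism, then after the 14:32:36 reconnect shown in the logs below I would have expected the latch to re-contend and isLeader() to fire again, which is not what we observe.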
Logs (the old leader was lost at 14:32:31; the new leader was elected at 14:32:36). Losing the connection:

2019-09-23 14:32:33,216 INFO ClientCnxn [main-SendThread(dob2-bfs-r5n06:25173)] Unable to read additional data from server sessionid 0x100805579713ea8, likely server has closed socket, closing socket connection and attempting reconnect
2019-09-23 14:32:33,217 INFO ClientCnxn [main-SendThread(dob2-bfs-r5n08:25173)] Unable to read additional data from server sessionid 0x30000223e30094e, likely server has closed socket, closing socket connection and attempting reconnect
2019-09-23 14:32:33,218 INFO ClientCnxn [main-SendThread(dob2-bfs-r5n06:25173)] Unable to read additional data from server sessionid 0x100805579713ea7, likely server has closed socket, closing socket connection and attempting reconnect
2019-09-23 14:32:33,218 INFO ClientCnxn [main-SendThread(dob2-bfs-r5n07:25173)] Unable to read additional data from server sessionid 0x20000287dbd08ab, likely server has closed socket, closing socket connection and attempting reconnect
2019-09-23 14:32:33,321 INFO Zookeeper [Curator-ConnectionStateManager-0] trslnydtraap02 lost leadership.

Reconnecting:

2019-09-23 14:32:36,284 INFO ClientCnxn [main-SendThread(dob2-bfs-r5n07:25173)] Opening socket connection to server dob2-bfs-r5n06/10.122.73.209:25173. Will not attempt to authenticate using SASL (unknown error)
2019-09-23 14:32:36,285 INFO ClientCnxn [main-SendThread(dob2-bfs-r5n07:25173)] Socket connection established to dob2-bfs-r5n06/10.122.73.209:25173, initiating session
2019-09-23 14:32:36,288 INFO ClientCnxn [main-SendThread(dob2-bfs-r5n07:25173)] Session establishment complete on server dob2-bfs-r5n06/10.122.73.209:25173, sessionid = 0x20000287dbd08ab, negotiated timeout = 40000
2019-09-23 14:32:36,288 INFO ConnectionStateManager [main-EventThread] State change: RECONNECTED
2019-09-23 14:32:36,288 INFO zookeeper [main-EventThread] Zookeeper state update: :connected:none
2019-09-23 14:32:36,454 INFO ClientCnxn [main-SendThread(dob2-bfs-r5n08:25173)] Opening socket connection to server dob2-bfs-r5n08/10.122.73.211:25173. Will not attempt to authenticate using SASL (unknown error)
2019-09-23 14:32:36,455 INFO ClientCnxn [main-SendThread(dob2-bfs-r5n08:25173)] Socket connection established to dob2-bfs-r5n08/10.122.73.211:25173, initiating session
2019-09-23 14:32:36,458 INFO ClientCnxn [main-SendThread(dob2-bfs-r5n08:25173)] Session establishment complete on server dob2-bfs-r5n08/10.122.73.211:25173, sessionid = 0x30000223e30094e, negotiated timeout = 40000
2019-09-23 14:32:36,458 INFO ConnectionStateManager [main-EventThread] State change: RECONNECTED
2019-09-23 14:32:40,342 INFO nimbus [timer] not a leader, skipping assignments
2019-09-23 14:32:40,342 INFO nimbus [timer] not a leader, skipping cleanup
2019-09-23 14:32:50,343 INFO nimbus [timer] not a leader, skipping assignments
2019-09-23 14:32:50,343 INFO nimbus [timer] not a leader, skipping cleanup
2019-09-23 14:33:00,343 INFO nimbus [timer] not a leader, skipping assignments
2019-09-23 14:33:00,344 INFO nimbus [timer] not a leader, skipping cleanup
2019-09-23 14:33:10,344 INFO nimbus [timer] not a leader, skipping assignments
2019-09-23 14:33:10,345 INFO nimbus [timer] not a leader, skipping cleanup
2019-09-23 14:33:20,345 INFO nimbus [timer] not a leader, skipping assignments

From: user@storm.apache.org
At: 09/23/19 12:46:31
To: user@storm.apache.org
Subject: Leader Election issues on cluster restart

We are currently running a Storm cluster on one machine, so there is a single nimbus/supervisor instance in a given cluster. We have recently had issues where Nimbus was started and was unable to become leader, even though no other instances were running at the time. The cluster was seemingly brought down successfully:

2019-09-21 22:12:47,518 INFO nimbus [Thread-7] Shutting down master
2019-09-21 22:12:47,520 INFO CuratorFrameworkImpl [Curator-Framework-0] backgroundOperationsLoop exiting
2019-09-21 22:12:47,527 INFO ZooKeeper [Thread-7] Session: 0x30000223e30079a closed
2019-09-21 22:12:47,527 INFO ClientCnxn [main-EventThread] EventThread shut down
2019-09-21 22:12:47,528 INFO CuratorFrameworkImpl [Curator-Framework-0] backgroundOperationsLoop exiting
2019-09-21 22:12:47,533 INFO ClientCnxn [main-EventThread] EventThread shut down
2019-09-21 22:12:47,533 INFO ZooKeeper [Thread-7] Session: 0x30000223e30079b closed
2019-09-21 22:12:47,534 INFO CuratorFrameworkImpl [Curator-Framework-0] backgroundOperationsLoop exiting
2019-09-21 22:12:47,539 INFO ClientCnxn [main-EventThread] EventThread shut down
2019-09-21 22:12:47,539 INFO ZooKeeper [Thread-7] Session: 0x30000223e300798 closed
2019-09-21 22:12:47,539 INFO nimbus [Thread-7] Shut down master

It was then brought back up 20 minutes later. When it came up, we immediately started seeing:

2019-09-21 22:32:47,082 INFO JmxPreparableReporter [main] Preparing...
2019-09-21 22:32:47,098 INFO common [main] Started statistics report plugin...
2019-09-21 22:32:47,140 INFO nimbus [main] Starting nimbus server for storm version '1.2.1'
2019-09-21 22:32:47,219 INFO PlainSaslTransportPlugin [main] SASL PLAIN transport factory will be used
2019-09-21 22:32:47,858 INFO nimbus [timer] not a leader, skipping assignments
2019-09-21 22:32:47,858 INFO nimbus [timer] not a leader, skipping cleanup
2019-09-21 22:32:47,860 INFO nimbus [timer] not a leader, skipping credential renewal.
2019-09-21 22:32:49,134 INFO AbstractSaslServerCallbackHandler [pool-14-thread-1] Successfully authenticated client: authenticationID = op authorizationID = op
2019-09-21 22:32:49,171 INFO AbstractSaslServerCallbackHandler [pool-14-thread-2] Successfully authenticated client: authenticationID = op authorizationID = op
2019-09-21 22:32:57,858 INFO nimbus [timer] not a leader, skipping assignments
2019-09-21 22:32:57,859 INFO nimbus [timer] not a leader, skipping cleanup
2019-09-21 22:33:07,860 INFO nimbus [timer] not a leader, skipping assignments
2019-09-21 22:33:07,860 INFO nimbus [timer] not a leader, skipping cleanup
2019-09-21 22:33:17,862 INFO nimbus [timer] not a leader, skipping assignments

followed shortly by:

2019-09-21 22:33:52,409 WARN nimbus [pool-14-thread-7] Topology submission exception. (topology name='WingmanTopology4159')
#error {
 :cause not a leader, current leader is NimbusInfo{host='trslnydtraap01', port=30553, isLeader=true}
 :via [{:type java.lang.RuntimeException
        :message not a leader, current leader is NimbusInfo{host='trslnydtraap01', port=30553, isLeader=true}
        :at [org.apache.storm.daemon.nimbus$is_leader doInvoke nimbus.clj 150]}]
 :trace

What could cause this election issue? Since no other leader processes were running or known in the cluster, I am assuming that some cluster state was not cleaned up correctly, either in ZooKeeper or on disk. In general, how does Storm mark whether there is a leader in a cluster? What could be the cause of the issue posted above?
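One thing I plan to try, to test the stale-state theory: list the children of the leader lock path directly and check which session owns each of them. A rough sketch (the lock path here is my guess, not something I've confirmed against the Storm sources; the connect string is our ZooKeeper ensemble from the logs above):

    import java.util.List;
    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.RetryOneTime;
    import org.apache.zookeeper.data.Stat;

    public class LeaderLockInspector {
        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "dob2-bfs-r5n06:25173,dob2-bfs-r5n07:25173,dob2-bfs-r5n08:25173",
                    new RetryOneTime(1000));
            client.start();

            // Hypothetical lock path: list latch entries and the ZooKeeper
            // session that owns each ephemeral node.
            String lockPath = "/storm/leader-lock";
            List<String> children = client.getChildren().forPath(lockPath);
            for (String child : children) {
                Stat stat = client.checkExists().forPath(lockPath + "/" + child);
                System.out.printf("%s ephemeralOwner=0x%x%n",
                        child, stat.getEphemeralOwner());
            }
            client.close();
        }
    }

If a latch child is still present with an ephemeralOwner that no longer corresponds to a live session, that would explain Nimbus seeing another "leader" that does not exist; ephemeral nodes should vanish when their owning session expires, so a leftover entry would point at a session that was never cleaned up properly.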