Hello,

I would appreciate your help with the questions below.

I have 2 Spark masters and 3 ZooKeepers deployed on my system, each on a separate virtual machine. The services come online in the following sequence:

1. zookeeper-1
2. sparkmaster-1
3. sparkmaster-2
4. zookeeper-2
5. zookeeper-3

With this sequence, both Spark masters end up running in STANDBY mode.

From the logs, I can see that the Spark master is able to create a ZooKeeper session only after the zookeeper-2 service comes up (i.e. 2 ZooKeeper services are up). Until zookeeper-2 is up, it retries session creation. However, once both ZooKeeper services are up and the Persistence Engine has successfully connected and created a session, the ZooKeeper LeaderElection Agent is not called.

Logs:

10:03:47.241 INFO org.apache.spark.internal.Logging:57 - Persisting recovery state to ZooKeeper
Initiating client connection, connectString=zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx sessionTimeout=60000 watcher=org.apache.curator.ConnectionState

##### Only zookeeper-2 is online #####

10:03:47.630 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-1:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:03:50.635 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket error occurred: zookeeper-1:xxxx: No route to host
10:03:50.738 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-2:xxxx. Will not attempt to authenticate using SASL (unknown error)
2020-12-18 10:03:50.739 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-2:xxxx, initiating session
10:03:50.742 INFO org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
10:03:51.842 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-3:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:03:51.843 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket error occurred: zookeeper-3:xxxx: Connection refused
10:04:02.685 ERROR org.apache.curator.ConnectionState:200 - Connection timed out for connection string (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) and timeout (15000) / elapsed (15274)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
    at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
10:04:22.691 ERROR org.apache.curator.ConnectionState:200 - Connection timed out for connection string (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) and timeout (15000) / elapsed (35297)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
10:04:42.696 ERROR org.apache.curator.ConnectionState:200 - Connection timed out for connection string (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) and timeout (15000) / elapsed (55301)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
    at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
10:05:32.699 WARN org.apache.curator.ConnectionState:191 - Connection attempt unsuccessful after 105305 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.
10:05:32.864 INFO org.apache.zookeeper.ZooKeeper:693 - Session: 0x0 closed
10:05:32.865 INFO org.apache.zookeeper.ZooKeeper:442 - Initiating client connection, connectString=zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@
10:05:32.864 INFO org.apache.zookeeper.ClientCnxn$EventThread:522 - EventThread shut down for session: 0x0
10:05:32.969 ERROR org.apache.spark.internal.Logging:94 - Ignoring error
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /x/y
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)

##### zookeeper-2, zookeeper-3 are online #####

10:05:47.357 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-2:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:05:47.358 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-2:xxxx, initiating session
10:05:47.359 INFO org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
10:05:47.528 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-1:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:05:50.529 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket error occurred: zookeeper-1:xxxx: No route to host
10:05:51.454 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-3:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:05:51.455 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-3:xxxx, initiating session
10:05:51.457 INFO org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
10:05:57.564 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-3:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:05:57.566 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-3:xxxx, initiating session
10:05:57.574 INFO org.apache.zookeeper.ClientCnxn$SendThread:1299 - Session establishment complete on server zookeeper-3:xxxx, sessionid = xxxx, negotiated timeout = 40000
10:05:57.580 INFO org.apache.curator.framework.state.ConnectionStateManager:228 - State change: CONNECTED

Questions:

1. The last line of the logs above indicates that a ZooKeeper session was successfully established. Why, then, is the ZooKeeper LeaderElection Agent not being called?
2. Is there any Spark configuration that increases the number of retries or the timeouts used when connecting to ZooKeeper?

Any insight on this is appreciated.

Thanks & Regards,
Saloni R. Mehta