Saloni created SPARK-33943:
------------------------------

Summary: Zookeeper LeaderElection Agent not being called by Spark Master
Key: SPARK-33943
URL: https://issues.apache.org/jira/browse/SPARK-33943
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.0.0
Environment: 2 Spark Masters KVMs and 3 Zookeeper KVMs. Operating System - RHEL 6.6
Reporter: Saloni

I have 2 Spark masters and 3 ZooKeeper servers deployed on separate virtual machines. The services come online in the following sequence:

# zookeeper-1
# sparkmaster-1
# sparkmaster-2
# zookeeper-2
# zookeeper-3

This sequence leaves both Spark masters running in STANDBY mode.

From the logs, I can see that the Spark master is able to create a ZooKeeper session only after the zookeeper-2 service comes up (i.e. 2 ZooKeeper services are up). Until zookeeper-2 is up, it retries session creation. However, after both ZooKeeper services are up and the Persistence Engine has successfully connected and created a session, *the ZooKeeper LeaderElection Agent is not called*.

Logs (spark-master.log):

{code:java}
10:03:47.241 INFO org.apache.spark.internal.Logging:57 - Persisting recovery state to ZooKeeper
Initiating client connection, connectString=zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx sessionTimeout=60000 watcher=org.apache.curator.ConnectionState
##### Only zookeeper-2 is online #####
10:03:47.630 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-1:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:03:50.635 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket error occurred: zookeeper-1:xxxx: No route to host
10:03:50.738 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-2:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:03:50.739 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-2:xxxx, initiating session
10:03:50.742 INFO org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
10:03:51.842 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-3:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:03:51.843 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket error occurred: zookeeper-3:xxxx: Connection refused
10:04:02.685 ERROR org.apache.curator.ConnectionState:200 - Connection timed out for connection string (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) and timeout (15000) / elapsed (15274)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
    at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
... ... ...
10:04:22.691 ERROR org.apache.curator.ConnectionState:200 - Connection timed out for connection string (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) and timeout (15000) / elapsed (35297)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
... ... ...
10:04:42.696 ERROR org.apache.curator.ConnectionState:200 - Connection timed out for connection string (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) and timeout (15000) / elapsed (55301)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
    at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
... ... ...
10:05:32.699 WARN org.apache.curator.ConnectionState:191 - Connection attempt unsuccessful after 105305 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.
10:05:32.864 INFO org.apache.zookeeper.ZooKeeper:693 - Session: 0x0 closed
10:05:32.865 INFO org.apache.zookeeper.ZooKeeper:442 - Initiating client connection, connectString=zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@
10:05:32.864 INFO org.apache.zookeeper.ClientCnxn$EventThread:522 - EventThread shut down for session: 0x0
10:05:32.969 ERROR org.apache.spark.internal.Logging:94 - Ignoring error
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /x/y
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
##### zookeeper-2, zookeeper-3 are online #####
10:05:47.357 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-2:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:05:47.358 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-2:xxxx, initiating session
10:05:47.359 INFO org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
10:05:47.528 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-1:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:05:50.529 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket error occurred: zookeeper-1:xxxx: No route to host
10:05:51.454 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-3:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:05:51.455 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-3:xxxx, initiating session
10:05:51.457 INFO org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
10:05:57.564 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-3:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:05:57.566 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-3:xxxx, initiating session
10:05:57.574 INFO org.apache.zookeeper.ClientCnxn$SendThread:1299 - Session establishment complete on server zookeeper-3:xxxx, sessionid = xxxx, negotiated timeout = 40000
10:05:57.580 INFO org.apache.curator.framework.state.ConnectionStateManager:228 - State change: CONNECTED
{code}

Steps to reproduce:

Environment: A cluster of 3 ZooKeeper servers and a cluster of 2 Spark master VMs.

# All ZooKeeper servers and Spark masters are offline.
# Bring zookeeper-2 online.
# Bring both Spark masters online.
# About 3 minutes after zookeeper-2 comes online, bring zookeeper-3 online.
# Bring zookeeper-1 online.

Questions:

# The last line of the logs above indicates that a ZooKeeper session was successfully established. Why is the ZooKeeper LeaderElection Agent not called then?
# Is there any Spark configuration that increases the number of retries or the timeouts used when connecting to ZooKeeper?
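
For reference, the point at which the session is finally established is consistent with ZooKeeper's majority-quorum rule: an ensemble of 3 servers needs at least 2 online before it accepts client sessions. The sketch below is only an illustration of that arithmetic applied to the reproduction steps above. It assumes the default majority quorum, and the function names are hypothetical helpers, not part of Spark, Curator, or ZooKeeper. It does not explain why the LeaderElection Agent is not called afterwards.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of an ensemble (assumes default majority quorum)."""
    return ensemble_size // 2 + 1


def has_quorum(online: set, ensemble: set) -> bool:
    """True when enough ensemble members are online to form a majority."""
    return len(online & ensemble) >= quorum_size(len(ensemble))


def quorum_timeline(startup_order: list, ensemble: set) -> list:
    """Replay a startup order and record whether a quorum exists after each step."""
    online = set()
    timeline = []
    for server in startup_order:
        online.add(server)
        timeline.append((server, has_quorum(online, ensemble)))
    return timeline


# Server names and startup order taken from the reproduction steps in this report.
ENSEMBLE = {"zookeeper-1", "zookeeper-2", "zookeeper-3"}
STARTUP_ORDER = ["zookeeper-2", "zookeeper-3", "zookeeper-1"]

if __name__ == "__main__":
    for server, ok in quorum_timeline(STARTUP_ORDER, ENSEMBLE):
        print(f"{server} online -> quorum available: {ok}")
```

With only zookeeper-2 online there is no quorum, which matches the repeated ConnectionLoss errors. Once zookeeper-3 joins, a quorum exists and the session reaches CONNECTED.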