[ https://issues.apache.org/jira/browse/SPARK-33943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Saloni updated SPARK-33943:
---------------------------
    Description:
I have 2 spark masters and 3 zookeepers deployed on my system on separate virtual machines. I am using spark in standalone mode.

The services come up online in the below sequence:
 # zookeeper-1
 # sparkmaster-1
 # sparkmaster-2
 # zookeeper-2
 # zookeeper-3

The above sequence leads to both the spark masters running in STANDBY mode.

From the logs, I can see that only after zookeeper-2 service comes up (i.e. 2 zookeeper services are up), spark master is successfully able to create a zookeeper session. Until zookeeper-2 is up, it re-tries session creation. However, after both zookeeper services are up and Persistence Engine is able to successfully connect and create a session; *the ZooKeeper LeaderElection Agent is not called*.

Logs (spark-master.log):
{code:java}
10:03:47.241 INFO org.apache.spark.internal.Logging:57 - Persisting recovery state to ZooKeeper Initiating client connection, connectString=zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx sessionTimeout=60000 watcher=org.apache.curator.ConnectionState
##### Only zookeeper-2 is online #####
10:03:47.630 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-1:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:03:50.635 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket error occurred: zookeeper-1:xxxx: No route to host
10:03:50.738 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-2:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:03:50.739 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-2:xxxx, initiating session
10:03:50.742 INFO org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
10:03:51.842 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-3:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:03:51.843 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket error occurred: zookeeper-3:xxxx: Connection refused
10:04:02.685 ERROR org.apache.curator.ConnectionState:200 - Connection timed out for connection string (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) and timeout (15000) / elapsed (15274)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
  at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
  at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
  ...
  ...
  ...
10:04:22.691 ERROR org.apache.curator.ConnectionState:200 - Connection timed out for connection string (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) and timeout (15000) / elapsed (35297)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
  at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
  ...
  ...
  ...
10:04:42.696 ERROR org.apache.curator.ConnectionState:200 - Connection timed out for connection string (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) and timeout (15000) / elapsed (55301)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
  at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
  at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
  ...
  ...
  ...
10:05:32.699 WARN org.apache.curator.ConnectionState:191 - Connection attempt unsuccessful after 105305 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.
10:05:32.864 INFO org.apache.zookeeper.ZooKeeper:693 - Session: 0x0 closed
10:05:32.865 INFO org.apache.zookeeper.ZooKeeper:442 - Initiating client connection, connectString=zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@
10:05:32.864 INFO org.apache.zookeeper.ClientCnxn$EventThread:522 - EventThread shut down for session: 0x0
10:05:32.969 ERROR org.apache.spark.internal.Logging:94 - Ignoring error
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /x/y
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
##### zookeeper-2, zookeeper-3 are online #####
10:05:47.357 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-2:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:05:47.358 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-2:xxxx, initiating session
10:05:47.359 INFO org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
10:05:47.528 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-1:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:05:50.529 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket error occurred: zookeeper-1:xxxx: No route to host
10:05:51.454 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-3:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:05:51.455 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-3:xxxx, initiating session
10:05:51.457 INFO org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
10:05:57.564 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-3:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:05:57.566 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-3:xxxx, initiating session
10:05:57.574 INFO org.apache.zookeeper.ClientCnxn$SendThread:1299 - Session establishment complete on server zookeeper-3:xxxx, sessionid = xxxx, negotiated timeout = 40000
10:05:57.580 INFO org.apache.curator.framework.state.ConnectionStateManager:228 - State change: CONNECTED
{code}

Steps to reproduce:
Environment: A cluster of 3 zookeepers and a cluster of 2 spark master vms
 # All zookeepers and spark masters are offline
 # Online zookeeper-2
 # Online both spark-masters
 # After around 3 mins of zookeeper-2 being onlined, online zookeeper-3
 # Online zookeeper-1

Questions:
 # The last line from the logs above indicates that a zookeeper session was successfully established. Why is the Zookeeper LeaderElection Agent not being called then?
 # Is there any configuration that we can do in spark so as to increase the number of retries/timeouts while connecting to zookeeper?

> Zookeeper LeaderElection Agent not being called by Spark Master
> ---------------------------------------------------------------
>
>                 Key: SPARK-33943
>                 URL: https://issues.apache.org/jira/browse/SPARK-33943
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0
>         Environment: 2 Spark Masters KVMs and 3 Zookeeper KVMs.
> Operating System - RHEL 6.10
>            Reporter: Saloni
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
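
A note on the session behaviour reported above: the observation that a session is only created once 2 zookeeper services are up is consistent with ZooKeeper's majority quorum, where a server that is not part of a quorum does not serve client sessions. The sketch below only illustrates that arithmetic for a 3-server ensemble. It assumes the default majority-quorum setup, and the function names quorum_size and has_quorum are made up for this illustration; they are not part of Spark, Curator or ZooKeeper. It does not address why the LeaderElection Agent is not called afterwards.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of an ensemble with the given number of servers."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def has_quorum(servers_online: int, ensemble_size: int) -> bool:
    """True when enough servers are online to form a majority quorum."""
    return servers_online >= quorum_size(ensemble_size)


if __name__ == "__main__":
    # Hypothetical walk through the startup sequence from the report:
    # a 3-server ensemble with 1, 2 and then 3 servers online.
    for online in (1, 2, 3):
        print(f"{online} of 3 online -> quorum: {has_quorum(online, 3)}")
```

With 3 servers the majority is 2, so under this assumption a single online zookeeper is expected to refuse sessions, which would match the "likely server has closed socket" entries logged while only zookeeper-2 was up.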