[jira] [Commented] (SPARK-33943) Zookeeper LeaderElection Agent not being called by Spark Master
[ https://issues.apache.org/jira/browse/SPARK-33943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256965#comment-17256965 ]

Saloni commented on SPARK-33943:

If we increase the timeouts/no. of retries, will that resolve the issue, i.e. will it ensure that the ZooKeeper LeaderElection Agent is called? The crux of it boils down to understanding why, after successful establishment of the session, the LeaderElection Agent is not called.

> Zookeeper LeaderElection Agent not being called by Spark Master
> ---
>
> Key: SPARK-33943
> URL: https://issues.apache.org/jira/browse/SPARK-33943
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0
> Environment: 2 Spark Masters KVMs and 3 Zookeeper KVMs.
> Operating System - RHEL 6.10
> Reporter: Saloni
> Priority: Major
>
> I have 2 spark masters and 3 zookeepers deployed on my system on separate virtual machines. I am using spark in standalone mode.
> The services come up online in the below sequence:
> # zookeeper-1
> # sparkmaster-1
> # sparkmaster-2
> # zookeeper-2
> # zookeeper-3
> The above sequence leads to both the spark masters running in STANDBY mode. From the logs, I can see that only after the zookeeper-2 service comes up (i.e. 2 zookeeper services are up) is the spark master able to create a zookeeper session. Until zookeeper-2 is up, it retries session creation. However, after both zookeeper services are up and the Persistence Engine is able to successfully connect and create a session, *the ZooKeeper LeaderElection Agent is not called*.
> Logs (spark-master.log):
> {code:java}
> 10:03:47.241 INFO org.apache.spark.internal.Logging:57 - Persisting recovery state to ZooKeeper
> Initiating client connection, connectString=zookeeper-2:,zookeeper-3:,zookeeper-1: sessionTimeout=6 watcher=org.apache.curator.ConnectionState
> # Only zookeeper-2 is online #
> 10:03:47.630 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-1:. Will not attempt to authenticate using SASL (unknown error)
> 10:03:50.635 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket error occurred: zookeeper-1:: No route to host
> 10:03:50.738 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-2:. Will not attempt to authenticate using SASL (unknown error)
> 10:03:50.739 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-2:, initiating session
> 10:03:50.742 INFO org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
> 10:03:51.842 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-3:. Will not attempt to authenticate using SASL (unknown error)
> 10:03:51.843 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket error occurred: zookeeper-3:: Connection refused
> 10:04:02.685 ERROR org.apache.curator.ConnectionState:200 - Connection timed out for connection string (zookeeper-2:,zookeeper-3:,zookeeper-1:) and timeout (15000) / elapsed (15274)
> org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
> at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
> at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
> ...
> 10:04:22.691 ERROR org.apache.curator.ConnectionState:200 - Connection timed out for connection string (zookeeper-2:,zookeeper-3:,zookeeper-1:) and timeout (15000) / elapsed (35297)
> org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
> at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
> ...
> 10:04:42.696 ERROR org.apache.curator.ConnectionState:200 - Connection timed out for connection string (zookeeper-2:,zookeeper-3:,zookeeper-1:) and timeout (15000) / elapsed (55301)
> org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
> at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
> at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
> ...
> 10:05:32.699 WARN org.apache.curator.ConnectionState:191 - Connection attempt unsuccessful after 105305 (greater than max timeout of 6). Resetting connection and trying again with a new connection.
> 10:05:32.864 INFO org.apache.zookeeper.ZooKeeper:693 - Session: 0x0 c
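For reference, standalone-mode HA with ZooKeeper is enabled on each master through `spark.deploy.*` properties passed in `SPARK_DAEMON_JAVA_OPTS`. A minimal sketch of that configuration (the hostnames and the default ZooKeeper client port 2181 are assumptions for illustration; the ports are blank in the logs above):

```shell
# spark-env.sh on every Spark master (hypothetical hosts; 2181 is ZooKeeper's default client port)
export SPARK_DAEMON_JAVA_OPTS="$SPARK_DAEMON_JAVA_OPTS \
  -Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"
```

Note that the Curator connection/session timeouts the master's client uses do not appear among the documented `spark.deploy.zookeeper.*` settings, so whether raising the timeouts/retries is even configurable should be verified against the Spark version in use.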
[jira] [Commented] (SPARK-33943) Zookeeper LeaderElection Agent not being called by Spark Master
[ https://issues.apache.org/jira/browse/SPARK-33943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256962#comment-17256962 ]

Saloni commented on SPARK-33943:

As per my understanding, the retries in the logs for establishing a ZooKeeper session are for 'Persisting recovery state to ZooKeeper':
{code:java}
10:03:47.241 INFO org.apache.spark.internal.Logging:57 - Persisting recovery state to ZooKeeper
Initiating client connection, connectString=zookeeper-2:,zookeeper-3:,zookeeper-1: sessionTimeout=6 watcher=org.apache.curator.ConnectionState
{code}
Once this session is successfully established, the ZooKeeper LeaderElection Agent should then be called. The last lines in the log state that a session was successfully created; it seems this was for the Persistence Engine (since that is the connection that was initiated):
{code:java}
10:05:57.566 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-3:, initiating session
10:05:57.574 INFO org.apache.zookeeper.ClientCnxn$SendThread:1299 - Session establishment complete on server zookeeper-3:, sessionid = , negotiated timeout = 4
10:05:57.580 INFO org.apache.curator.framework.state.ConnectionStateManager:228 - State change: CONNECTED
{code}
What I don't understand is why the ZooKeeper LeaderElection Agent was not called if the spark master was able to connect to the zookeepers.
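For context on what the agent does once a session is up: Spark's ZooKeeperLeaderElectionAgent is built on Curator's LeaderLatch, which implements the classic ZooKeeper election recipe of ephemeral sequential znodes, where the client holding the lowest sequence number is the leader. A toy in-process sketch of that recipe (plain Python, simulated znodes, no real ZooKeeper or Curator API involved):

```python
# Toy model of the ephemeral-sequential election recipe. Znodes are
# simulated as client_id -> sequence-number entries; nothing here talks
# to a real ZooKeeper ensemble.
class ToyElection:
    def __init__(self):
        self._seq = 0
        self._nodes = {}  # client_id -> sequence number

    def join(self, client_id):
        """Like creating an EPHEMERAL_SEQUENTIAL znode under the latch path."""
        self._seq += 1
        self._nodes[client_id] = self._seq
        return self._seq

    def leave(self, client_id):
        """Session loss deletes the ephemeral node, triggering re-election."""
        self._nodes.pop(client_id, None)

    def leader(self):
        """The participant holding the lowest sequence number is the leader."""
        if not self._nodes:
            return None
        return min(self._nodes, key=self._nodes.get)

election = ToyElection()
election.join("sparkmaster-1")
election.join("sparkmaster-2")
assert election.leader() == "sparkmaster-1"  # lowest sequence wins
election.leave("sparkmaster-1")              # e.g. session expiry
assert election.leader() == "sparkmaster-2"  # the other master takes over
```

In the real agent, the `isLeader()`/`notLeader()` callbacks are what flip a master between ALIVE and STANDBY, which is why the agent never being invoked leaves both masters in STANDBY.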
[jira] [Updated] (SPARK-33943) Zookeeper LeaderElection Agent not being called by Spark Master
[ https://issues.apache.org/jira/browse/SPARK-33943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saloni updated SPARK-33943:
---
Description:

I have 2 spark masters and 3 zookeepers deployed on my system on separate virtual machines. I am using spark in standalone mode. The services come up online in the below sequence:
# zookeeper-1
# sparkmaster-1
# sparkmaster-2
# zookeeper-2
# zookeeper-3
The above sequence leads to both the spark masters running in STANDBY mode. From the logs, I can see that only after zookeeper-2 service comes up (i.e. 2 zookeeper services are up), spark master is successfully able to create a zookeeper session. Until zookeeper-2 is up, it re-tries session creation. However, after both zookeeper services are up and Persistence Engine is able to successfully connect and create a session; *the ZooKeeper LeaderElection Agent is not called*.

Logs (spark-master.log):
{code:java}
...
10:05:32.699 WARN org.apache.curator.ConnectionState:191 - Connection attempt unsuccessful after 105305 (greater than max timeout of 6). Resetting connection and trying again with a new connection.
10:05:32.864 INFO org.apache.zookeeper.ZooKeeper:693 - Session: 0x0 closed
10:05:32.865 INFO org.apache.zookeeper.ZooKeeper:442 - Initiating client connection, connectString=zookeeper-2:,zookeeper-3:,zookeeper-1: sessionTimeout=6 watcher=org.apache.curator.ConnectionState@
10:05:32.864 INFO org.apache.zookeeper.ClientCnxn$EventThread:522 - EventThread shut down for session: 0x0
10:05:32.969 ERROR org.apache.spark.internal.Logging:94 - Ignoring error
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /x/y
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
# zookeeper-2, zookeeper-3 are online #
10:05:47.357 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-2:. Will not attempt to authenticate using SASL (unknown error)
10:05:47.358 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection establ
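The observation that the session could only be created once zookeeper-2 came up is consistent with ZooKeeper's quorum rule: an ensemble of N servers serves requests only while a strict majority (floor(N/2) + 1) is up, so a 3-node ensemble needs at least 2 live servers. A quick sketch of that rule:

```python
def has_quorum(servers_up: int, ensemble_size: int) -> bool:
    """ZooKeeper serves requests only while a strict majority of the ensemble is up."""
    return servers_up > ensemble_size // 2

# 3-node ensemble (zookeeper-1..3): one live server is not enough, two are.
assert not has_quorum(1, 3)
assert has_quorum(2, 3)
assert has_quorum(3, 3)
```

This explains the retry loop before zookeeper-2 started, but not why the LeaderElection Agent stays silent after quorum is reached.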
[jira] [Updated] (SPARK-33943) Zookeeper LeaderElection Agent not being called by Spark Master
[ https://issues.apache.org/jira/browse/SPARK-33943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saloni updated SPARK-33943:
---
Environment:
2 Spark Masters KVMs and 3 Zookeeper KVMs.
Operating System - RHEL 6.10

was:
2 Spark Masters KVMs and 3 Zookeeper KVMs.
Operating System - RHEL 6.6
[jira] [Created] (SPARK-33943) Zookeeper LeaderElection Agent not being called by Spark Master
Saloni created SPARK-33943:
---
Summary: Zookeeper LeaderElection Agent not being called by Spark Master
Key: SPARK-33943
URL: https://issues.apache.org/jira/browse/SPARK-33943
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.0.0
Environment: 2 Spark Masters KVMs and 3 Zookeeper KVMs.
Operating System - RHEL 6.6
Reporter: Saloni