[ 
https://issues.apache.org/jira/browse/SPARK-33943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256965#comment-17256965
 ] 

Saloni commented on SPARK-33943:
--------------------------------

If we increase the timeouts/number of retries, will that resolve the issue, 
i.e. will it ensure that the ZooKeeper LeaderElection Agent is called?

The crux of it is understanding why, after the session is established 
successfully, the LeaderElection Agent is still not called.
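
For reference, the leader election in the standalone master is (as far as I can 
tell) built on Curator's LeaderLatch recipe. Below is a minimal sketch of that 
pattern, not Spark's actual code: the connection string, ports and znode path 
are placeholders, and the timeouts simply mirror the values visible in the log 
(sessionTimeout=60000, connection timeout 15000). The point is that the 
isLeader/notLeader callbacks, which are what "the LeaderElection Agent being 
called" corresponds to, only ever fire for a latch that was actually started:

{code:scala}
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.leader.{LeaderLatch, LeaderLatchListener}
import org.apache.curator.retry.ExponentialBackoffRetry

object LeaderLatchSketch {
  def main(args: Array[String]): Unit = {
    // Placeholders: connection string, ports and znode path are illustrative only.
    // The timeouts mirror the values seen in the master log.
    val client = CuratorFrameworkFactory.newClient(
      "zookeeper-2:2181,zookeeper-3:2181,zookeeper-1:2181",
      60000,                                // session timeout (ms)
      15000,                                // connection timeout (ms)
      new ExponentialBackoffRetry(1000, 3)) // base sleep (ms), max retries
    client.start()

    val latch = new LeaderLatch(client, "/spark/leader_election")
    latch.addListener(new LeaderLatchListener {
      // These callbacks are what "the agent being called" boils down to;
      // they fire only after latch.start() has run against a usable session.
      override def isLeader(): Unit = println("elected leader")
      override def notLeader(): Unit = println("leadership revoked / standby")
    })
    latch.start()

    // Keep the process alive so the callbacks can fire; a real service would
    // close the latch and client on shutdown.
    Thread.sleep(Long.MaxValue)
  }
}
{code}

So if the agent (and hence its latch) was never started because the initial 
connection attempts timed out, a later successful session on its own would not 
trigger these callbacks, which is why I am asking whether larger timeouts/retry 
counts would be enough.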

> Zookeeper LeaderElection Agent not being called by Spark Master
> ---------------------------------------------------------------
>
>                 Key: SPARK-33943
>                 URL: https://issues.apache.org/jira/browse/SPARK-33943
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0
>         Environment: 2 Spark Masters KVMs and 3 Zookeeper KVMs.
>  Operating System - RHEL 6.10
>            Reporter: Saloni
>            Priority: Major
>
> I have 2 Spark masters and 3 ZooKeeper nodes deployed on separate virtual 
> machines. I am using Spark in standalone mode.
> The services come online in the sequence below:
>  # zookeeper-1
>  # sparkmaster-1
>  # sparkmaster-2
>  # zookeeper-2
>  # zookeeper-3
> The above sequence leaves both Spark masters running in STANDBY mode.
> From the logs, I can see that only after the zookeeper-2 service comes up 
> (i.e. two ZooKeeper services are up) is the Spark master able to create a 
> ZooKeeper session. Until zookeeper-2 is up, it retries session creation. 
> However, even after both ZooKeeper services are up and the Persistence Engine 
> is able to connect and create a session successfully, *the ZooKeeper 
> LeaderElection Agent is not called*.
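> Whether the LeaderElection Agent ever registered anything in ZooKeeper can 
> also be checked by listing the recovery directory and looking for an 
> election-related child znode. The snippet below is only a sketch using a plain 
> Curator client: the connection string/ports are placeholders and /spark is the 
> documented default of spark.deploy.zookeeper.dir.
> {code:scala}
> import org.apache.curator.framework.CuratorFrameworkFactory
> import org.apache.curator.retry.RetryOneTime
> 
> import scala.collection.JavaConverters._
> 
> object InspectRecoveryDir {
>   def main(args: Array[String]): Unit = {
>     // Placeholder connection string; the real ports are masked in the logs.
>     val client = CuratorFrameworkFactory.newClient(
>       "zookeeper-2:2181,zookeeper-3:2181,zookeeper-1:2181", new RetryOneTime(1000))
>     client.start()
>     client.blockUntilConnected()
>     // "/spark" is the default spark.deploy.zookeeper.dir; adjust if overridden.
>     client.getChildren.forPath("/spark").asScala.foreach(println)
>     client.close()
>   }
> }
> {code}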
> Logs (spark-master.log):
> {code:java}
> 10:03:47.241 INFO org.apache.spark.internal.Logging:57 - Persisting recovery 
> state to ZooKeeper Initiating client connection, 
> connectString=zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx 
> sessionTimeout=60000 watcher=org.apache.curator.ConnectionState
> ##### Only zookeeper-2 is online #####
> 10:03:47.630 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
> socket connection to server zookeeper-1:xxxx. Will not attempt to 
> authenticate using SASL (unknown error)
> 10:03:50.635 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket 
> error occurred: zookeeper-1:xxxx: No route to host
> 10:03:50.738 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
> socket connection to server zookeeper-2:xxxx. Will not attempt to 
> authenticate using SASL (unknown error)
> 10:03:50.739 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket 
> connection established to zookeeper-2:xxxx, initiating session
> 10:03:50.742 INFO org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to 
> read additional data from server sessionid 0x0, likely server has closed 
> socket, closing socket connection and attempting reconnect
> 10:03:51.842 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
> socket connection to server zookeeper-3:xxxx. Will not attempt to 
> authenticate using SASL (unknown error)
> 10:03:51.843 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket 
> error occurred: zookeeper-3:xxxx: Connection refused 
> 10:04:02.685 ERROR org.apache.curator.ConnectionState:200 - Connection timed 
> out for connection string 
> (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) and timeout (15000) / 
> elapsed (15274)
> org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = 
> ConnectionLoss 
>   at 
> org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) 
>   at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
> ...
> ...
> ...
> 10:04:22.691 ERROR org.apache.curator.ConnectionState:200 - Connection timed 
> out for connection string 
> (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) and timeout (15000) / 
> elapsed (35297) org.apache.curator.CuratorConnectionLossException: 
> KeeperErrorCode = ConnectionLoss 
>   at 
> org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
> ...
> ...
> ...
> 10:04:42.696 ERROR org.apache.curator.ConnectionState:200 - Connection timed 
> out for connection string 
> (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) and timeout (15000) / 
> elapsed (55301) org.apache.curator.CuratorConnectionLossException: 
> KeeperErrorCode = ConnectionLoss 
>   at 
> org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) 
>   at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
> ...
> ...
> ...
> 10:05:32.699 WARN org.apache.curator.ConnectionState:191 - Connection attempt 
> unsuccessful after 105305 (greater than max timeout of 60000). Resetting 
> connection and trying again with a new connection. 
> 10:05:32.864 INFO org.apache.zookeeper.ZooKeeper:693 - Session: 0x0 closed 
> 10:05:32.865 INFO org.apache.zookeeper.ZooKeeper:442 - Initiating client 
> connection, connectString=zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx 
> sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@ 
> 10:05:32.864 INFO org.apache.zookeeper.ClientCnxn$EventThread:522 - 
> EventThread shut down for session: 0x0 
> 10:05:32.969 ERROR org.apache.spark.internal.Logging:94 - Ignoring error 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /x/y 
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) 
> ##### zookeeper-2, zookeeper-3 are online ##### 
> 10:05:47.357 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
> socket connection to server zookeeper-2:xxxx. Will not attempt to 
> authenticate using SASL (unknown error) 
> 10:05:47.358 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket 
> connection established to zookeeper-2:xxxx, initiating session 
> 10:05:47.359 INFO org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to 
> read additional data from server sessionid 0x0, likely server has closed 
> socket, closing socket connection and attempting reconnect 
> 10:05:47.528 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
> socket connection to server zookeeper-1:xxxx. Will not attempt to 
> authenticate using SASL (unknown error)
> 10:05:50.529 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket 
> error occurred: zookeeper-1:xxxx: No route to host 
> 10:05:51.454 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
> socket connection to server zookeeper-3:xxxx. Will not attempt to 
> authenticate using SASL (unknown error) 
> 10:05:51.455 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket 
> connection established to zookeeper-3:xxxx, initiating session 
> 10:05:51.457 INFO org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to 
> read additional data from server sessionid 0x0, likely server has closed 
> socket, closing socket connection and attempting reconnect 
> 10:05:57.564 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
> socket connection to server zookeeper-3:xxxx. Will not attempt to 
> authenticate using SASL (unknown error) 
> 10:05:57.566 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket 
> connection established to zookeeper-3:xxxx, initiating session 
> 10:05:57.574 INFO org.apache.zookeeper.ClientCnxn$SendThread:1299 - Session 
> establishment complete on server zookeeper-3:xxxx, sessionid = xxxx, 
> negotiated timeout = 40000 
> 10:05:57.580 INFO 
> org.apache.curator.framework.state.ConnectionStateManager:228 - State change: 
> CONNECTED {code}
> Steps to reproduce:
> Environment: A cluster of 3 ZooKeeper VMs and a cluster of 2 Spark master VMs
>  # All ZooKeeper nodes and Spark masters are offline
>  # Bring zookeeper-2 online
>  # Bring both Spark masters online
>  # Around 3 minutes after zookeeper-2 comes online, bring zookeeper-3 online
>  # Bring zookeeper-1 online
>  
> Questions:
>  # The last line of the logs above indicates that a ZooKeeper session was 
> successfully established. Why, then, is the ZooKeeper LeaderElection Agent 
> still not being called?
>  # Is there any configuration in Spark that would let us increase the number 
> of retries/timeouts when connecting to ZooKeeper? (The recovery-related 
> properties we are currently aware of are sketched below.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
