Saloni created SPARK-33943:
------------------------------

             Summary: Zookeeper LeaderElection Agent not being called by Spark Master
                 Key: SPARK-33943
                 URL: https://issues.apache.org/jira/browse/SPARK-33943
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.0.0
         Environment: 2 Spark Masters KVMs and 3 Zookeeper KVMs.
Operating System - RHEL 6.6
            Reporter: Saloni


I have 2 Spark masters and 3 ZooKeeper servers deployed on separate 
virtual machines. The services come online in the following sequence:
 # zookeeper-1
 # sparkmaster-1
 # sparkmaster-2
 # zookeeper-2
 # zookeeper-3

The above sequence leads to both the spark masters running in STANDBY mode.

From the logs, I can see that the Spark master is able to create a ZooKeeper 
session only after the zookeeper-2 service comes up (i.e. 2 ZooKeeper services 
are up). Until zookeeper-2 is up, it retries session creation. However, once 
both ZooKeeper services are up and the Persistence Engine has successfully 
connected and created a session, *the ZooKeeper LeaderElection Agent is not 
called*.
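
The observation that a session is created only once 2 of the 3 ZooKeeper services are up matches ZooKeeper's majority-quorum rule. A minimal sketch of that arithmetic, assuming a plain voting ensemble with no observers (the function names are illustrative, not part of Spark or ZooKeeper):

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of a ZooKeeper voting ensemble."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def can_serve_clients(ensemble_size: int, servers_online: int) -> bool:
    """True once enough servers are online to form a quorum."""
    return servers_online >= quorum_size(ensemble_size)


# The 3-server ensemble from this report: one server alone cannot
# form a quorum, two servers can.
for online in range(4):
    print(online, can_serve_clients(3, online))
```

With a single server online, clients can open a socket but the server closes it without creating a session, which is consistent with the "Unable to read additional data from server sessionid 0x0" lines in the log below.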

Logs (spark-master.log):
{code:java}
10:03:47.241 INFO org.apache.spark.internal.Logging:57 - Persisting recovery 
state to ZooKeeper Initiating client connection, 
connectString=zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx 
sessionTimeout=60000 watcher=org.apache.curator.ConnectionState

##### Only zookeeper-2 is online #####

10:03:47.630 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
socket connection to server zookeeper-1:xxxx. Will not attempt to authenticate 
using SASL (unknown error)
10:03:50.635 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket 
error occurred: zookeeper-1:xxxx: No route to host
10:03:50.738 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
socket connection to server zookeeper-2:xxxx. Will not attempt to authenticate 
using SASL (unknown error)
10:03:50.739 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket 
connection established to zookeeper-2:xxxx, initiating session
10:03:50.742 INFO org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to 
read additional data from server sessionid 0x0, likely server has closed 
socket, closing socket connection and attempting reconnect
10:03:51.842 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
socket connection to server zookeeper-3:xxxx. Will not attempt to authenticate 
using SASL (unknown error)
10:03:51.843 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket 
error occurred: zookeeper-3:xxxx: Connection refused 
10:04:02.685 ERROR org.apache.curator.ConnectionState:200 - Connection timed 
out for connection string (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) 
and timeout (15000) / elapsed (15274)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = 
ConnectionLoss at 
org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) at 
org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
...
...
...
10:04:22.691 ERROR org.apache.curator.ConnectionState:200 - Connection timed 
out for connection string (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) 
and timeout (15000) / elapsed (35297) 
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = 
ConnectionLoss at 
org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
...
...
...
10:04:42.696 ERROR org.apache.curator.ConnectionState:200 - Connection timed 
out for connection string (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) 
and timeout (15000) / elapsed (55301) 
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = 
ConnectionLoss at 
org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) at 
org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
...
...
...
10:05:32.699 WARN org.apache.curator.ConnectionState:191 - Connection attempt 
unsuccessful after 105305 (greater than max timeout of 60000). Resetting 
connection and trying again with a new connection. 
10:05:32.864 INFO org.apache.zookeeper.ZooKeeper:693 - Session: 0x0 closed 
10:05:32.865 INFO org.apache.zookeeper.ZooKeeper:442 - Initiating client 
connection, connectString=zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx 
sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@ 
10:05:32.864 INFO org.apache.zookeeper.ClientCnxn$EventThread:522 - EventThread 
shut down for session: 0x0 
10:05:32.969 ERROR org.apache.spark.internal.Logging:94 - Ignoring error 
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss for /x/y at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:102) at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:54) 

##### zookeeper-2, zookeeper-3 are online ##### 

10:05:47.357 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
socket connection to server zookeeper-2:xxxx. Will not attempt to authenticate 
using SASL (unknown error) 
10:05:47.358 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket 
connection established to zookeeper-2:xxxx, initiating session 
10:05:47.359 INFO org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to 
read additional data from server sessionid 0x0, likely server has closed 
socket, closing socket connection and attempting reconnect 
10:05:47.528 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
socket connection to server zookeeper-1:xxxx. Will not attempt to authenticate 
using SASL (unknown error) 
10:05:50.529 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket 
error occurred: zookeeper-1:xxxx: No route to host 
10:05:51.454 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
socket connection to server zookeeper-3:xxxx. Will not attempt to authenticate 
using SASL (unknown error) 
10:05:51.455 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket 
connection established to zookeeper-3:xxxx, initiating session 
10:05:51.457 INFO org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to 
read additional data from server sessionid 0x0, likely server has closed 
socket, closing socket connection and attempting reconnect 
10:05:57.564 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening 
socket connection to server zookeeper-3:xxxx. Will not attempt to authenticate 
using SASL (unknown error) 
10:05:57.566 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket 
connection established to zookeeper-3:xxxx, initiating session 
10:05:57.574 INFO org.apache.zookeeper.ClientCnxn$SendThread:1299 - Session 
establishment complete on server zookeeper-3:xxxx, sessionid = xxxx, negotiated 
timeout = 40000 
10:05:57.580 INFO org.apache.curator.framework.state.ConnectionStateManager:228 
- State change: CONNECTED
{code}
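
To check a full spark-master.log for this symptom, the log can be scanned for a leadership message after the first CONNECTED state change. The sketch below is only an aid for triage. The two leadership strings are an assumption about what the leader election agent and the master log when elected, and should be verified against the Spark version in use. The helper name is hypothetical.

```python
import re
from typing import Iterable, Optional

CONNECTED = re.compile(r"State change: CONNECTED")
# Assumed leadership messages; verify against the Spark version in use.
LEADERSHIP = re.compile(r"We have gained leadership|I have been elected leader")


def election_after_connect(lines: Iterable[str]) -> Optional[bool]:
    """Return None if CONNECTED never appears; otherwise report whether
    a leadership message appears after the first CONNECTED line."""
    connected = False
    for line in lines:
        if not connected:
            if CONNECTED.search(line):
                connected = True
        elif LEADERSHIP.search(line):
            return True
    return False if connected else None


# Last two entries of the log above, joined onto single lines.
sample = [
    "10:05:57.574 INFO org.apache.zookeeper.ClientCnxn$SendThread:1299 - "
    "Session establishment complete on server zookeeper-3:xxxx",
    "10:05:57.580 INFO org.apache.curator.framework.state."
    "ConnectionStateManager:228 - State change: CONNECTED",
]
print(election_after_connect(sample))  # False: connected, no election seen
```

A result of `False` on the complete log would correspond to the behaviour reported here: the session is established, yet no election is logged.
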
Steps to reproduce:

Environment: a cluster of 3 ZooKeeper servers and a cluster of 2 Spark master VMs
 # Start with all ZooKeeper servers and Spark masters offline
 # Bring zookeeper-2 online
 # Bring both Spark masters online
 # About 3 minutes after zookeeper-2 came online, bring zookeeper-3 online
 # Bring zookeeper-1 online

 

Questions:
 # The last line of the logs above indicates that a ZooKeeper session was 
successfully established. Why is the ZooKeeper LeaderElection Agent not 
called at that point?
 # Is there any Spark configuration that increases the number of retries or 
the timeouts used when connecting to ZooKeeper?
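
For context on the second question, the standalone high-availability properties described in the Spark documentation are `spark.deploy.recoveryMode`, `spark.deploy.zookeeper.url` and `spark.deploy.zookeeper.dir`, usually passed through `SPARK_DAEMON_JAVA_OPTS`. The sketch below only assembles that option string. The helper name is hypothetical, the ports stay redacted as in the log, and `/spark` is the documented default directory, not necessarily the value used in this deployment. None of these three properties is a retry or timeout setting, so the sketch does not answer the question.

```python
from typing import Sequence


def ha_daemon_opts(zk_hosts: Sequence[str], zk_dir: str) -> str:
    """Build a SPARK_DAEMON_JAVA_OPTS value for ZooKeeper recovery mode."""
    props = {
        "spark.deploy.recoveryMode": "ZOOKEEPER",
        "spark.deploy.zookeeper.url": ",".join(zk_hosts),
        "spark.deploy.zookeeper.dir": zk_dir,
    }
    return " ".join(f"-D{key}={value}" for key, value in props.items())


# Host order copied from the connectString in the log; ports are redacted.
opts = ha_daemon_opts(
    ["zookeeper-2:xxxx", "zookeeper-3:xxxx", "zookeeper-1:xxxx"],
    zk_dir="/spark",  # documented default; the actual value is not given
)
print(opts)
```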



