Hello,

Could you please help me with the questions below?

I have 2 Spark masters and 3 ZooKeeper nodes deployed on separate
virtual machines. The services come online in the following sequence:

   1. zookeeper-1
   2. sparkmaster-1
   3. sparkmaster-2
   4. zookeeper-2
   5. zookeeper-3

The above sequence leads to both Spark masters running in STANDBY mode.

From the logs, I can see that the Spark master is able to create a
ZooKeeper session only after the zookeeper-2 service comes up (i.e.
2 ZooKeeper services are up). Until zookeeper-2 is up, it retries session
creation. However, after both ZooKeeper services are up and the
Persistence Engine has successfully connected and created a session, the
ZooKeeper LeaderElection Agent is not called.
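
To make the quorum reasoning above concrete, here is a small illustrative sketch. It assumes the usual majority rule for a ZooKeeper ensemble (more than half of the nodes must be up) and replays the startup sequence listed earlier. The function names are hypothetical and are not part of Spark or ZooKeeper.

```python
# Illustrative only: majority-quorum arithmetic for the startup order above.
# quorum_size and first_step_with_quorum are hypothetical helper names.

STARTUP_ORDER = [
    "zookeeper-1",
    "sparkmaster-1",
    "sparkmaster-2",
    "zookeeper-2",
    "zookeeper-3",
]


def quorum_size(ensemble_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def first_step_with_quorum(order, ensemble_size):
    """Return the 1-based step at which enough ZooKeeper nodes are up,
    or None if the sequence never reaches a majority."""
    zookeepers_up = 0
    for step, service in enumerate(order, start=1):
        if service.startswith("zookeeper-"):
            zookeepers_up += 1
        if zookeepers_up >= quorum_size(ensemble_size):
            return step
    return None


print(quorum_size(3))                            # 2
print(first_step_with_quorum(STARTUP_ORDER, 3))  # 4
```

With 3 ZooKeeper nodes the majority is 2, so in this sequence a quorum first exists at step 4, after both Spark masters have already started.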
Logs:

    10:03:47.241 INFO  org.apache.spark.internal.Logging:57 - Persisting recovery state to ZooKeeper
    Initiating client connection, connectString=zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx sessionTimeout=60000 watcher=org.apache.curator.ConnectionState

    ##### Only zookeeper-2 is online #####

    10:03:47.630 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-1:xxxx. Will not attempt to authenticate using SASL (unknown error)
    10:03:50.635 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket error occurred: zookeeper-1:xxxx: No route to host
    10:03:50.738 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-2:xxxx. Will not attempt to authenticate using SASL (unknown error)
    2020-12-18 10:03:50.739 INFO  org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-2:xxxx, initiating session
    10:03:50.742 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
    10:03:51.842 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-3:xxxx. Will not attempt to authenticate using SASL (unknown error)
    10:03:51.843 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket error occurred: zookeeper-3:xxxx: Connection refused

    10:04:02.685 ERROR org.apache.curator.ConnectionState:200 - Connection timed out for connection string (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) and timeout (15000) / elapsed (15274)
    org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
        at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)

    10:04:22.691 ERROR org.apache.curator.ConnectionState:200 - Connection timed out for connection string (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) and timeout (15000) / elapsed (35297)
    org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)

    10:04:42.696 ERROR org.apache.curator.ConnectionState:200 - Connection timed out for connection string (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) and timeout (15000) / elapsed (55301)
    org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
        at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)

    10:05:32.699 WARN  org.apache.curator.ConnectionState:191 - Connection attempt unsuccessful after 105305 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.
    10:05:32.864 INFO  org.apache.zookeeper.ZooKeeper:693 - Session: 0x0 closed
    10:05:32.865 INFO  org.apache.zookeeper.ZooKeeper:442 - Initiating client connection, connectString=zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@
    10:05:32.864 INFO  org.apache.zookeeper.ClientCnxn$EventThread:522 - EventThread shut down for session: 0x0

    10:05:32.969 ERROR org.apache.spark.internal.Logging:94 - Ignoring error
    org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /x/y
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)

    ##### zookeeper-2, zookeeper-3 are online #####

    10:05:47.357 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-2:xxxx. Will not attempt to authenticate using SASL (unknown error)
    10:05:47.358 INFO  org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-2:xxxx, initiating session
    10:05:47.359 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
    10:05:47.528 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-1:xxxx. Will not attempt to authenticate using SASL (unknown error)
    10:05:50.529 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket error occurred: zookeeper-1:xxxx: No route to host
    10:05:51.454 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-3:xxxx. Will not attempt to authenticate using SASL (unknown error)
    10:05:51.455 INFO  org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-3:xxxx, initiating session
    10:05:51.457 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect

    10:05:57.564 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-3:xxxx. Will not attempt to authenticate using SASL (unknown error)
    10:05:57.566 INFO  org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-3:xxxx, initiating session
    10:05:57.574 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1299 - Session establishment complete on server zookeeper-3:xxxx, sessionid = xxxx, negotiated timeout = 40000
    10:05:57.580 INFO  org.apache.curator.framework.state.ConnectionStateManager:228 - State change: CONNECTED

Questions:

   1. The last line of the logs above indicates that a ZooKeeper session
   was successfully established. Why is the ZooKeeper LeaderElection Agent
   not called after that?
   2. Is there any Spark configuration that increases the number of retries
   or the timeouts used when connecting to ZooKeeper?

Any insight on this is appreciated.

Thanks & Regards,
Saloni R. Mehta
