[ https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762555#comment-15762555 ]
Ed Tyrrill commented on SPARK-15544:
------------------------------------

I'm going to add that this is very easy to reproduce. It happens reliably if you shut down the zookeeper node that is currently the leader. I configured systemd to automatically restart the spark master, and while the spark master process starts, the spark master on all three nodes doesn't really work: it continually tries to reconnect to zookeeper until I bring the shut-down zookeeper node back up. Spark should be able to work with two of the three zookeeper nodes, but instead it logs messages like this every couple of seconds on all three spark master nodes until I bring back up the one zookeeper node that I shut down, zk02:

{code:title=master log}
2016-12-19 14:31:10.175 INFO org.apache.zookeeper.ClientCnxn.logStartConnect - Opening socket connection to server zk01/10.0.xx.xx:xxxx. Will not attempt to authenticate using SASL (unknown error)
2016-12-19 14:31:10.176 INFO org.apache.zookeeper.ClientCnxn.primeConnection - Socket connection established to zk01/10.0.xx.xx:xxxx, initiating session
2016-12-19 14:31:10.177 INFO org.apache.zookeeper.ClientCnxn.run - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2016-12-19 14:31:10.724 INFO org.apache.zookeeper.ClientCnxn.logStartConnect - Opening socket connection to server zk02/10.0.xx.xx:xxxx. Will not attempt to authenticate using SASL (unknown error)
2016-12-19 14:31:10.725 WARN org.apache.zookeeper.ClientCnxn.run - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2016-12-19 14:31:10.828 INFO org.apache.zookeeper.ClientCnxn.logStartConnect - Opening socket connection to server zk03/10.0.xx.xx:xxxx. Will not attempt to authenticate using SASL (unknown error)
2016-12-19 14:31:10.830 INFO org.apache.zookeeper.ClientCnxn.primeConnection - Socket connection established to zk03/10.0.xx.xx:xxxx, initiating session
{code}

Zookeeper itself has elected a new leader, and Kafka, which also uses zookeeper, doesn't have any trouble during this time. Also important to note: if you shut down a non-leader zookeeper node, spark doesn't have any trouble either.

> Bouncing Zookeeper node causes Active spark master to exit
> ----------------------------------------------------------
>
>                 Key: SPARK-15544
>                 URL: https://issues.apache.org/jira/browse/SPARK-15544
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.1
>        Environment: Ubuntu 14.04. Zookeeper 3.4.6 with 3-node quorum
>            Reporter: Steven Lowenthal
>
> Shutting down a single zookeeper node caused the spark master to exit. The master should have connected to a second zookeeper node.
> {code:title=log output}
> 16/05/25 18:21:28 INFO master.Master: Launching executor app-20160525182128-0006/1 on worker worker-20160524013212-10.16.28.76-59138
> 16/05/25 18:21:28 INFO master.Master: Launching executor app-20160525182128-0006/2 on worker worker-20160524013204-10.16.21.217-47129
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x154dfc0426b0054, likely server has closed socket, closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x254c701f28d0053, likely server has closed socket, closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO master.ZooKeeperLeaderElectionAgent: We have lost leadership
> 16/05/26 00:16:01 ERROR master.Master: Leadership has been revoked -- master shutting down.
> {code}
>
> spark-env.sh:
> {code:title=spark-env.sh}
> export SPARK_LOCAL_DIRS=/ephemeral/spark/local
> export SPARK_WORKER_DIR=/ephemeral/spark/work
> export SPARK_LOG_DIR=/var/log/spark
> export HADOOP_CONF_DIR=/home/ubuntu/hadoop-2.6.3/etc/hadoop
> export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=gn5456-zookeeper-01:2181,gn5456-zookeeper-02:2181,gn5456-zookeeper-03:2181"
> export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
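The spark-env.sh above names three servers in spark.deploy.zookeeper.url as a comma-separated host:port list. As a rough sketch (not Spark's actual code), splitting that string shows the fallback entries the master's zookeeper client has available; the point of the bug is that losing any one of these three should still leave a working quorum of two:

```shell
#!/bin/sh
# Sketch: split the spark.deploy.zookeeper.url value (copied from the
# spark-env.sh above) into the individual servers the master can try.
# The client is expected to fall back to another entry when one is down.
url="gn5456-zookeeper-01:2181,gn5456-zookeeper-02:2181,gn5456-zookeeper-03:2181"
printf '%s\n' "$url" | tr ',' '\n'
```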
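The reproduction described in the comment hinges on stopping the zookeeper node that is currently the leader. One way to find it is ZooKeeper's four-letter `stat` command; the sketch below parses a simulated `stat` reply, since on a live host the reply would come from `echo stat | nc <host> 2181` (the sample output here is hypothetical):

```shell
#!/bin/sh
# Sketch with simulated input: extract the Mode line from a ZooKeeper
# "stat" reply to tell a leader from a follower. On a real host the
# reply would come from: echo stat | nc <host> 2181
stat_reply="Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT
Latency min/avg/max: 0/0/13
Mode: leader
Node count: 127"
mode=$(printf '%s\n' "$stat_reply" | awk -F': ' '/^Mode:/ {print $2}')
echo "$mode"
```

A node reporting `leader` is the one to bounce to trigger the bug; bouncing a node reporting `follower` does not reproduce it.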