[
https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762555#comment-15762555
]
Ed Tyrrill commented on SPARK-15544:
I'm going to add that this is very easy to reproduce. It happens reliably
if you shut down the zookeeper node that is currently the leader. I configured
systemd to automatically restart the spark master, and although the spark
master process starts, the spark master on all three nodes doesn't really
work: it continually tries to reconnect to zookeeper until I bring the stopped
zookeeper node back up. Spark should be able to work with two of the three
zookeeper nodes, but instead it logs messages like this repeatedly, every
couple of seconds, on all three spark master nodes until I bring back up the
one zookeeper node that I shut down, zk02:
2016-12-19 14:31:10.175 INFO org.apache.zookeeper.ClientCnxn.logStartConnect -
Opening socket connection to server zk01/10.0.xx.xx:. Will not attempt to
authenticate using SASL (unknown error)
2016-12-19 14:31:10.176 INFO org.apache.zookeeper.ClientCnxn.primeConnection -
Socket connection established to zk01/10.0.xx.xx:, initiating session
2016-12-19 14:31:10.177 INFO org.apache.zookeeper.ClientCnxn.run - Unable to
read additional data from server sessionid 0x0, likely server has closed
socket, closing socket connection and attempting reconnect
2016-12-19 14:31:10.724 INFO org.apache.zookeeper.ClientCnxn.logStartConnect -
Opening socket connection to server zk02/10.0.xx.xx:. Will not attempt to
authenticate using SASL (unknown error)
2016-12-19 14:31:10.725 WARN org.apache.zookeeper.ClientCnxn.run - Session 0x0
for server null, unexpected error, closing socket connection and attempting
reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2016-12-19 14:31:10.828 INFO org.apache.zookeeper.ClientCnxn.logStartConnect -
Opening socket connection to server zk03/10.0.xx.xx:. Will not attempt to
authenticate using SASL (unknown error)
2016-12-19 14:31:10.830 INFO org.apache.zookeeper.ClientCnxn.primeConnection -
Socket connection established to zk03/10.0.xx.xx:, initiating session
Zookeeper itself has selected a new leader, and Kafka, which also uses
zookeeper, doesn't have any trouble during this time. It is also important to
note that if you shut down a non-leader zookeeper node, spark doesn't have any
trouble either.
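
For anyone trying to reproduce this, the first step is knowing which zookeeper node is currently the leader. A minimal sketch of one way to check, using ZooKeeper's `srvr` four-letter command, whose reply includes a `Mode:` line. The helper names `parse_mode` and `query_mode` are hypothetical, and port 2181 is an assumption (the port is elided in the log above; 2181 is the ZooKeeper default). Newer ZooKeeper releases may need `srvr` enabled through the four-letter-word whitelist.

```python
import socket


def parse_mode(srvr_output):
    """Return the value of the 'Mode:' line in `srvr` output, or None."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def query_mode(host, port=2181, timeout=2.0):
    """Send `srvr` to one ZooKeeper server and return its reported mode.

    Returns None if the server is down, unreachable, or reports no mode.
    The default port 2181 is an assumption, not taken from the log above.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"srvr")
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:  # server closes the socket after replying
                    break
                chunks.append(data)
    except OSError:
        return None
    return parse_mode(b"".join(chunks).decode("utf-8", errors="replace"))
```

Calling `query_mode` for each of zk01, zk02 and zk03 should report `leader` for exactly one node and `follower` for the others; the node reporting `leader` is the one to shut down to trigger the behaviour described here.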
> Bouncing Zookeeper node causes Active spark master to exit
> --
>
> Key: SPARK-15544
> URL: https://issues.apache.org/jira/browse/SPARK-15544
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.1
> Environment: Ubuntu 14.04. Zookeeper 3.4.6 with 3-node quorum
> Reporter: Steven Lowenthal
>
> Shutting down a single zookeeper node caused the spark master to exit. The
> master should have connected to a second zookeeper node.
> {code:title=log output}
> 16/05/25 18:21:28 INFO master.Master: Launching executor
> app-20160525182128-0006/1 on worker worker-20160524013212-10.16.28.76-59138
> 16/05/25 18:21:28 INFO master.Master: Launching executor
> app-20160525182128-0006/2 on worker worker-20160524013204-10.16.21.217-47129
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data
> from server sessionid 0x154dfc0426b0054, likely server has closed socket,
> closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data
> from server sessionid 0x254c701f28d0053, likely server has closed socket,
> closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO master.ZooKeeperLeaderElectionAgent: We have lost
> leadership
> 16/05/26 00:16:01 ERROR master.Master: Leadership has been revoked -- master
> shutting down.
> {code}
> spark-env.sh:
> {code:title=spark-env.sh}
> export SPARK_LOCAL_DIRS=/ephemeral/spark/local
> export SPARK_WORKER_DIR=/ephemeral/spark/work
> export SPARK_LOG_DIR=/var/log/spark
> export HADOOP_CONF_DIR=/home/ubuntu/hadoop-2.6.3/etc/hadoop
> export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=gn5456-zookeeper-01:2181,gn5456-zookeeper-02:2181,gn5456-zookeeper-03:2181"
> export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
> {code}
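
When checking a setup like the one quoted above, it can help to confirm which zookeeper servers the master has actually been told about. A rough sketch, assuming the value of `SPARK_DAEMON_JAVA_OPTS` is available as a string; the helper names `java_system_properties` and `zookeeper_servers` are hypothetical and not part of Spark.

```python
import shlex


def java_system_properties(opts):
    """Collect -Dkey=value tokens from a JVM options string into a dict."""
    props = {}
    for token in shlex.split(opts):
        if token.startswith("-D") and "=" in token:
            key, value = token[2:].split("=", 1)
            props[key] = value
    return props


def zookeeper_servers(url):
    """Split a comma-separated host:port list into (host, port) tuples.

    The port is None when an entry does not specify one.
    """
    servers = []
    for entry in url.split(","):
        host, _, port = entry.strip().partition(":")
        servers.append((host, int(port) if port else None))
    return servers


# Value copied from the spark-env.sh quoted above.
opts = ("-Dspark.deploy.recoveryMode=ZOOKEEPER "
        "-Dspark.deploy.zookeeper.url=gn5456-zookeeper-01:2181,"
        "gn5456-zookeeper-02:2181,gn5456-zookeeper-03:2181")
props = java_system_properties(opts)
servers = zookeeper_servers(props["spark.deploy.zookeeper.url"])
```

With the quoted configuration, `servers` lists all three quorum members, so the master has alternatives to fall back on when one of them goes away.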