[ https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762555#comment-15762555 ]
Ed Tyrrill commented on SPARK-15544:
------------------------------------

I'm going to add that this is very easy to reproduce. It happens reliably if you shut down the zookeeper node that is currently the leader. I configured systemd to automatically restart the spark master, and while the spark master process starts, the spark master on all three nodes doesn't really work: it continually tries to reconnect to zookeeper until I bring the shut-down zookeeper node back up. Spark should be able to work with two of the three zookeeper nodes, but instead it logs messages like this every couple of seconds on all three spark master nodes until I bring back up the one zookeeper node that I shut down, zk02:

{code:title=master log}
2016-12-19 14:31:10.175 INFO org.apache.zookeeper.ClientCnxn.logStartConnect - Opening socket connection to server zk01/10.0.xx.xx:xxxx. Will not attempt to authenticate using SASL (unknown error)
2016-12-19 14:31:10.176 INFO org.apache.zookeeper.ClientCnxn.primeConnection - Socket connection established to zk01/10.0.xx.xx:xxxx, initiating session
2016-12-19 14:31:10.177 INFO org.apache.zookeeper.ClientCnxn.run - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2016-12-19 14:31:10.724 INFO org.apache.zookeeper.ClientCnxn.logStartConnect - Opening socket connection to server zk02/10.0.xx.xx:xxxx. Will not attempt to authenticate using SASL (unknown error)
2016-12-19 14:31:10.725 WARN org.apache.zookeeper.ClientCnxn.run - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2016-12-19 14:31:10.828 INFO org.apache.zookeeper.ClientCnxn.logStartConnect - Opening socket connection to server zk03/10.0.xx.xx:xxxx. Will not attempt to authenticate using SASL (unknown error)
2016-12-19 14:31:10.830 INFO org.apache.zookeeper.ClientCnxn.primeConnection - Socket connection established to zk03/10.0.xx.xx:xxxx, initiating session
{code}

Zookeeper itself has elected a new leader, and Kafka, which also uses zookeeper, doesn't have any trouble during this time. Also important to note: if you shut down a non-leader zookeeper node, spark doesn't have any trouble either.

> Bouncing Zookeeper node causes Active spark master to exit
> ----------------------------------------------------------
>
>                 Key: SPARK-15544
>                 URL: https://issues.apache.org/jira/browse/SPARK-15544
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.1
>        Environment: Ubuntu 14.04. Zookeeper 3.4.6 with 3-node quorum
>            Reporter: Steven Lowenthal
>
> Shutting down a single zookeeper node caused the spark master to exit. The master should have connected to a second zookeeper node.
> {code:title=log output}
> 16/05/25 18:21:28 INFO master.Master: Launching executor app-20160525182128-0006/1 on worker worker-20160524013212-10.16.28.76-59138
> 16/05/25 18:21:28 INFO master.Master: Launching executor app-20160525182128-0006/2 on worker worker-20160524013204-10.16.21.217-47129
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x154dfc0426b0054, likely server has closed socket, closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x254c701f28d0053, likely server has closed socket, closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO master.ZooKeeperLeaderElectionAgent: We have lost leadership
> 16/05/26 00:16:01 ERROR master.Master: Leadership has been revoked -- master shutting down.
> {code}
>
> spark-env.sh:
> {code:title=spark-env.sh}
> export SPARK_LOCAL_DIRS=/ephemeral/spark/local
> export SPARK_WORKER_DIR=/ephemeral/spark/work
> export SPARK_LOG_DIR=/var/log/spark
> export HADOOP_CONF_DIR=/home/ubuntu/hadoop-2.6.3/etc/hadoop
> export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=gn5456-zookeeper-01:2181,gn5456-zookeeper-02:2181,gn5456-zookeeper-03:2181"
> export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
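The spark-env.sh above names three servers in spark.deploy.zookeeper.url as a comma-separated host:port list. As a rough sketch (not Spark's actual code), splitting that string shows the fallback entries the master's zookeeper client has available; the point of the bug is that losing any one of these three should still leave a working quorum of two:

```shell
#!/bin/sh
# Sketch: split the spark.deploy.zookeeper.url value (copied from the
# spark-env.sh above) into the individual servers the master can try.
# The client is expected to fall back to another entry when one is down.
url="gn5456-zookeeper-01:2181,gn5456-zookeeper-02:2181,gn5456-zookeeper-03:2181"
printf '%s\n' "$url" | tr ',' '\n'
```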
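The reproduction described in the comment hinges on stopping the zookeeper node that is currently the leader. One way to find it is ZooKeeper's four-letter `stat` command; the sketch below parses a simulated `stat` reply, since on a live host the reply would come from `echo stat | nc <host> 2181` (the sample output here is hypothetical):

```shell
#!/bin/sh
# Sketch with simulated input: extract the Mode line from a ZooKeeper
# "stat" reply to tell a leader from a follower. On a real host the
# reply would come from: echo stat | nc <host> 2181
stat_reply="Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT
Latency min/avg/max: 0/0/13
Mode: leader
Node count: 127"
mode=$(printf '%s\n' "$stat_reply" | awk -F': ' '/^Mode:/ {print $2}')
echo "$mode"
```

A node reporting `leader` is the one to bounce to trigger the bug; bouncing a node reporting `follower` does not reproduce it.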