[ https://issues.apache.org/jira/browse/ZOOKEEPER-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16739751#comment-16739751 ]
Brian Nixon commented on ZOOKEEPER-2669: ---------------------------------------- Is this related to ZOOKEEPER-3240 ? > follower failed to reconnect to leader after a network error > ------------------------------------------------------------- > > Key: ZOOKEEPER-2669 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2669 > Project: ZooKeeper > Issue Type: Bug > Components: quorum, server > Affects Versions: 3.4.9 > Environment: CentOS7 > Reporter: Zhenghua Chen > Priority: Major > > We have a zookeeper cluster with 3 nodes named s1, s2, s3 > By mistake, we shut down the ethernet interface of s2, and zk follower shut > down(zk process remains there) > Later, after ethernet up again, s2 failed to reconnect to leader s3 to be a > follower > follower s2 keeps printing log like this: > {quote} > 2017-01-19 16:40:58,956 WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:7181] > o.a.z.s.q.Learner - Got zxid 0x320001019f expected 0x1 > 2017-01-19 16:40:58,956 ERROR [SyncThread:1] o.a.z.s.ZooKeeperCriticalThread > - Severe unrecoverable error, from thread : SyncThread:1 > java.nio.channels.ClosedChannelException: null > at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:99) > at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:250) > at > org.apache.zookeeper.server.persistence.Util.padLogFile(Util.java:215) > at > org.apache.zookeeper.server.persistence.FileTxnLog.padFile(FileTxnLog.java:241) > at > org.apache.zookeeper.server.persistence.FileTxnLog.append(FileTxnLog.java:219) > at > org.apache.zookeeper.server.persistence.FileTxnSnapLog.append(FileTxnSnapLog.java:314) > at org.apache.zookeeper.server.ZKDatabase.append(ZKDatabase.java:470) > at > org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:140) > 2017-01-19 16:40:58,956 INFO [SyncThread:1] > o.a.z.s.ZooKeeperServerListenerImpl - Thread SyncThread:1 exits, error code 1 > 2017-01-19 16:40:58,956 INFO [SyncThread:1] o.a.z.s.SyncRequestProcessor - > SyncRequestProcessor exited! > 2017-01-19 16:40:58,957 INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:7181] > o.a.z.s.q.Learner - shutdown called > java.lang.Exception: shutdown Follower > at > org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:164) > at > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850) > {quote} > And, leader s3 keeps printing log like this: > {quote} > 2017-01-19 16:30:50,452 INFO [LearnerHandler-/192.168.40.51:35949] > o.a.z.s.q.LearnerHandler - Follower sid: 1 : info : > org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@95258f0 > 2017-01-19 16:30:50,452 INFO [LearnerHandler-/192.168.40.51:35949] > o.a.z.s.q.LearnerHandler - Synchronizing with Follower sid: 1 > maxCommittedLog=0x320001019e minCommittedLog=0x320000ffaa > peerLastZxid=0x2300000000 > 2017-01-19 16:30:50,453 WARN [LearnerHandler-/192.168.40.51:35949] > o.a.z.s.q.LearnerHandler - Unhandled proposal scenario > 2017-01-19 16:30:50,453 INFO [LearnerHandler-/192.168.40.51:35949] > o.a.z.s.q.LearnerHandler - Sending SNAP > 2017-01-19 16:30:50,453 INFO [LearnerHandler-/192.168.40.51:35949] > o.a.z.s.q.LearnerHandler - Sending snapshot last zxid of peer is 0x2300000000 > zxid of leader is 0x320001019esent zxid of db as 0x320001019e > 2017-01-19 16:30:50,461 INFO [LearnerHandler-/192.168.40.51:35949] > o.a.z.s.q.LearnerHandler - Received NEWLEADER-ACK message from 1 > 2017-01-19 16:30:51,738 ERROR [LearnerHandler-/192.168.40.51:35934] > o.a.z.s.q.LearnerHandler - Unexpected exception causing shutdown while sock > still open > java.net.SocketTimeoutException: Read timed out > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.read(SocketInputStream.java:152) > at java.net.SocketInputStream.read(SocketInputStream.java:122) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) > at java.io.BufferedInputStream.read(BufferedInputStream.java:254) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) > at > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) > at > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99) > at > org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:542) > {quote} > we execute netstat, found lots of close wait socket in s2, and never closed. > {quote} > tcp6 10865 0 192.168.40.51:47181 192.168.40.57:7288 > CLOSE_WAIT 2217/java > tcp6 2576 0 192.168.40.51:57181 192.168.40.57:7288 > CLOSE_WAIT 2217/java > {quote} > seems that s2 has a connection leak. > after restart zk process of s2, it works fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005)