That was/is the original intent. ZK was built to "fail fast" when it didn't know how to handle a particular case, or when handling that case might be error-prone. The expectation is that a parent (supervisor) process will restart the ZK server process when it fails.
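To make the "parent restarts the server" expectation concrete, here is a minimal sketch of such a parent in Java. It is only an illustration, not anything shipped with ZooKeeper: the class name `RestartSupervisor`, the method names, the restart budget, and the backoff value are all assumptions made for the example. In a real deployment this job is normally done by the host's service manager rather than by hand-written code.

```java
import java.io.IOException;
import java.util.function.IntSupplier;

// Hypothetical supervisor sketch; none of these names come from ZooKeeper.
public class RestartSupervisor {

    /**
     * Calls runOnce until it returns exit code 0 or the restart budget is
     * used up. Returns the total number of launches (first start included).
     */
    public static int supervise(IntSupplier runOnce, int maxRestarts, long backoffMillis) {
        int launches = 0;
        while (true) {
            launches++;
            int exitCode = runOnce.getAsInt();
            // Stop on a clean exit, or when no restarts are left.
            if (exitCode == 0 || launches > maxRestarts) {
                return launches;
            }
            try {
                // Pause before restarting so a crash loop does not spin.
                Thread.sleep(backoffMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return launches;
            }
        }
    }

    /** Starts the command as a child process and blocks until it exits. */
    public static int runProcess(String... command) {
        try {
            return new ProcessBuilder(command).inheritIO().start().waitFor();
        } catch (IOException e) {
            return -1; // could not start the child; treat as a failure
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: RestartSupervisor <command> [args...]");
            return;
        }
        // Assumed policy: up to 5 restarts, 2 seconds apart.
        int launches = supervise(() -> runProcess(args), 5, 2000L);
        System.out.println("launches: " + launches);
    }
}
```

The command to supervise is passed on the command line, so the sketch does not assume any particular ZooKeeper start script. The point is only that the server process may exit on a fatal error and something outside it is expected to bring it back.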
Patrick

On Wed, May 22, 2019 at 6:27 PM Qian Zhang <[email protected]> wrote:

> Hi Andor,
>
> I am using ZooKeeper release 3.4.10.
>
> I checked the code, if follower fails to read from leader (e.g., read
> timeout), it will close the socket, see
>
> https://github.com/apache/zookeeper/blob/release-3.4.10/src/java/main/org/apache/zookeeper/server/quorum/Follower.java#L91:L85
> for
> details. And once the socket is close, it will make follower fails to write
> (I guess same socket is used here) which will be treated as an severe
> unrecoverable error, and then shutdown follower, see
>
> https://github.com/apache/zookeeper/blob/release-3.4.10/src/java/main/org/apache/zookeeper/server/quorum/FollowerRequestProcessor.java#L90:L95
> and
>
> https://github.com/apache/zookeeper/blob/release-3.4.10/src/java/main/org/apache/zookeeper/server/ZooKeeperCriticalThread.java#L48:L51
> .
>
> So it seems shutting down follower when it cannot read from leader is the
> design behavior? Or if my understanding is wrong can you please let me know
> the design behavior in this case? Thanks!
>
>
> Regards,
> Qian Zhang
>
>
> On Wed, May 22, 2019 at 8:52 AM Qian Zhang <[email protected]> wrote:
>
> > Anyone has any ideas?
> >
> > Regards,
> > Qian Zhang
> >
> >
> > On Sun, May 19, 2019 at 6:15 PM Qian Zhang <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I have a ZooKeeper cluster which has 5 nodes. Today the leader cannot be
> >> connected due to a hardware issue, and then I found the 4 followers just
> >> shutdown, here is the logs:
> >>
> >>> May 18 15:34:28 MD001076 java[29148]: [myid:1] WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
> >>> java.net.SocketTimeoutException: Read timed out
> >>>     at java.net.SocketInputStream.socketRead0(Native Method)
> >>>     at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> >>>     at java.net.SocketInputStream.read(SocketInputStream.java:171)
> >>>     at java.net.SocketInputStream.read(SocketInputStream.java:141)
> >>>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> >>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> >>>     at java.io.DataInputStream.readInt(DataInputStream.java:387)
> >>>     at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> >>>     at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> >>>     at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
> >>>     at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
> >>>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
> >>>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:937)
> >>> May 18 15:34:28 MD001076 java[29148]: [myid:1] INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /10.249.255.10:42306
> >>> May 18 15:34:28 MD001076 java[29148]: [myid:1] WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@896] - Connection request from old client /10.249.255.10:42306; will be dropped if server is in r-o mode
> >>> May 18 15:34:28 MD001076 java[29148]: [myid:1] INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@942] - Client attempting to establish new session at /10.249.255.10:42306
> >>> May 18 15:34:28 MD001076 java[29148]: [myid:1] ERROR [FollowerRequestProcessor:1:ZooKeeperCriticalThread@49] - Severe unrecoverable error, from thread : FollowerRequestProcessor:1
> >>> java.net.SocketException: Socket closed
> >>>     at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:118)
> >>>     at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
> >>>     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> >>>     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> >>>     at org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:139)
> >>>     at org.apache.zookeeper.server.quorum.Learner.request(Learner.java:188)
> >>>     at org.apache.zookeeper.server.quorum.FollowerRequestProcessor.run(FollowerRequestProcessor.java:90)
> >>> May 18 15:34:28 MD001076 java[29148]: [myid:1] INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
> >>> java.lang.Exception: shutdown Follower
> >>>     at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
> >>>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:941)
> >>
> >>
> >> I am confused why all followers shutdown in this case which makes the
> >> whole ZooKeeper unusable for a short period, shouldn't they elect a new
> >> leader instead? Thanks!
> >>
> >>
> >> Regards,
> >> Qian Zhang
> >>
> >
>
