[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239715#comment-16239715
 ] 

Brian Nixon commented on ZOOKEEPER-2357:
----------------------------------------

Not an expert but a couple things pop out to me. One, the WARN messages are 
what you expect when a follower loses contact with the leader. Two, 50 seconds 
to sync the txn log is a long time.

I don't know what the SyncThread of the FileTxnLog is blocking but it could be 
the case that the data load is impacting the server-server communication.


> Unhandled errors propagating through cluster
> --------------------------------------------
>
>                 Key: ZOOKEEPER-2357
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2357
>             Project: ZooKeeper
>          Issue Type: Task
>          Components: leaderElection, quorum, server
>    Affects Versions: 3.4.6
>            Reporter: Gareth Humphries
>            Priority: Minor
>
> Hi,
> I need some help understanding a recurring problem we're seeing with our 
> zookeeper cluster.  It's a five node cluster that ordinarily runs fine.  
> Occasionally we see an error from which the cluster recovers, but it causes a 
> lot of grief and I'm sure is representative of an unhealthy situation.
> To my eye it looks like an invalid bit of data getting into the system and 
> not being handled gracefully; I'm the first to say my eye is not expert 
> though, so I humbly submit an annotated log exert in the hope some who knows 
> more than me can provide some illumination.
> The cluster seems to be ticking along fine, until we get errors on 2 of the 5 
> nodes like so:
> 2016-01-19 13:12:49,698 - WARN  [QuorumPeer[myid=1]/0.0.0.0:2181:Follower@89] 
> - Exception when following the leader
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at 
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at 
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>         at 
> org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>         at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2016-01-19 13:12:49,698 - INFO  
> [QuorumPeer[myid=1]/0.0.0.0:2181:Follower@166] - shutdown called
> java.lang.Exception: shutdown Follower
>         at 
> org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:790)
> This is immediately followed by 380 occurences of:
> 2016-01-19 13:12:49,699 - INFO  
> [QuorumPeer[myid=1]/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket 
> connection for client /X.Y.Z.56:59028 which had sessionid 0x151b01ee8330234
> and a:
> 2016-01-19 13:12:49,766 - INFO  
> [QuorumPeer[myid=1]/0.0.0.0:2181:FollowerZooKeeperServer@139] - Shutting down
> 2016-01-19 13:12:49,766 - INFO  
> [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@441] - shutting down
> 2016-01-19 13:12:49,766 - INFO  
> [QuorumPeer[myid=1]/0.0.0.0:2181:FollowerRequestProcessor@105] - Shutting down
> 2016-01-19 13:12:49,766 - INFO  
> [QuorumPeer[myid=1]/0.0.0.0:2181:CommitProcessor@181] - Shutting down
> 2016-01-19 13:12:49,766 - INFO  
> [QuorumPeer[myid=1]/0.0.0.0:2181:FinalRequestProcessor@415] - shutdown of 
> request processor complete
> 2016-01-19 13:12:49,767 - INFO  
> [QuorumPeer[myid=1]/0.0.0.0:2181:SyncRequestProcessor@209] - Shutting down
> 2016-01-19 13:12:49,767 - INFO  [CommitProcessor:1:CommitProcessor@150] - 
> CommitProcessor exited loop!
> 2016-01-19 13:12:49,767 - INFO  
> [FollowerRequestProcessor:1:FollowerRequestProcessor@95] - 
> FollowerRequestProcessor exited loop!
> 2016-01-19 13:13:09,418 - WARN  [SyncThread:1:FileTxnLog@334] - fsync-ing the 
> write ahead log in SyncThread:1 took 30334ms which will adversely effect 
> operation latency. See the ZooKeeper troubleshooting guide
> 2016-01-19 13:13:09,427 - WARN  [SyncThread:1:SendAckRequestProcessor@64] - 
> Closing connection to leader, exception during packet send
> java.net.SocketException: Socket closed
>         at 
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:121)
>         at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
>         at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>         at 
> org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:139)
>         at 
> org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:62)
>         at 
> org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:204)
>         at 
> org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:131)
> 2016-01-19 13:13:09,428 - INFO  [SyncThread:1:SyncRequestProcessor@187] - 
> SyncRequestProcessor exited!
> As a small aside, the fsync log errors for the first two servers to be hit 
> are:
> 2016-01-19 13:13:09,418 - WARN  [SyncThread:1:FileTxnLog@334] - fsync-ing the 
> write ahead log in SyncThread:1
> took 30334ms which will adversely effect operation latency.
> 2016-01-19 13:13:09,539 - WARN  [SyncThread:2:FileTxnLog@334] - fsync-ing the 
> write ahead log in SyncThread:2
> took 30456ms which will adversely effect operation latency.
> If rewind from date of the entry the milliseconds given, you arrive within 
> one millisecond of the same time on each server.
> But I digress.
> For the next 12 minutes or so, the logs are full of the below sort of 
> exceptions, in seemingly no consistent order or frequency:
> 2016-01-19 13:13:09,440 - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> 2016-01-19 13:13:09,441 - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed 
> socket connection for client /X.Y.Z.181:51381 (no session established for 
> client)
> 2016-01-19 13:13:09,443 - WARN  
> [QuorumPeer[myid=1]/0.0.0.0:2181:SendAckRequestProcessor@64] - Closing 
> connection to leader, exception during packet send
> java.net.SocketException: Socket closed
>         at 
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:121)
>         at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
>         at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>         at 
> org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:139)
>         at 
> org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:62)
>         at 
> org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:204)
>         at 
> org.apache.zookeeper.server.SyncRequestProcessor.shutdown(SyncRequestProcessor.java:216)
>         at 
> org.apache.zookeeper.server.quorum.FollowerZooKeeperServer.shutdown(FollowerZooKeeperServer.java:147)
>         at 
> org.apache.zookeeper.server.quorum.Learner.shutdown(Learner.java:546)
>         at 
> org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:167)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:790)
> 2016-01-19 13:13:09,443 - INFO  
> [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumPeer@714] - LOOKING
> 2016-01-19 13:13:11,782 - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of 
> stream exception
> EndOfStreamException: Unable to read additional data from client sessionid 
> 0x1525a047dc20005, likely client has closed socket
>         at 
> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
>         at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>         at java.lang.Thread.run(Thread.java:744)
> 2016-01-19 13:13:11,783 - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed 
> socket connection for client /X.Y.Z.1:59576 which had sessionid 
> 0x1525a047dc20005
> 2016-01-19 13:13:11,784 - ERROR [CommitProcessor:1:NIOServerCnxn@178] - 
> Unexpected Exception:
> java.nio.channels.CancelledKeyException
>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>         at 
> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
>         at 
> org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
>         at 
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
>         at 
> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
> 2016-01-19 13:25:43,898 - INFO  
> [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message 
> format version), 2 (n.leader), 0x2a001d352d (n.zxid), 0xb (n.round), LOOKING 
> (n.state), 2 (n.sid), 0x2a (n.peerEpoch) FOLLOWING (my state)
> 2016-01-19 13:25:43,901 - WARN  [QuorumPeer[myid=1]/0.0.0.0:2181:Follower@89] 
> - Exception when following the leader
> java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(SocketInputStream.java:152)
>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at 
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at 
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>         at 
> org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>         at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2016-01-19 13:25:43,901 - INFO  
> [QuorumPeer[myid=1]/0.0.0.0:2181:Follower@166] - shutdown called
> java.lang.Exception: shutdown Follower
>         at 
> org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:790)
> Until eventually we get to:
> 2016-01-19 13:26:05,099 - INFO  
> [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumPeer@784] - FOLLOWING
> 2016-01-19 13:26:05,099 - INFO  
> [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@162] - Created server with 
> tickTime 2000 minSessionTimeout 4000
> maxSessionTimeout 40000 datadir /var/lib/zookeeper_1/data/version-2 snapdir 
> /var/lib/zookeeper_1/data/version-2
> 2016-01-19 13:26:05,099 - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:Follower@63] 
> - FOLLOWING - LEADER ELECTION TOOK - 21179
> 2016-01-19 13:26:05,100 - WARN  [QuorumPeer[myid=1]/0.0.0.0:2181:Learner@233] 
> - Unexpected exception, tries=0, connecting to zoo005/X.Y.Z.71:2888
> java.net.ConnectException: Connection refused
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>         at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>         at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:579)
>         at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:225)
>         at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> And things start to come right.  Right about now, the three member which had 
> so far escaped begin to exhibit the same behaviour.
> Again, if we look at the fsync messages:
> 2016-01-19 13:26:06,192 - WARN  [SyncThread:3:FileTxnLog@334] - fsync-ing the 
> write ahead log in SyncThread:3
> took 51394ms which will adversely effect operation latency.
> 2016-01-19 13:26:05,960 - WARN  [SyncThread:4:FileTxnLog@334] - fsync-ing the 
> write ahead log in SyncThread:4
> took 51162ms which will adversely effect operation latency.
> 2016-01-19 13:26:04,524 - WARN  [SyncThread:5:FileTxnLog@334] - fsync-ing the 
> write ahead log in SyncThread:5
> took 49726ms which will adversely effect operation latency.
> And we rewind the number of milliseconds from the log entry timestamp, we 
> arrive at exactly 13:25:14,798 for all three events.
> So, it looks for all the world like something entered the system at 
> 13:12:39,084, caused havoc on two nodes for 12.5 minutes, then at 
> 13:25:14,798 it got off those and made to the other three, where it again 
> caused havoc, before things eventually recovered and the world kept on 
> ticking, only a medium sized log explosion worse for it.
> There is nothing in any of the logs within a second of either of those times.
> I'm hoping someone familiar with the code can look at those stack traces and 
> understand what might cause such an incident.
> I'm to help anyway I can.  I have more complete logs, and we see this every 
> couple of weeks or so, so can setup some additional logging if it would be of 
> value.  Let me know.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to