Thank you, will check.

On Thu, Jan 25, 2018 at 11:10 AM, Andor Molnar <an...@cloudera.com> wrote:
> Use EBS drives and make sure you allocate enough IOPS for the load.
>
> Andor
>
> On Thu, Jan 25, 2018 at 4:21 PM, upendar devu <devulapal...@gmail.com> wrote:
>
> > "a disk write has taken too long as well": I will check on this, thanks
> > for finding it. The ZK logs are really a bit difficult for me to
> > understand.
> >
> > On Thu, Jan 25, 2018 at 10:19 AM, upendar devu <devulapal...@gmail.com> wrote:
> >
> > > Thanks for sharing the analysis. The instances are running on EC2, and
> > > we have Kafka, ZK, Storm, and ES instances as well, but we have not
> > > seen such errors in those components. If there were network latency,
> > > the other components should show socket errors too, since data is
> > > being processed every second.
> > >
> > > Let's hear from the ZooKeeper dev team; hope they will respond.
> > >
> > > On Thu, Jan 25, 2018 at 6:39 AM, Andor Molnar <an...@cloudera.com> wrote:
> > >
> > > > No, this is not the bug I was thinking of.
> > > >
> > > > Looks like the network connection is poor between the leader and the
> > > > follower whose logs were attached. Do you have any other network
> > > > monitoring tools in place, or do you see any network-related error
> > > > messages in your kernel logs?
> > > >
> > > > The follower lost the connection to the leader:
> > > >
> > > > 2018-01-23 07:40:21,709 [myid:3] - WARN
> > > > [SyncThread:3:SendAckRequestProcessor@64] - Closing connection to
> > > > leader, exception during packet send
> > > >
> > > > ...and took ages to recover: 944 secs!!
> > > >
> > > > 2018-01-23 07:56:05,742 [myid:3] - INFO
> > > > [QuorumPeer[myid=3]/XX.XX.XX:2181:Follower@63] - FOLLOWING - LEADER
> > > > ELECTION TOOK - 944020
> > > >
> > > > Additionally, a disk write has taken too long as well:
> > > >
> > > > 2018-01-23 07:40:21,706 [myid:3] - WARN [SyncThread:3:FileTxnLog@334] -
> > > > fsync-ing the write ahead log in SyncThread:3 took 13638ms which will
> > > > adversely effect operation latency. See the ZooKeeper troubleshooting
> > > > guide
> > > >
> > > > I believe this is worth a closer look, though I'm not a ZooKeeper
> > > > expert; maybe somebody else can give you more insight.
> > > >
> > > > Regards,
> > > > Andor
> > > >
> > > > On Wed, Jan 24, 2018 at 7:47 PM, upendar devu <devulapal...@gmail.com> wrote:
> > > >
> > > > > Thanks Andor for the reply.
> > > > >
> > > > > We are using ZooKeeper version 3.4.6 with 3 instances; please see
> > > > > the configuration below. I believe we are using the default
> > > > > configuration. The ZK log is attached; the issue occurred at First
> > > > > Occurrence: 01/23/2018 07:42:22, Last Occurrence: 01/23/2018 07:43:22.
> > > > >
> > > > > The issue occurs 3 to 4 times a month and auto-resolves within a few
> > > > > minutes, but it is really annoying our operations team. Please let
> > > > > me know if you need any additional details.
> > > > >
> > > > > # The number of milliseconds of each tick
> > > > > tickTime=2000
> > > > >
> > > > > # The number of ticks that the initial synchronization phase can take
> > > > > initLimit=10
> > > > >
> > > > > # The number of ticks that can pass between sending a request and
> > > > > # getting an acknowledgement
> > > > > syncLimit=5
> > > > >
> > > > > # The directory where the snapshot is stored.
> > > > > dataDir=/opt/zookeeper/current/data
> > > > >
> > > > > # The port at which the clients will connect
> > > > > clientPort=2181
> > > > >
> > > > > # This is the list of Zookeeper peers:
> > > > > server.1=zookeeper1:2888:3888
> > > > > server.2=zookeeper2:2888:3888
> > > > > server.3=zookeeper3:2888:3888
> > > > >
> > > > > # The interface IP address(es) on which ZooKeeper will listen
> > > > > clientPortAddress=<IP of zk>
> > > > >
> > > > > # The number of snapshots to retain in dataDir
> > > > > autopurge.snapRetainCount=3
> > > > >
> > > > > # Purge task interval in hours
> > > > > # Set to "0" to disable auto purge feature
> > > > > autopurge.purgeInterval=1
> > > > >
> > > > > On Wed, Jan 24, 2018 at 4:51 AM, Andor Molnar <an...@cloudera.com> wrote:
> > > > >
> > > > > > Hi Upendar,
> > > > > >
> > > > > > Thanks for reporting the issue. I have a gut feeling about which
> > > > > > existing bug you've run into, but would you please share some more
> > > > > > detail (version of ZK, log context, config files, etc.) so we can
> > > > > > be confident?
> > > > > >
> > > > > > Thanks,
> > > > > > Andor
> > > > > >
> > > > > > On Wed, Jan 17, 2018 at 4:36 PM, upendar devu <devulapal...@gmail.com> wrote:
> > > > > >
> > > > > > > We are getting the below error twice a month. Though it
> > > > > > > auto-resolves, can anyone explain why this error occurs and what
> > > > > > > needs to be done to prevent it? Is this a common error that can
> > > > > > > be ignored?
> > > > > > >
> > > > > > > Please suggest.
> > > > > > >
> > > > > > > 2018-01-16 20:36:17,378 [myid:2] - WARN
> > > > > > > [RecvWorker:3:QuorumCnxManager$RecvWorker@780] - Connection broken
> > > > > > > for id 3, my id = 2, error = java.net.SocketException: Socket closed
> > > > > > >   at java.net.SocketInputStream.socketRead0(Native Method)
> > > > > > >   at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> > > > > > >   at java.net.SocketInputStream.read(SocketInputStream.java:171)
> > > > > > >   at java.net.SocketInputStream.read(SocketInputStream.java:141)
> > > > > > >   at java.net.SocketInputStream.read(SocketInputStream.java:224)
> > > > > > >   at java.io.DataInputStream.readInt(DataInputStream.java:387)
> > > > > > >   at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:765)
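To see why the fsync warning in the thread matters: with the quoted settings, a follower must keep up with the leader within syncLimit x tickTime = 5 x 2000 ms = 10 s, so a single 13638 ms fsync stall on the write-ahead log is enough to drop the follower from the quorum and trigger the leader election seen in the logs. Below is a minimal sketch for spot-checking fsync latency on the volume behind dataDir; it is a hypothetical standalone probe, not part of ZooKeeper, and the file name and 4 KB write size are assumptions.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical probe: time a handful of small write+fsync cycles in the
// given directory (e.g. the ZooKeeper dataDir) to check disk latency.
public class FsyncProbe {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        Path probe = dir.resolve("fsync-probe.tmp");
        ByteBuffer buf = ByteBuffer.allocate(4096); // one small "transaction"
        try (FileChannel ch = FileChannel.open(probe,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            for (int i = 0; i < 10; i++) {
                buf.rewind();
                ch.write(buf);
                long start = System.nanoTime();
                ch.force(true); // flush to stable storage, analogous to the WAL fsync
                long ms = (System.nanoTime() - start) / 1_000_000;
                System.out.println("fsync " + (i + 1) + ": " + ms + " ms");
            }
        } finally {
            Files.deleteIfExists(probe);
        }
    }
}

Run it as "java FsyncProbe /opt/zookeeper/current/data"; sustained times anywhere near the 10 s budget above would point at the EBS volume rather than the network.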
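Andor's EBS/IOPS advice can also be paired with a standard ZooKeeper option: putting the transaction log on its own device via dataLogDir, so snapshot writes cannot stall log fsyncs. An illustrative zoo.cfg fragment follows; the txnlog mount point is an assumption, to be adjusted to the actual layout.

# Illustrative fragment -- dataLogDir is a standard option; the mount
# point below is a placeholder for a dedicated provisioned-IOPS volume.
dataDir=/opt/zookeeper/current/data
dataLogDir=/var/zookeeper/txnlog

Keeping the write-ahead log on its own provisioned-IOPS volume targets exactly the 13638 ms fsync that preceded the 944-second recovery.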