Hi Stack, the other interesting part is with the session: 0x26ed968d880001
Looks like it gets disconnected from one of the servers (TIMEOUT). Do you see any of these messages: "Attempting connection to server" in the logs before you see all the consecutive

org.apache.zookeeper.ClientCnxn: Exception closing session 0x26ed968d880001 to sun.nio.ch.selectionkeyi...@788ab708
java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
    at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)

and....

from the client 0x26ed968d880001?

Thanks
mahadev

On 2/22/10 11:42 AM, "Stack" <st...@duboce.net> wrote:

> The thing that seems odd to me is that the connectivity complaints are
> out of the zk client, right?, why is it failing getting to member 14
> and why not move to another ensemble member if issue w/ 14?, and if
> there were a general connectivity issue, I'd think that the running
> hbase cluster would be complaining at about the same time (its talking
> to datanodes and masters at this time).
>
> (Thanks for the input lads)
>
> St.Ack
>
>
> On Mon, Feb 22, 2010 at 11:26 AM, Mahadev Konar <maha...@yahoo-inc.com> wrote:
>> I also looked at the logs. Ted might have a point. It does look like that
>> zookeeper server's are doing fine (though as ted mentions the skew is a
>> little concerning, though that might be due to very few packets served by
>> the first server). Other than that the latencies of 300 ms at max should not
>> cause any timeouts.
>> Also, the number of packets received is pretty low - meaning that it wasn't
>> serving huge traffic. Is there anyway we can check if the network connection
>> from the client to the server is not flaky?
>>
>> Thanks
>> mahadev
>>
>>
>> On 2/22/10 10:40 AM, "Ted Dunning" <ted.dunn...@gmail.com> wrote:
>>
>>> Not sure this helps at all, but these times are remarkably asymmetrical. I
>>> would expect members of a ZK cluster to have very comparable times.
>>>
>>> Additionally, 345 ms is nowhere near large enough to cause a session to
>>> expire. My take is that ZK doesn't think it caused the timeout.
>>>
>>> On Mon, Feb 22, 2010 at 10:18 AM, Stack <st...@duboce.net> wrote:
>>>
>>>> Latency min/avg/max: 2/125/345
>>>> ...
>>>> Latency min/avg/max: 0/7/81
>>>> ...
>>>> Latency min/avg/max: 1/1/1
>>>>
>>>> Thanks for any pointers on how to debug.
>>>>
>>
>>
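
A rough sketch of the log check asked about at the top of this message, in case it helps. It assumes the client log has one message per line and contains the two quoted strings ("Attempting connection to server" and "Exception closing session ...") verbatim; the script name and log path are hypothetical. It relies on line order only, not timestamps, since the log's timestamp format is not shown in this thread.

```python
import sys

SESSION_ID = "0x26ed968d880001"  # session quoted in this thread
RECONNECT_MARKER = "Attempting connection to server"
CLOSE_MARKER = "Exception closing session " + SESSION_ID


def summarize(lines):
    """Return (reconnect attempts seen before the first close exception,
    total close exceptions for SESSION_ID), based on line order only."""
    reconnects_before_first_close = 0
    close_count = 0
    for line in lines:
        if CLOSE_MARKER in line:
            close_count += 1
        elif RECONNECT_MARKER in line and close_count == 0:
            reconnects_before_first_close += 1
    return reconnects_before_first_close, close_count


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_session.py /path/to/client.log
    with open(sys.argv[1], errors="replace") as log:
        before, closes = summarize(log)
    print(f"reconnect attempts before first close exception: {before}")
    print(f"close exceptions for session {SESSION_ID}: {closes}")
```

A zero in the first count alongside a large second count would suggest the client kept failing without logging a reconnect attempt first, though that reading depends on the assumptions above holding for the actual log.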