Dang. Didn't save the log. Pardon me. I pasted exceptions only and thought it all about 0x26ed968d880001 session but now I see that what I posted above has TIMED_OUT on another session altogether. Above I skipped pasting exceptions thinking them on the same session but now it seems they probably were not.
I'm trying to track a case where zk seems of a sudden, client-side, to give up the ghost w/ exceptions like those pasted above -- connectivity probs. There has been pollution in here where long gc pauses that are > session timeout would trigger TIMED_OUT but those have been tamed. I'll be back if I get another instance on my hook. Meantime, thanks for the comments. St.Ack On Mon, Feb 22, 2010 at 6:43 PM, Mahadev Konar <maha...@yahoo-inc.com> wrote: > HI stack, > the other interesting part is with the session: > 0x26ed968d880001 > > Looks like it gets disconnected from one of the servers (TIMEOUT). DO you > see any of these messages: "Attempting connection to server" in the logs > before you see all the consecutive > > org.apache.zookeeper.ClientCnxn: Exception closing session > 0x26ed968d880001 to sun.nio.ch.selectionkeyi...@788ab708 > java.io.IOException: Read error rc = -1 > java.nio.DirectByteBuffer[pos=0 lim=4 cap=4] > at > org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701) > at > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945) > > and.... > > > From the cient 0x26ed968d880001? > > Thanks > mahadev > > > On 2/22/10 11:42 AM, "Stack" <st...@duboce.net> wrote: > >> The thing that seems odd to me is that the connectivity complaints are >> out of the zk client, right?, why is it failing getting to member 14 >> and why not move to another ensemble member if issue w/ 14?, and if >> there were a general connectivity issue, I'd think that the running >> hbase cluster would be complaining at about the same time (its talking >> to datanodes and masters at this time). >> >> (Thanks for the input lads) >> >> St.Ack >> >> >> On Mon, Feb 22, 2010 at 11:26 AM, Mahadev Konar <maha...@yahoo-inc.com> >> wrote: >>> I also looked at the logs. Ted might have a point. It does look like that >>> zookeeper server's are doing fine (though as ted mentions the skew is a >>> little concerning, though that might be due to very few packets served by >>> the first server). Other than that the latencies of 300 ms at max should not >>> cause any timeouts. >>> Also, the number of packets received is pretty low - meaning that it wasn't >>> serving huge traffic. Is there anyway we can check if the network connection >>> from the client to the server is not flaky? >>> >>> Thanks >>> mahadev >>> >>> >>> On 2/22/10 10:40 AM, "Ted Dunning" <ted.dunn...@gmail.com> wrote: >>> >>>> Not sure this helps at all, but these times are remarkably asymmetrical. I >>>> would expect members of a ZK cluster to have very comparable times. >>>> >>>> Additionally, 345 ms is nowhere near large enough to cause a session to >>>> expire. My take is that ZK doesn't think it caused the timeout. >>>> >>>> On Mon, Feb 22, 2010 at 10:18 AM, Stack <st...@duboce.net> wrote: >>>> >>>>> Latency min/avg/max: 2/125/345 >>>>> ... >>>>> Latency min/avg/max: 0/7/81 >>>>> ... >>>>> Latency min/avg/max: 1/1/1 >>>>> >>>>> Thanks for any pointers on how to debug. >>>>> >>> >>> > >