Sounds good. Take a look at the GC options in jstat; -gcutil, for example, will give you insight into the GC activity of the JVM when you don't have GC logging turned on.
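A minimal sketch of what that looks like in practice. The pid (12345) is a placeholder for your ZooKeeper JVM's process id, and the sample line below is fabricated for illustration; the live command is shown in the comments.

```shell
# On a live box you would sample once per second:
#   jstat -gcutil 12345 1000
# The -gcutil columns (JDK 6/7 era) are:
#   S0 S1 E O P YGC YGCT FGC FGCT GCT
# Large jumps in FGC/FGCT (full-GC count and time) line up with latency spikes.
# Parsing a captured sample line:
sample='  0.00  64.72  41.53  55.08  99.51    120    3.412     5    2.117    5.529'
echo "$sample" | awk '{ print "full GCs:", $8, "full GC time (s):", $9 }'
# -> full GCs: 5 full GC time (s): 2.117
```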
Patrick

On Fri, Nov 11, 2011 at 10:15 AM, kishore g <[email protected]> wrote:
> Hi Pat,
>
> We are already working on that; it was because of deploying ZK along with
> other applications. We will have separate boxes for ZK very soon.
>
> I already saw the stats and could not correlate them with the spikes;
> unfortunately we didn't have GC logs. There were a few deployment rules
> that were clearly violated, and we will fix them.
>
> Good news is that we found an issue :-). Thanks again for your help.
>
> thanks,
> Kishore G
>
> On Fri, Nov 11, 2011 at 9:47 AM, Patrick Hunt <[email protected]> wrote:
>
>> We always triage new issues as they come in (same when 1208 originally
>> came in). However, our ability to determine the cause is often bounded
>> by the information provided by the user, which in this recent update
>> was excellent and pointed out exactly the flaw. Kudos.
>>
>> We'll button up this issue (1208) for 3.3/3.4/trunk. After which I'll
>> work on cutting a new 3.3.4 that includes this and some other fixes.
>> It would be good if you could test this patch in the meantime.
>>
>> Also, this is particularly worrisome for me; you reported the
>> following for your servers:
>>
>> Latency min/avg/max: 0/53/44712
>> Latency min/avg/max: 0/11/11727
>> Latency min/avg/max: 0/12/11994
>> Latency min/avg/max: 0/9/11707
>>
>> That's really, really terrible performance, and you won't be happy with
>> ZK as a result. You need to follow up with your ops team to determine
>> why the performance you are seeing is so poor, both in terms of
>> average and max latency.
>>
>> Spikes in max latency are usually due to GC, swap, or bad disk
>> performance for the WAL.
>> Bad average latency might indicate poor network performance, or again
>> bad disk performance.
>>
>> Try turning on CMS/parallel GC. Also try using iostat and look at the
>> await times you're seeing for the WAL disk (correlate that with spikes
>> in max latency; those counters can be reset using a four-letter word).
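A sketch of the checks Patrick describes above. The hostname/port (localhost:2181) are assumptions, the live commands are shown in comments, and the `Latency` line below is one of the real values quoted in the thread:

```shell
# On a live ensemble you would run (device and port are assumptions):
#   iostat -x 5                      # watch the 'await' column for the WAL disk
#   echo srst | nc localhost 2181    # four-letter word: reset latency counters
#   echo stat | nc localhost 2181    # re-read min/avg/max after the reset
# Extracting the max latency from a 'stat' Latency line:
line='Latency min/avg/max: 0/53/44712'
echo "$line" | awk -F'[ /]' '{ print "max latency (ms):", $NF }'
# -> max latency (ms): 44712
```

Resetting the counters with `srst` before a test window is what makes the max-latency spike attributable to a specific event rather than the server's whole uptime.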
>>
>> Regards,
>>
>> Patrick
>>
>> On Fri, Nov 11, 2011 at 7:21 AM, Neha Narkhede <[email protected]> wrote:
>> > Pat,
>> >
>> > That is excellent turnaround! I will take a look at running the test
>> > as well as your patch. It will be a good opportunity for me to start
>> > understanding the ZooKeeper codebase.
>> >
>> > Thanks again,
>> > Neha
>> >
>> > On Thursday, November 10, 2011, Patrick Hunt <[email protected]> wrote:
>> >> Ok, patch posted that fixes this (1208). Committers, please take a look.
>> >>
>> >> Neha, you might want to give a patched version a try. Awesome job
>> >> helping to document and track down this issue. Thanks!
>> >>
>> >> Patrick
>> >>
>> >> On Thu, Nov 10, 2011 at 4:43 PM, Patrick Hunt <[email protected]> wrote:
>> >>> See my update to 1208 for a test that demonstrates this.
>> >>>
>> >>> On Thu, Nov 10, 2011 at 3:31 PM, Neha Narkhede <[email protected]> wrote:
>> >>>> Thanks, Patrick, for looking into this issue!
>> >>>>
>> >>>>>> The logs would indicate if an election happens. Look for "LOOKING"
>> >>>>>> or "LEADING" or "FOLLOWING".
>> >>>>
>> >>>> The logs don't have any such entries, so I'm guessing there was no
>> >>>> election happening.
>> >>>>
>> >>>> Do you have thoughts, though, on how easy it would be to reproduce
>> >>>> this bug, to verify the bug fix?
>> >>>>
>> >>>> Thanks,
>> >>>> Neha
>> >>>>
>> >>>>
>> >>>> On Thu, Nov 10, 2011 at 2:08 PM, Patrick Hunt <[email protected]> wrote:
>> >>>>
>> >>>>> On Thu, Nov 10, 2011 at 1:52 PM, Neha Narkhede <[email protected]> wrote:
>> >>>>> > Thanks for the quick responses, guys! Please find my replies inline -
>> >>>>> >
>> >>>>> >>> 1) Why is the session closed, the client closed it or the cluster
>> >>>>> >>> expired it?
>> >>>>> > Cluster expired it.
>> >>>>> >
>> >>>>>
>> >>>>> Yes, I realized after that the cxid is 0 in your logs - that indicates
>> >>>>> it was expired and not closed explicitly by the client.
>> >>>>>
>> >>>>> >>> 3) the znode exists on all 4 servers, is that right?
>> >>>>> > Yes
>> >>>>> >
>> >>>>>
>> >>>>> This holds up my theory that the PrepRequestProcessor is accepting a
>> >>>>> create from the client after the session has been expired.
>> >>>>>
>> >>>>> >>> 5) why are your max latencies, as well as avg latency, so high?
>> >>>>> >>> a) are these dedicated boxes, not virtualized, correct?
>> >>>>> > These are dedicated boxes, but ZK is currently co-located with
>> >>>>> > Kafka, though on different disks.
>> >>>>> >
>> >>>>> >>> b) is the JVM going into GC pause? (try turning on verbose
>> >>>>> >>> logging, or use "jstat" with the GC options to see the history
>> >>>>> >>> if you still have those JVMs running)
>> >>>>> > I don't believe we had GC logs on these machines, so it's unclear.
>> >>>>> >
>> >>>>> >>> d) do you have dedicated spindles for the ZK WAL? If not, another
>> >>>>> >>> process might be causing the fsyncs to pause. (you can use iostat
>> >>>>> >>> or strace to monitor this)
>> >>>>> > No. The log4j and ZK txn logs share the same disks.
>> >>>>> >
>> >>>>> >>> Is that the log from the server that's got the 44sec max latency?
>> >>>>> > Yes.
>> >>>>> >
>> >>>>> >>> This is 3.3.3?
>> >>>>> > Yes.
>> >>>>> >
>> >>>>> >>> was there any instability in the quorum itself during this time
>> >>>>> >>> period?
>> >>>>> > How do I find that out?
>> >>>>>
>> >>>>> The logs would indicate if an election happens. Look for "LOOKING" or
>> >>>>> "LEADING" or "FOLLOWING".
>> >>>>>
>> >>>>> Your comments are consistent with my theory. Seems like a bug in PRP
>> >>>>> session validation to me.
>> >>>>>
>> >>>>> Patrick
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >
>>
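The election check described in the thread can be sketched as a grep over the server log. The log path and the log lines below are fabricated stand-ins for real QuorumPeer output; any hit means a leader election happened during the window:

```shell
# Build a fake log so the sketch is self-contained; on a live server you
# would point grep at the actual ZooKeeper server log instead.
log=$(mktemp)
cat > "$log" <<'EOF'
2011-11-10 13:50:01 INFO  QuorumPeer - LOOKING
2011-11-10 13:50:02 INFO  QuorumPeer - FOLLOWING
2011-11-10 13:55:00 INFO  PrepRequestProcessor - Processed session termination
EOF
# Any match here indicates a quorum state change (i.e., an election).
grep -E 'LOOKING|LEADING|FOLLOWING' "$log"
rm -f "$log"
```

No matches over the incident window is what led Neha to conclude no election occurred.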
