I've confirmed that no time change occurred before things got stuck, which was two days before we noticed it.
I also just noticed that the FollowerZooKeeperServer, which should be calling commit() on the CommitProcessor, has no elements in its pendingTxns list. That indicates it thinks it has already passed a COMMIT message to the CommitProcessor for every request that is stuck in the CommitProcessor's queuedRequests list and nextPending member. I guess I should open a JIRA for this at this point?
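To make the suspected hang easier to reason about, here is a minimal, self-contained sketch of the handoff as I understand it. To be clear, this is not the real CommitProcessor code — the class name, the String stand-ins for requests, and the simplified condition are all mine — but it captures the invariant from the jstack in my Mar 23 message below: once nextPending is set, run() parks in wait() until commit() delivers a matching committed request. If that commit() never arrives (or its notify is lost), queuedRequests grows without bound while the thread stays parked, which matches the ~24K stuck requests we observed.

import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch of the commit handoff -- NOT the real CommitProcessor.
// The class name, String stand-ins for requests, and the simplified
// condition are mine; only the shape of the wait matches the jstack.
public class CommitWaitSketch implements Runnable {
    private final Deque<String> queuedRequests = new ArrayDeque<>();
    private final Deque<String> committedRequests = new ArrayDeque<>();
    private String nextPending;   // head write request awaiting its COMMIT
    private boolean finished;

    private boolean isWaitingForCommit() {
        return nextPending != null;
    }

    // What the follower should be calling when a COMMIT arrives.
    public synchronized void commit(String request) {
        committedRequests.addLast(request);
        notifyAll();   // the wakeup our stuck thread never seems to get
    }

    // Upstream request submission.
    public synchronized void processRequest(String request) {
        queuedRequests.addLast(request);
        notifyAll();
    }

    public synchronized void shutdown() {
        finished = true;
        notifyAll();
    }

    @Override
    public void run() {
        try {
            while (true) {
                synchronized (this) {
                    // The parked state from the jstack: blocked here while
                    // isWaitingForCommit() && committedRequests.isEmpty().
                    // (The real code uses a timed wait(long), per the trace.)
                    while (!finished
                            && (queuedRequests.isEmpty() || isWaitingForCommit())
                            && committedRequests.isEmpty()) {
                        wait();
                    }
                    if (finished) {
                        return;
                    }
                    if (!committedRequests.isEmpty()) {
                        // Pair the commit with nextPending and hand the request
                        // to the next processor (elided in this sketch).
                        committedRequests.pollFirst();
                        nextPending = null;
                    } else {
                        // Promote the next queued request; it now blocks the
                        // queue until its commit() arrives.
                        nextPending = queuedRequests.pollFirst();
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

One wrinkle: the trace shows Object.wait(long), i.e. a timed wait, which is exactly why the links Patrick sent about timed waits misbehaving after a system date change seemed worth ruling out first.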
~Jared

On Tue, Mar 24, 2015 at 8:49 AM, Jared Cantwell <jared.cantw...@gmail.com> wrote:

> I do not see any evidence of a time jump or date change on this node
> recently. I will continue to investigate.
>
> ~Jared
>
> On Mon, Mar 23, 2015 at 6:42 PM, Patrick Hunt <ph...@apache.org> wrote:
>
>> Not this, right?
>>
>> http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6900441
>> http://osdir.com/ml/hotspot-runtime-dev-java/2013-09/msg00006.html
>> https://bbossola.wordpress.com/2013/09/04/jvm-issue-concurrency-is-affected-by-changing-the-date-of-the-system/
>>
>> Patrick
>>
>> On Mon, Mar 23, 2015 at 5:00 PM, Jared Cantwell <jared.cantw...@gmail.com> wrote:
>>
>>> Greetings,
>>>
>>> We just saw this problem again, and this time we were able to capture a
>>> core file of the JVM using gdb. I've run it through jstack and jmap to
>>> get a heap profile. I can see that the FollowerZooKeeperServer has a
>>> requestsInProcess member that is ~24K. I can also see that the
>>> CommitProcessor's queuedRequests list has those 24K items in it, so the
>>> FinalRequestProcessor's processRequest function is never getting called
>>> to complete the requests.
>>>
>>> The CommitProcessor's run() is doing this:
>>>
>>> Thread 23510: (state = BLOCKED)
>>>  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
>>>  - org.apache.zookeeper.server.quorum.CommitProcessor.run() @bci=165, line=182 (Compiled frame)
>>>
>>> Based on that state, it made it to wait() because isWaitingForCommit()==true
>>> && committedRequests.isEmpty()==true.
>>>
>>> Strangely, once we detached from the JVM, it must have woken up this
>>> thread, and the queue flushed out as expected, bringing everything back
>>> to normal.
>>>
>>> I'll keep digging, but any help or direction would be appreciated, as I'm
>>> not very familiar with this area of the codebase.
>>>
>>> Thanks!
>>> Jared
>>>
>>> On Tue, Feb 17, 2015 at 2:38 PM, Flavio Junqueira <fpjunque...@yahoo.com.invalid> wrote:
>>>
>>>> It doesn't ring a bell, but it might be worth having a look at the logs
>>>> to see if there is anything unusual.
>>>>
>>>> Just to clarify, was the number of outstanding requests growing, or
>>>> constant? I suppose the server was following/leading and operations were
>>>> going through; otherwise it would have dropped the connection to the
>>>> leader, or given up leadership.
>>>>
>>>> -Flavio
>>>>
>>>>> On 17 Feb 2015, at 18:01, Marshall McMullen <marshall.mcmul...@gmail.com> wrote:
>>>>>
>>>>> Greetings,
>>>>>
>>>>> We recently saw an issue that I've never seen before, and I'm hoping I
>>>>> can get some clarity on what may cause it and whether it's a known
>>>>> issue. We had a 5-node ensemble and were unable to connect to one of
>>>>> the ZooKeeper instances. Trying to connect with zkCli would time out.
>>>>> When I connected via telnet and issued the srvr four-letter word, I was
>>>>> surprised to see that this one server reported a massive number of
>>>>> 'Outstanding' requests. I'd never really seen that be anything other
>>>>> than 0 before. The ZK dev guide says:
>>>>>
>>>>> "outstanding is the number of queued requests, this increases when the
>>>>> server is under load and is receiving more sustained requests than it
>>>>> can process, ie the request queue"
>>>>>
>>>>> I looked at all the ZK servers in my ensemble:
>>>>>
>>>>> for ip in 101 102 103 104 105; do echo srvr | nc 172.21.20.${ip} 2181 | grep Outstanding; done
>>>>> Outstanding: 0
>>>>> Outstanding: 0
>>>>> Outstanding: 0
>>>>> Outstanding: 0
>>>>> Outstanding: 18876
>>>>>
>>>>> I eventually killed ZK on the affected server; everything corrected
>>>>> itself, Outstanding went to zero, and I was able to connect again.
>>>>>
>>>>> Is this something anyone's familiar with? I have logs if they would be
>>>>> helpful.
>>>>>
>>>>> Thanks!
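For anyone who wants to keep an eye on this across an ensemble without nc, here is a standalone equivalent of Marshall's loop above. It's a sketch — the class name is mine, and it assumes the srvr four-letter word is reachable on the client port, as it clearly was in this thread:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Sends the srvr four-letter word to each server and prints its
// Outstanding line; hosts/port are the ones from the loop above.
public class OutstandingCheck {
    public static void main(String[] args) throws Exception {
        for (int ip = 101; ip <= 105; ip++) {
            String host = "172.21.20." + ip;
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress(host, 2181), 2000);
                s.setSoTimeout(2000);   // don't hang on an unresponsive server
                OutputStream out = s.getOutputStream();
                out.write("srvr".getBytes(StandardCharsets.US_ASCII));
                out.flush();
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(s.getInputStream(), StandardCharsets.US_ASCII));
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.startsWith("Outstanding")) {
                        System.out.println(host + " -> " + line);
                    }
                }
            } catch (Exception e) {
                System.out.println(host + " -> error: " + e.getMessage());
            }
        }
    }
}

Against a healthy ensemble every line reports Outstanding: 0; a wedged server shows a large value. Four-letter words are handled outside the request pipeline, which is presumably why srvr still answered here while zkCli could not connect.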