I do not see any evidence of a time jump or date change on this node recently. I will continue to investigate.
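As a sanity check on that theory, here is a minimal throwaway sketch (my own code, nothing from ZooKeeper) of how the JVM bug Patrick linked would show up if it were in play: on an affected Linux JVM a timed Object.wait() is driven by the wall clock, so stepping the system clock backwards while a thread is parked can stretch the wait well past its timeout.

// Throwaway check for the timed-wait/wall-clock issue (JDK bug 6900441).
// Start it, then step the system clock back a few minutes while it waits:
// on an affected JVM the 5-second wait overshoots by roughly the size of the jump.
public class TimedWaitClockCheck {
    public static void main(String[] args) throws InterruptedException {
        final Object lock = new Object();
        long start = System.nanoTime();       // monotonic, unaffected by clock changes
        synchronized (lock) {
            lock.wait(5000);                  // should return after ~5 seconds
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("wait(5000) took " + elapsedMs + " ms");
    }
}

With the clock untouched this should print roughly 5000 ms; on an affected JVM, setting the date back mid-wait inflates the number by about the size of the jump.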
~Jared

On Mon, Mar 23, 2015 at 6:42 PM, Patrick Hunt <ph...@apache.org> wrote:
> Not this, right?
>
> http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6900441
> http://osdir.com/ml/hotspot-runtime-dev-java/2013-09/msg00006.html
> https://bbossola.wordpress.com/2013/09/04/jvm-issue-concurrency-is-affected-by-changing-the-date-of-the-system/
>
> Patrick
>
> On Mon, Mar 23, 2015 at 5:00 PM, Jared Cantwell <jared.cantw...@gmail.com> wrote:
> > Greetings,
> >
> > We just saw this problem again, and this time we were able to capture a core file of the jvm using gdb. I've run it through jstack and jmap to get a heap profile. I can see that the FollowerZooKeeperServer has a requestsInProcess member that is ~24K. I can also see that the CommitProcessor's queuedRequests list has the 24K items in it, so the FinalRequestProcessor's processRequest function isn't ever getting called to complete the requests.
> >
> > The CommitProcessor's run() is doing this:
> >
> > Thread 23510: (state = BLOCKED)
> >  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
> >  - org.apache.zookeeper.server.quorum.CommitProcessor.run() @bci=165, line=182 (Compiled frame)
> >
> > Based on the state, it made it to wait() because isWaitingForCommit()==true && committedRequests.isEmpty()==true.
> >
> > Strangely, once we detached from the jvm, it must have woken up this thread and the queue flushed out as expected, bringing everything back to normal.
> >
> > I'll keep digging, but any help or direction would be appreciated as I'm not very familiar with this area of the codebase.
> >
> > Thanks!
> > Jared
> >
> > On Tue, Feb 17, 2015 at 2:38 PM, Flavio Junqueira <fpjunque...@yahoo.com.invalid> wrote:
> >> It doesn't ring a bell, but it might be worth having a look at the logs to see if there is anything unusual.
> >>
> >> Just to clarify, was the number of outstanding requests growing or constant? I suppose the server was following/leading and operations were going through; otherwise it'd have dropped the connection to the leader, or dropped leadership.
> >>
> >> -Flavio
> >>
> >> > On 17 Feb 2015, at 18:01, Marshall McMullen <marshall.mcmul...@gmail.com> wrote:
> >> >
> >> > Greetings,
> >> >
> >> > We saw an issue recently that I've never seen before, and I am hoping I can get some clarity on what may cause this and whether it's a known issue. We had a 5-node ensemble and were unable to connect to one of the ZooKeeper instances. When trying to connect with zkCli it would time out. When I connected via telnet and issued the srvr four-letter word, I was surprised to see that this one server reported a massive number of 'Outstanding' requests. I'd never really seen that be anything other than 0 before. The ZK dev guide says:
> >> >
> >> > "outstanding is the number of queued requests, this increases when the server is under load and is receiving more sustained requests than it can process, ie the request queue".
> >> > I looked at all the ZK servers in my ensemble:
> >> >
> >> > for ip in 101 102 103 104 105; do echo srvr | nc 172.21.20.${ip} 2181 | grep Outstanding; done
> >> > Outstanding: 0
> >> > Outstanding: 0
> >> > Outstanding: 0
> >> > Outstanding: 0
> >> > Outstanding: 18876
> >> >
> >> > I eventually killed ZK on the affected server and everything corrected itself: Outstanding went to zero and I was able to connect again.
> >> >
> >> > Is this something anyone's familiar with? I have logs if they would be helpful.
> >> >
> >> > Thanks!
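To make the stack trace above easier to follow: the loop Jared describes is a classic guarded wait on the CommitProcessor monitor. The sketch below is a simplification written for this discussion, not the ZooKeeper source; the names queuedRequests, committedRequests, isWaitingForCommit, and the hand-off to FinalRequestProcessor are borrowed from his description, and everything else is invented.

import java.util.ArrayDeque;
import java.util.Deque;

// Simplified guarded-wait sketch of the stall described in the thread.
// The real CommitProcessor is more involved; only the shape of the wait
// and of the wakeup path is shown here.
public class CommitLoopSketch {
    private final Deque<String> queuedRequests = new ArrayDeque<>();
    private final Deque<String> committedRequests = new ArrayDeque<>();
    private boolean finished = false;

    // Stand-in for isWaitingForCommit(): in this simplified model the head of
    // the queue cannot be handed downstream until a matching commit arrives.
    private boolean isWaitingForCommit() {
        return !queuedRequests.isEmpty();
    }

    // Stand-in for CommitProcessor.run().
    public synchronized void runLoop() throws InterruptedException {
        while (!finished) {
            // The jstack output shows the thread parked here, with
            // isWaitingForCommit() true and committedRequests empty.
            while (!finished
                    && (queuedRequests.isEmpty() || isWaitingForCommit())
                    && committedRequests.isEmpty()) {
                wait();   // progress depends entirely on commit() calling notifyAll()
            }
            if (finished) {
                return;
            }
            String commit = committedRequests.pollFirst();
            if (commit != null) {
                queuedRequests.pollFirst();
                handOffDownstream(commit);   // stands in for FinalRequestProcessor.processRequest()
            }
        }
    }

    // Stand-in for a commit arriving from the leader.
    public synchronized void commit(String request) {
        committedRequests.addLast(request);
        notifyAll();   // if this wakeup is never sent or is missed, runLoop() stays parked
    }

    // Stand-in for a client request entering the pipeline.
    public synchronized void queue(String request) {
        queuedRequests.addLast(request);
        notifyAll();
    }

    public synchronized void shutdown() {
        finished = true;
        notifyAll();
    }

    private void handOffDownstream(String commit) {
        // downstream processing omitted in this sketch
    }
}

A wakeup that is never delivered, or arrives before the thread re-enters wait(), would leave the loop parked exactly as in the jstack output while queuedRequests and the srvr 'Outstanding' counter keep growing, and anything that nudges the thread again, such as detaching gdb or restarting the server, would drain the backlog, which is consistent with what both Jared and Marshall observed.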