Greetings,

We just saw this problem again, and this time we were able to capture a
core file of the JVM using gdb.  I've run it through jstack and jmap to get
a heap profile.  I can see that the FollowerZooKeeperServer has
a requestsInProcess member that is ~24K.  I can also see that the
CommitProcessor's queuedRequests list has the same ~24K items in it, so the
FinalRequestProcessor's processRequest function is never getting called
to complete the requests.

The CommitProcessor's run() is doing this:

Thread 23510: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
 - org.apache.zookeeper.server.quorum.CommitProcessor.run() @bci=165, line=182 (Compiled frame)

Based on the state, it made it to wait() because isWaitingForCommit()==true
&& committedRequests.isEmpty()==true.
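
To make that condition concrete, here is a minimal sketch of the wait loop as I understand it. It is a simplified model under stated assumptions, not the actual CommitProcessor source: the class name CommitWaitSketch, the pendingCommit field, and the shouldWait(), awaitCommit(), setPending(), queue(), and backlog() methods are all invented for illustration. Only the names queuedRequests, committedRequests, and isWaitingForCommit() come from the description above.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified model of the wait condition described above.
// This is NOT the ZooKeeper CommitProcessor implementation.
final class CommitWaitSketch {
    private final Queue<String> queuedRequests = new ArrayDeque<>();
    private final Queue<String> committedRequests = new ArrayDeque<>();
    private String pendingCommit; // assumed: the request awaiting a commit, or null

    synchronized boolean isWaitingForCommit() {
        return pendingCommit != null;
    }

    // The condition from the thread dump analysis: block only while a commit
    // is expected and none has arrived yet.
    synchronized boolean shouldWait() {
        return isWaitingForCommit() && committedRequests.isEmpty();
    }

    synchronized void setPending(String request) {
        pendingCommit = request;
    }

    // New requests pile up here while the processor is blocked.
    synchronized void queue(String request) {
        queuedRequests.add(request);
        notifyAll();
    }

    synchronized int backlog() {
        return queuedRequests.size();
    }

    // A commit arriving is what should wake the waiting thread.
    synchronized void commit(String request) {
        committedRequests.add(request);
        notifyAll();
    }

    // Waits up to timeoutMs for a commit. Returns the committed request, or
    // null if nothing was committed in time (or the thread was interrupted).
    synchronized String awaitCommit(long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (shouldWait()) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return null; // still stuck waiting for a commit
            }
            try {
                wait(remaining);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
        String committed = committedRequests.poll();
        if (committed != null) {
            pendingCommit = null;
        }
        return committed;
    }
}
```

In this model, if the commit for the pending request never arrives (or its notification is missed), the thread stays in wait() while queue() keeps growing the backlog, which would match the ~24K queued requests seen in the heap profile.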

Strangely, detaching from the JVM must have woken up this thread, because
the queue then flushed out as expected, bringing everything back to normal.

I'll keep digging, but any help or direction would be appreciated as I'm
not very familiar with this area of the codebase.

Thanks!
Jared


On Tue, Feb 17, 2015 at 2:38 PM, Flavio Junqueira <
fpjunque...@yahoo.com.invalid> wrote:

> It doesn't ring a bell, but it might be worth having a look at the logs to
> see if there is anything unusual.
>
> Just to clarify, was the number of outstanding requests growing or constant?
> I suppose the server was following/leading and operations were going
> through, otherwise it'd have dropped the connection to the leader or
> leadership.
>
> -Flavio
>
> > On 17 Feb 2015, at 18:01, Marshall McMullen <marshall.mcmul...@gmail.com> wrote:
> >
> > Greetings,
> >
> > We saw an issue recently that I've never seen before and am hoping I can
> > get some clarity on what may cause this and whether it's a known issue. We
> > had a 5 node ensemble and were unable to connect to one of the ZooKeeper
> > instances.  When trying to connect with zkCli it would time out. When I
> > connected via telnet and issued the srvr four letter word, I was surprised
> > to see that this one server reported a massive number of 'Outstanding'
> > requests. I'd never seen that really be anything other than 0 before. On
> > the ZK dev guide it says:
> >
> > "outstanding is the number of queued requests, this increases when the
> > server is under load and is receiving more sustained requests than it can
> > process, ie the request queue". I looked at all the ZK servers in my
> > ensemble:
> >
> > for ip in 101 102 103 104 105; do echo srvr | nc 172.21.20.${ip} 2181 | grep Outstanding; done
> > Outstanding: 0
> > Outstanding: 0
> > Outstanding: 0
> > Outstanding: 0
> > Outstanding: 18876
> >
> > I eventually killed ZK on the affected server and everything corrected
> > itself: Outstanding went to zero and I was able to connect again.
> >
> > Is this something anyone's familiar with? I have logs if it would be
> > helpful.
> >
> > Thanks!
>
>
