I've confirmed that no time change happened before things got stuck, which
was 2 days before we noticed it.

I also just noticed that the FollowerZooKeeperServer, which should be calling
commit() on the CommitProcessor, has no elements in its pendingTxns list.
That indicates it thinks it has already passed a COMMIT message to the
CommitProcessor for every request that is stuck in the CommitProcessor's
queuedRequests list and nextPending member.
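
To make the handoff concrete, here's a minimal sketch of how I understand the
follower-side commit path (my own paraphrase, not the actual quorum classes;
names like FollowerSketch and CommitProcessorSketch are just stand-ins):

import java.util.LinkedList;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Stand-in for the real Request; the real one carries session, cxid, txn, etc.
class Request {
    final long zxid;
    Request(long zxid) { this.zxid = zxid; }
}

// Rough stand-in for CommitProcessor: requests pile up in queuedRequests and
// are only released downstream once a matching commit arrives.
class CommitProcessorSketch {
    final Queue<Request> queuedRequests = new ConcurrentLinkedQueue<>();
    final Queue<Request> committedRequests = new ConcurrentLinkedQueue<>();
    Request nextPending; // head request waiting for its COMMIT

    synchronized void commit(Request request) {
        // COMMIT forwarded from the follower: queue it and wake the run()
        // loop, which waits on this monitor when it has nothing to do.
        committedRequests.add(request);
        notifyAll();
    }
}

// Rough stand-in for FollowerZooKeeperServer's commit forwarding.
class FollowerSketch {
    final LinkedList<Request> pendingTxns = new LinkedList<>(); // proposals awaiting COMMIT
    final CommitProcessorSketch commitProcessor = new CommitProcessorSketch();

    void commit(long zxid) {
        // Called when the follower sees a COMMIT for zxid from the leader.
        // (The real code also checks that zxid matches the head of pendingTxns.)
        if (pendingTxns.isEmpty()) {
            return; // nothing pending to match this COMMIT against
        }
        Request request = pendingTxns.remove();
        commitProcessor.commit(request); // hand the request to the CommitProcessor
    }
}

If that picture is roughly right, an empty pendingTxns plus a full
queuedRequests means the COMMITs were already handed off, and the problem
looks like the CommitProcessor's run() thread never reacting to the
notifyAll().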

I guess I should open a JIRA for this at this point?

~Jared


On Tue, Mar 24, 2015 at 8:49 AM, Jared Cantwell <jared.cantw...@gmail.com>
wrote:

> I do not see any evidence of a time jump or date change on this node
> recently.  I will continue to investigate.
>
> ~Jared
>
> On Mon, Mar 23, 2015 at 6:42 PM, Patrick Hunt <ph...@apache.org> wrote:
>
>> Not this, right?
>>
>> http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6900441
>> http://osdir.com/ml/hotspot-runtime-dev-java/2013-09/msg00006.html
>>
>> https://bbossola.wordpress.com/2013/09/04/jvm-issue-concurrency-is-affected-by-changing-the-date-of-the-system/
>>
>> Patrick
>>
>>
>> On Mon, Mar 23, 2015 at 5:00 PM, Jared Cantwell
>> <jared.cantw...@gmail.com> wrote:
>> > Greetings,
>> >
>> > We just saw this problem again, and this time we were able to capture a
>> > core file of the jvm using gdb.  I've run it through jstack and jmap to
>> > get a heap profile.  I can see that the FollowerZooKeeperServer has
>> > a requestsInProcess member that is ~24K.  I can also see that the
>> > CommitProcessor's queuedRequests list has the 24K items in it, so the
>> > FinalRequestProcessor's processRequest function isn't ever getting
>> > called to complete the requests.
>> >
>> > The CommitProcessor's run() is doing this:
>> >
>> > Thread 23510: (state = BLOCKED)
>> >  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
>> >  - org.apache.zookeeper.server.quorum.CommitProcessor.run() @bci=165, line=182 (Compiled frame)
>> >
>> > Based on the state, it made it to wait() because
>> > isWaitingForCommit()==true && committedRequests.isEmpty()==true.
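>> >
>> > For reference, my reading of the loop it's waiting in is roughly the
>> > following (a paraphrase, not the actual source, so details may be off):
>> >
>> >     // Hypothetical paraphrase of the wait in CommitProcessor.run():
>> >     synchronized (this) {
>> >         // Park until a commit arrives for the pending request;
>> >         // commit() is supposed to notifyAll() us out of this wait.
>> >         while (isWaitingForCommit() && committedRequests.isEmpty()) {
>> >             wait();
>> >         }
>> >     }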
>> >
>> > Strangely, once we detached from the jvm, it must have woken up this
>> > thread and the queue flushed out as expected, bringing everything back
>> > to normal.
>> >
>> > I'll keep digging, but any help or direction would be appreciated as I'm
>> > not very familiar with this area of the codebase.
>> >
>> > Thanks!
>> > Jared
>> >
>> >
>> > On Tue, Feb 17, 2015 at 2:38 PM, Flavio Junqueira <
>> > fpjunque...@yahoo.com.invalid> wrote:
>> >
>> >> It doesn't ring a bell, but it might be worth having a look at the
>> >> logs to see if there is anything unusual.
>> >>
>> >> Just to clarify, was the number of outstanding requests growing or
>> >> constant?
>> >> I suppose the server was following/leading and operations were going
>> >> through, otherwise it'd have dropped the connection to the leader or
>> >> leadership.
>> >>
>> >> -Flavio
>> >>
>> >> On 17 Feb 2015, at 18:01, Marshall McMullen <marshall.mcmul...@gmail.com>
>> >> wrote:
>> >> >
>> >> > Greetings,
>> >> >
>> >> > We saw an issue recently that I've never seen before and am hoping I
>> >> > can get some clarity on what may cause this and whether it's a known
>> >> > issue.  We had a 5 node ensemble and were unable to connect to one of
>> >> > the ZooKeeper instances.  When trying to connect with zkCli it would
>> >> > timeout.  When I connected via telnet and issued the srvr four letter
>> >> > word, I was surprised to see that this one server reported a massive
>> >> > number of 'Outstanding' requests.  I'd never seen that really be
>> >> > anything other than 0 before.  On the ZK dev guide it says:
>> >> >
>> >> > "outstanding is the number of queued requests, this increases when the
>> >> > server is under load and is receiving more sustained requests than it
>> >> > can process, ie the request queue".  I looked at all the ZK servers in my
>> >> > ensemble:
>> >> >
>> >> > for ip in 101 102 103 104 105; do echo srvr | nc 172.21.20.${ip} 2181 | grep Outstanding; done
>> >> > Outstanding: 0
>> >> > Outstanding: 0
>> >> > Outstanding: 0
>> >> > Outstanding: 0
>> >> > Outstanding: 18876
>> >> >
>> >> > I eventually killed ZK on the affected server and everything corrected
>> >> > itself and Outstanding went to zero and I was able to connect again.
>> >> >
>> >> > Is this something anyone's familiar with? I have logs if it would be
>> >> > helpful.
>> >> >
>> >> > Thanks!
>> >>
>> >>
>>
>
>
