Yes - the problem is trying to acquire the lock in response to the NodeCache 
listener. By default, the NodeCache calls its listeners on ZooKeeper's event 
thread. This isn't well documented, unfortunately. A simple change shows 
this. In TestDriver change this:

        mCache.getListenable().addListener(listener);

to

        mCache.getListenable().addListener(listener, 
Executors.newSingleThreadExecutor());
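
With its own executor, the listener can block waiting for the lock without 
tying up the event thread. Here's a minimal plain-JDK sketch of that 
principle - the class name is made up, no Curator types are used, and 
ReentrantLock stands in for InterProcessMutex:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class ExecutorListenerSketch {
    static boolean demo() throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        ExecutorService listenerExecutor = Executors.newSingleThreadExecutor();
        CountDownLatch acquired = new CountDownLatch(1);

        lock.lock();                     // "main thread" holds the lock
        listenerExecutor.submit(() -> {  // listener fires while lock is held
            lock.lock();                 // blocks on its own thread, not the event thread
            acquired.countDown();
            lock.unlock();
        });
        lock.unlock();                   // main thread releases; listener proceeds

        boolean ok = acquired.await(5, TimeUnit.SECONDS);
        listenerExecutor.shutdown();
        return ok;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo() ? "listener acquired the lock promptly"
                                  : "listener timed out");
    }
}
```

Because the callback runs on the dedicated executor thread, blocking in it 
cannot stall whatever thread delivered the event.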

-Jordan

On Oct 18, 2013, at 9:12 AM, Chris Jeris <[email protected]> wrote:

> That test case shows the behavior I expect, but I'm not certain that it tests 
> the particular situation I'm talking about.  The thing that matters seems to 
> be that there is a listener that attempts to acquire the lock held by the 
> thing that triggered the listener - and in this case, the listener does not 
> acquire the lock immediately after its previous owner releases it, but only 
> when the listener's lock timeout expires.
> 
> I've updated my gist https://gist.github.com/ystael/7030926 with a 
> translation of my test case into Java, which exhibits the same behavior 
> (isomorphic output to the sample output already there).
> 
> thanks, Chris
> 
> 
> On Thu, Oct 17, 2013 at 5:39 PM, Jordan Zimmerman 
> <[email protected]> wrote:
> I only have a working knowledge of Clojure so it's hard for me to parse that 
> code. I wrote a quick test to make sure that acquire with a timeout is 
> working correctly. Here it is: https://gist.github.com/Randgalt/7032632
> 
> I believe this shows the behavior that you expect, right?
> 
> -Jordan
> 
> On Oct 17, 2013, at 1:41 PM, Chris Jeris <[email protected]> wrote:
> 
>> That's understood, and if the listener thread got the lock immediately after 
>> the main thread released it, that would make sense.  But look at the length 
>> of time that elapses:
>> 
>> MAIN THREAD: Acquired lock at 1382038570852, took 25
>> MAIN THREAD: Setting data to record loop-start-time 1382038570827
>> 1;1382038570827: Starting lock section at 1382038571094
>> MAIN THREAD: Released lock at 1382038571287, total elapsed 460
>> # Note here, a case where, though the lock was released by MAIN THREAD,
>> # the already-started NodeCacheListener ran out its entire timeout before
>> # grabbing the lock.
>> 1;1382038570827: Acquired lock at 1382038576106, took 5012
>> 1;1382038570827: Released lock at 1382038576420, total elapsed 5326
>> 
>> The NodeCacheListener thread could have acquired the lock just after 
>> 1382038571287, when the main thread releases it (the 200ms sleep elapses 
>> between when setData is called, triggering the listener, and when the main 
>> thread releases the lock).  But that isn't what happens.  4800 more 
>> milliseconds go by, and when the listener thread's lock timeout is just 
>> about up (or already up?), _then_ it's able to get the lock.  If you set the 
>> lock timeout in the example to 15 seconds instead of 5, then it waits 15 
>> seconds.
>> 
>> I don't understand the internal mechanics, but it almost seems as if 
>> whatever signal is supposed to be received by the listener thread's lock 
>> watcher to let it know that the lock is available is instead trapped behind 
>> the expiration of the timeout.  Sometimes.
>> 
>> thanks, Chris
>> 
>> 
>> On Thu, Oct 17, 2013 at 4:33 PM, Jordan Zimmerman 
>> <[email protected]> wrote:
>> InterProcessMutex has the same semantics as a JVM lock. It's re-entrant for 
>> the same thread but it will block other threads that try to acquire it. In 
>> the example in the gist, the main thread locks and then calls setData. The 
>> NodeCacheListener, from a different thread, tries to acquire the same lock 
>> as the main thread holds, so it will have to wait until the main thread 
>> releases that lock.
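
These semantics can be sketched with a plain JVM lock - ReentrantLock 
standing in for InterProcessMutex, and the class name made up for 
illustration:

```java
import java.util.concurrent.locks.ReentrantLock;

public class ReentrancySketch {
    static boolean demo() throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        lock.lock();                              // owning thread acquires

        boolean sameThread = lock.tryLock();      // re-entrant: succeeds
        if (sameThread) lock.unlock();            // undo the re-entrant hold

        final boolean[] otherThread = new boolean[1];
        Thread t = new Thread(() -> otherThread[0] = lock.tryLock());
        t.start();
        t.join();                                 // other thread: fails while held

        lock.unlock();
        return sameThread && !otherThread[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo() ? "re-entrant for owner, blocking for others"
                                  : "unexpected");
    }
}
```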
>> 
>> -Jordan
>> 
>> On Oct 17, 2013, at 12:43 PM, Chris Jeris <[email protected]> wrote:
>> 
>>> The issue seems to be caused by the fact that we have changes propagated by 
>>> a NodeCache on a different node, and the NodeCacheListener's implementation 
>>> attempts to lock the same lock.  Is this an instance of the "single 
>>> threaded watcher" problem you're talking about?
>>> 
>>> Here's an example that shows the problem: 
>>> https://gist.github.com/ystael/7030926
>>> 
>>> The sample output was generated by starting a single instance of the 
>>> program.  You can see a couple of cases where, although the MAIN THREAD has 
>>> released the lock, the NodeCacheListener thread which is trying to acquire 
>>> the lock doesn't succeed until its timeout has elapsed all the way ("took 
>>> 5012").  It doesn't happen every time, but it happens a substantial 
>>> fraction of the time.
>>> 
>>> If I change the implementation of listener to wrap the body of nodeChanged 
>>> in a future, so that it executes in a separate thread, the problem goes 
>>> away.  We had thought from examining the implementation of NodeCache that 
>>> Listener bodies already execute in a separate background thread; is this 
>>> not the case?
>>> 
>>> thanks, Chris
>>> 
>>> 
>>> On Thu, Oct 17, 2013 at 11:08 AM, Jordan Zimmerman 
>>> <[email protected]> wrote:
>>> This sounds suspiciously like a misuse of a watcher somewhere. Are you 
>>> doing other things with ZooKeeper? ZooKeeper watches are single threaded. 
>>> If you have a watcher that blocks, etc. it will prevent other ZooKeeper 
>>> recipes from functioning. See this Tech Note: 
>>> https://cwiki.apache.org/confluence/display/CURATOR/TN1
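
The failure mode TN1 describes can be sketched with a plain single-threaded 
executor modeling the event thread - the class name is made up, and no 
ZooKeeper code is involved:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SingleThreadWatcherSketch {
    static long demo() throws Exception {
        ExecutorService eventThread = Executors.newSingleThreadExecutor();

        // First "watcher" blocks for 500 ms, as a misbehaving watcher would.
        eventThread.submit(() -> {
            try { Thread.sleep(500); } catch (InterruptedException ignored) { }
        });

        long start = System.nanoTime();
        eventThread.submit(() -> { }).get();   // second watcher waits its turn
        eventThread.shutdown();
        return (System.nanoTime() - start) / 1_000_000;  // delay in ms
    }

    public static void main(String[] args) throws Exception {
        System.out.println("second callback delayed ~" + demo() + " ms");
    }
}
```

Because all callbacks share one dispatch thread, one blocking callback 
delays every callback queued behind it.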
>>> 
>>> Other than that, I haven't heard of this problem. If you can provide a test 
>>> case, that would help.
>>> 
>>> -Jordan
>>> 
>>> On Oct 17, 2013, at 7:47 AM, Chris Jeris <[email protected]> wrote:
>>> 
>>>> We have run into a knotty problem with InterProcessMutex, where calls to 
>>>> .acquire will expend their full timeout and then _succeed_ in acquiring 
>>>> the lock.  This generally happens when a different server process was the 
>>>> last one to hold the lock, but it does not happen every time that is the 
>>>> case.
>>>> 
>>>> The lock is a single InterProcessMutex object per server process (= server 
>>>> machine), all on a single persistent ZooKeeper node name (the object being 
>>>> access-controlled is a single piece of shared state, whose data is not 
>>>> itself stored in ZK).  The problem arises in the context of a test suite 
>>>> where requests to our server cluster are issued serially, so there is 
>>>> basically no competing traffic on the lock, although there is traffic on 
>>>> the ZK cluster from other applications.  The frequency of acquires on this 
>>>> lock is not excessive (order 1 per second), and we are reasonably certain 
>>>> our client code is not holding the lock longer than it should.
>>>> 
>>>> The problem does not seem to be sensitive to the exact value of the 
>>>> timeout.  If we set it to 15 seconds, we see lock acquires taking 15 
>>>> seconds and then succeeding; if we set it to 60 seconds, we see them 
>>>> taking 60 seconds and then succeeding.
>>>> 
>>>> Right now we observe the problem with Curator 2.1.0 against both ZK 3.3.6 
>>>> and 3.4.5.
>>>> 
>>>> Is this a known or familiar issue?  Does it sound like we're doing 
>>>> something wrong?
>>>> 
>>>> thanks, Chris Jeris
>>>> -- 
>>>> Chris Jeris
>>>> [email protected]
>>>> freenode/twitter/github: ystael
>>> 
>>> 
>> 
>> 
>> 
> 
> 
> 
> 