I have only a working knowledge of Clojure, so that code is hard for me to 
parse. I wrote a quick test to confirm that acquire with a timeout works 
correctly. Here it is: https://gist.github.com/Randgalt/7032632

I believe this shows the behavior that you expect, right?
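
In plain java.util.concurrent terms, the semantics I'd expect look like the 
sketch below (a hypothetical analogy using a JVM ReentrantLock, not the gist 
itself and not Curator code): a timed tryLock should return as soon as the 
holder releases, not only when the timeout runs out.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: a second thread does a timed tryLock while the
// first thread holds the lock for ~200 ms. If timed acquire behaves
// correctly, the waiter gets the lock shortly after the release, not
// after the full 5-second timeout.
public class TimedAcquireDemo {

    // Returns how long (ms) the waiter thread spent blocked in tryLock.
    static long timedAcquireElapsedMs() throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        CountDownLatch waiting = new CountDownLatch(1);
        AtomicLong elapsed = new AtomicLong(-1);

        lock.lock();                        // "main thread" holds the lock
        Thread waiter = new Thread(() -> {
            waiting.countDown();            // signal that we're about to block
            long begin = System.currentTimeMillis();
            try {
                // 5-second timeout, mirroring the lock timeout in the example
                if (lock.tryLock(5, TimeUnit.SECONDS)) {
                    elapsed.set(System.currentTimeMillis() - begin);
                    lock.unlock();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        waiter.start();
        waiting.await();
        Thread.sleep(200);                  // hold briefly, then release
        lock.unlock();
        waiter.join();
        return elapsed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("acquire took ~" + timedAcquireElapsedMs() + " ms");
    }
}
```

The elapsed time should come out near 200 ms, well under the 5-second 
timeout; the behavior you're reporting would correspond to it coming out 
near 5000 ms instead.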

-Jordan

On Oct 17, 2013, at 1:41 PM, Chris Jeris <[email protected]> wrote:

> That's understood, and if the listener thread got the lock immediately after 
> the main thread released it, that would make sense.  But look at the length 
> of time that elapses:
> 
> MAIN THREAD: Acquired lock at 1382038570852, took 25
> MAIN THREAD: Setting data to record loop-start-time 1382038570827
> 1;1382038570827: Starting lock section at 1382038571094
> MAIN THREAD: Released lock at 1382038571287, total elapsed 460
> # Note here, a case where, though the lock was released by MAIN THREAD,
> # the already-started NodeCacheListener ran out its entire timeout before
> # grabbing the lock.
> 1;1382038570827: Acquired lock at 1382038576106, took 5012
> 1;1382038570827: Released lock at 1382038576420, total elapsed 5326
> 
> The NodeCacheListener thread could have acquired the lock just after 
> 1382038571287, when the main thread releases it (the 200ms sleep elapses 
> between when setData is called, triggering the listener, and when the main 
> thread releases the lock).  But that isn't what happens.  4800 more 
> milliseconds go by, and when the listener thread's lock timeout is just about 
> up (or already up?), _then_ it's able to get the lock.  If you set the lock 
> timeout in the example to 15 seconds instead of 5, then it waits 15 seconds.
> 
> I don't understand the internal mechanics, but it almost seems as if whatever 
> signal is supposed to be received by the listener thread's lock watcher to 
> let it know that the lock is available is instead trapped behind the 
> expiration of the timeout.  Sometimes.
> 
> thanks, Chris
> 
> 
> On Thu, Oct 17, 2013 at 4:33 PM, Jordan Zimmerman 
> <[email protected]> wrote:
> InterProcessMutex has the same semantics as a JVM lock: it's re-entrant for 
> the same thread, but it blocks other threads that try to acquire it. In 
> the example in the gist, the main thread locks and then calls setData. The 
> NodeCacheListener, on a different thread, tries to acquire the same lock 
> that the main thread holds, so it has to wait until the main thread releases 
> that lock.
> 
> -Jordan
> 
> On Oct 17, 2013, at 12:43 PM, Chris Jeris <[email protected]> wrote:
> 
>> The issue seems to be that we have changes propagated by a NodeCache on a 
>> different node, and the NodeCacheListener's implementation attempts to 
>> acquire the same lock.  Is this an instance of the "single-threaded 
>> watcher" problem you're talking about?
>> 
>> Here's an example that shows the problem: 
>> https://gist.github.com/ystael/7030926
>> 
>> The sample output was generated by starting a single instance of the 
>> program.  You can see a couple of cases where, although the MAIN THREAD has 
>> released the lock, the NodeCacheListener thread which is trying to acquire 
>> the lock doesn't succeed until its timeout has elapsed all the way ("took 
>> 5012").  It doesn't happen every time, but it happens a substantial fraction 
>> of the time.
>> 
>> If I change the implementation of listener to wrap the body of nodeChanged 
>> in a future, so that it executes in a separate thread, the problem goes 
>> away.  We had thought from examining the implementation of NodeCache that 
>> Listener bodies already execute in a separate background thread; is this not 
>> the case?
>> 
>> thanks, Chris
>> 
>> 
>> On Thu, Oct 17, 2013 at 11:08 AM, Jordan Zimmerman 
>> <[email protected]> wrote:
>> This sounds suspiciously like a misuse of a watcher somewhere. Are you doing 
>> other things with ZooKeeper? ZooKeeper watches are single threaded. If you 
>> have a watcher that blocks, etc. it will prevent other ZooKeeper recipes 
>> from functioning. See this Tech Note: 
>> https://cwiki.apache.org/confluence/display/CURATOR/TN1
>> 
>> Other than that, I haven't heard of this problem. If you can provide a test 
>> case, that would help.
>> 
>> -Jordan
>> 
>> On Oct 17, 2013, at 7:47 AM, Chris Jeris <[email protected]> wrote:
>> 
>>> We have run into a knotty problem with InterProcessMutex, where calls to 
>>> .acquire will expend their full timeout and then _succeed_ in acquiring the 
>>> lock.  This generally happens when a different server process was the last 
>>> one to hold the lock, but it does not happen every time that is the case.
>>> 
>>> The lock is a single InterProcessMutex object per server process (= server 
>>> machine), all on a single persistent ZooKeeper node name (the object being 
>>> access controlled is a single piece of shared state, whose data is not 
>>> itself stored in ZK).  The problem arises in the context of a test suite 
>>> where requests to our server cluster are issued serially, so there is 
>>> basically no competing traffic on the lock, although there is traffic on 
>>> the ZK cluster from other applications.  The frequency of acquires on this 
>>> lock is not excessive (order 1 per second), and we are reasonably certain 
>>> our client code is not holding the lock longer than it should.
>>> 
>>> The problem does not seem to be sensitive to the exact value of the 
>>> timeout.  If we set it to 15 seconds, we see lock acquires taking 15 
>>> seconds and then succeeding; if we set it to 60 seconds, we see them taking 
>>> 60 seconds and then succeeding.
>>> 
>>> Right now we observe the problem with Curator 2.1.0 against both ZK 3.3.6 
>>> and 3.4.5.
>>> 
>>> Is this a known or familiar issue?  Does it sound like we're doing 
>>> something wrong?
>>> 
>>> thanks, Chris Jeris
>>> -- 
>>> Chris Jeris
>>> [email protected]
>>> freenode/twitter/github: ystael
>> 
>> 
>> 
>> 
>> -- 
>> Chris Jeris
>> [email protected]
>> freenode/twitter/github: ystael
> 
> 
> 
> 
> -- 
> Chris Jeris
> [email protected] (617) 686-3271
> freenode/twitter/github: ystael
