I only have a working knowledge of Clojure so it's hard for me to parse that code. I wrote a quick test to make sure that acquire with a timeout is working correctly. Here it is: https://gist.github.com/Randgalt/7032632
I believe this shows the behavior that you expect, right? -Jordan On Oct 17, 2013, at 1:41 PM, Chris Jeris <[email protected]> wrote: > That's understood, and if the listener thread got the lock immediately after > the main thread released it, that would make sense. But look at the length > of time that elapses: > > MAIN THREAD: Acquired lock at 1382038570852, took 25 > MAIN THREAD: Setting data to record loop-start-time 1382038570827 > 1;1382038570827: Starting lock section at 1382038571094 > MAIN THREAD: Released lock at 1382038571287, total elapsed 460 > # Note here, a case where, though the lock was released by MAIN THREAD, > # the already-started NodeCacheListener ran out its entire timeout before > # grabbing the lock. > 1;1382038570827: Acquired lock at 1382038576106, took 5012 > 1;1382038570827: Released lock at 1382038576420, total elapsed 5326 > > The NodeCacheListener thread could have acquired the lock just after > 1382038571287, when the main thread releases it (the 200ms sleep elapses > between when setData is called, triggering the listener, and when the main > thread releases the lock). But that isn't what happens. 4800 more > milliseconds go by, and when the listener thread's lock timeout is just about > up (or already up?), _then_ it's able to get the lock. If you set the lock > timeout in the example to 15 seconds instead of 5, then it waits 15 seconds. > > I don't understand the internal mechanics, but it almost seems as if whatever > signal is supposed to be received by the listener thread's lock watcher to > let it know that the lock is available is instead trapped behind the > expiration of the timeout. Sometimes. > > thanks, Chris > > > On Thu, Oct 17, 2013 at 4:33 PM, Jordan Zimmerman > <[email protected]> wrote: > InterProcessMutex has the same semantics as a JVM lock. It's re-entrant for > the same thread but it will block other threads that try to acquire it. In > the example in the gist, the main thread locks and then calls setData. The > NodeCacheListener, from a different thread, tries to acquire the same lock as > the main thread has so it will have to wait until the main thread release > that lock. > > -Jordan > > On Oct 17, 2013, at 12:43 PM, Chris Jeris <[email protected]> wrote: > >> The issue seems to be caused by the fact that we have changes propagated by >> a NodeCache on a different node, and the NodeCacheListener's implementation >> attempts to lock the same lock. Is this an instance of the "single threaded >> watcher" problem you're talking about? >> >> Here's an example that shows the problem: >> https://gist.github.com/ystael/7030926 >> >> The sample output was generated by starting a single instance of the >> program. You can see a couple of cases where, although the MAIN THREAD has >> released the lock, the NodeCacheListener thread which is trying to acquire >> the lock doesn't succeed until its timeout has elapsed all the way ("took >> 5012"). It doesn't happen every time, but it happens a substantial fraction >> of the time. >> >> If I change the implementation of listener to wrap the body of nodeChanged >> in a future, so that it executes in a separate thread, the problem goes >> away. We had thought from examining the implementation of NodeCache that >> Listener bodies already execute in a separate background thread; is this not >> the case? >> >> thanks, Chris >> >> >> On Thu, Oct 17, 2013 at 11:08 AM, Jordan Zimmerman >> <[email protected]> wrote: >> This sounds suspiciously like a misuse of a watcher somewhere. Are you doing >> other things with ZooKeeper? ZooKeeper watches are single threaded. If you >> have a watcher that blocks, etc. it will prevent other ZooKeeper recipes >> from functioning. See this Tech Note: >> https://cwiki.apache.org/confluence/display/CURATOR/TN1 >> >> Other than that, I haven't heard of this problem. If you can provide a test >> case, that would help. >> >> -Jordan >> >> On Oct 17, 2013, at 7:47 AM, Chris Jeris <[email protected]> wrote: >> >>> We have run into a knotty problem with InterProcessMutex, where calls to >>> .acquire will expend their full timeout and then _succeed_ in acquiring the >>> lock. This generally happens when a different server process was the last >>> one to hold the lock, but it does not happen every time that is the case. >>> >>> The lock is a single InterProcessMutex object per server process (= server >>> machine), all on a single persistent Zookeeper node name (the object being >>> access controlled is a single piece of shared state, whose data is not >>> itself stored in ZK). The problem arises in the context of a test suite >>> where requests to our server cluster are issued serially, so there is >>> basically no competing traffic on the lock, although there is traffic on >>> the ZK cluster from other applications. The frequency of acquires on this >>> lock is not excessive (order 1 per second), and we are reasonably certain >>> our client code is not holding the lock longer than it should. >>> >>> The problem does not seem to be sensitive to the exact value of the >>> timeout. If we set it to 15 seconds, we see lock acquires taking 15 >>> seconds and then succeeding; if we set it to 60 seconds, we see them taking >>> 60 seconds and then succeeding. >>> >>> Right now we observe the problem with Curator 2.1.0 against both ZK 3.3.6 >>> and 3.4.5. >>> >>> Is this a known or familiar issue? Does it sound like we're doing >>> something wrong? >>> >>> thanks, Chris Jeris >>> -- >>> Chris Jeris >>> [email protected] >>> freenode/twitter/github: ystael >> >> >> >> >> -- >> Chris Jeris >> [email protected] >> freenode/twitter/github: ystael > > > > > -- > Chris Jeris > [email protected] (617) 686-3271 > freenode/twitter/github: ystael
