[ 
https://issues.apache.org/jira/browse/CURATOR-62?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788816#comment-13788816
 ] 

Doug Jones commented on CURATOR-62:
-----------------------------------

I came across this bug because I've run into it in production. My use case may 
be unusual, but what I'm essentially doing is running an election on a fixed 
schedule (say, every hour). When its work is finished, the leader signals for 
shutdown and then exits. There's a race condition between another thread trying 
to become the leader and the handling of the shutdown event, and it can leave 
future elections in the deadlock described here. This wouldn't be an issue if 
LeaderSelector#close were completely reliable.

I can probably work around this bug, but it's definitely an issue for repeated 
elections.

> Leader Election Deadlock
> ------------------------
>
>                 Key: CURATOR-62
>                 URL: https://issues.apache.org/jira/browse/CURATOR-62
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 2.2.0-incubating
>            Reporter: Doug Jones
>            Assignee: Jordan Zimmerman
>            Priority: Minor
>             Fix For: TBD
>
>
> I've noticed that it is possible for a leader election to deadlock if a 
> thread is interrupted while it is trying to acquire the mutex for the 
> election.
> I've created a forced example of this here: 
> https://github.com/dfjones/curator/commit/544220b1e6b51c2718a7d3511a74962ff1c5ff48
> You can see the deadlock by using my modified code and running the 
> LeaderSelectorExample. Some leaders may execute, but on my system I 
> eventually see deadlock. Note that I only see deadlock when running against a 
> remote ZooKeeper server rather than the embedded test server. I'm using 
> ZooKeeper 3.4.5 on Mac OS X 10.8.4.
> From what I can tell by inspecting the ZK state and watching in the debugger, 
> the thread that is interrupted is able to successfully create the lock object 
> in ZK. However, because of the interrupt an exception is thrown and 
> LockInternals#internalLockLoop never runs. Later, in LeaderSelector#doWork, 
> when mutex.release() is called, it fails at the check for lockData.
> Once this occurs, the lock object left in ZK is the oldest, so later 
> contenders wait behind it and the election deadlocks.
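
The sequence described in the report can be modelled without Curator or ZooKeeper. The sketch below is a plain-Java illustration only, not Curator's implementation: the class, the `nodes` set (standing in for the sequential lock nodes in ZK), the `lockData` map (standing in for the mutex's per-thread bookkeeping), the exception type, and the acquire timeout are all assumptions made for the example. The timeout exists only so the demo terminates; in the reported bug the later contender simply waits.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of an orphaned lock node; no Curator or ZooKeeper APIs are used. */
public class OrphanedLockDemo {
    // Stand-in for the sequential lock nodes kept on the server.
    static final ConcurrentSkipListSet<Integer> nodes = new ConcurrentSkipListSet<>();
    // Stand-in for the client-side record of which thread holds which node.
    static final Map<Thread, Integer> lockData = new ConcurrentHashMap<>();
    static final AtomicInteger nextSeq = new AtomicInteger();

    /** Creates a node, then waits until it is the oldest or the timeout expires. */
    static boolean acquire(long timeoutMs) throws InterruptedException {
        int seq = nextSeq.getAndIncrement();
        nodes.add(seq); // the node now exists on the "server"
        if (Thread.interrupted()) {
            // The interrupt surfaces after the create but before the wait loop,
            // so lockData is never written and the node is left behind.
            throw new InterruptedException("interrupted after node " + seq + " was created");
        }
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (nodes.first() != seq) { // wait until our node is the oldest
            if (System.nanoTime() >= deadline) {
                nodes.remove(seq); // give up and clean up our own node
                return false;
            }
            Thread.sleep(5);
        }
        lockData.put(Thread.currentThread(), seq);
        return true;
    }

    /** Deletes the caller's node, but only if the client-side record exists. */
    static void release() {
        Integer seq = lockData.remove(Thread.currentThread());
        if (seq == null) {
            throw new IllegalMonitorStateException("no lockData for current thread");
        }
        nodes.remove(seq);
    }

    /** Returns true if a later contender is blocked behind the orphaned node. */
    public static boolean reproduce() {
        nodes.clear();
        lockData.clear();
        nextSeq.set(0);
        Thread.currentThread().interrupt(); // simulate an interrupt arriving mid-acquire
        try {
            acquire(100);
        } catch (InterruptedException expected) {
            // Node 0 exists, but lockData has no entry for this thread.
        }
        try {
            release();
        } catch (IllegalMonitorStateException expected) {
            // release() cannot find lockData, so node 0 is never deleted.
        }
        try {
            boolean acquired = acquire(100); // a later election attempt
            return !acquired && nodes.contains(0);
        } catch (InterruptedException e) {
            throw new IllegalStateException("unexpected interrupt", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("later contender blocked: " + reproduce());
    }
}
```

In this model the second `acquire` can never see its own node as the oldest, because node 0 was created but never recorded in `lockData` and so is never deleted by `release()`.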



--
This message was sent by Atlassian JIRA
(v6.1#6144)
