[ 
https://issues.apache.org/jira/browse/CURATOR-678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729593#comment-17729593
 ] 

Kezhu Wang commented on CURATOR-678:
------------------------------------

{quote}but we don't retry forever, we set a limit for number of retries{quote}

That is OK. {{guaranteed}} is supposed to [ignore the retry 
limit|https://github.com/apache/curator/blob/apache-curator-5.3.0/curator-framework/src/main/java/org/apache/curator/framework/imps/DeleteBuilderImpl.java#L252-L263].
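
To illustrate the point about {{guaranteed}}, here is a minimal, self-contained sketch. It does not use Curator: {{GuaranteedDeleteSketch}}, {{FakeServer}}, {{deleteGuaranteed}} and {{retryFailedDeletes}} are hypothetical names. The model assumes that a guaranteed delete still reports the failure to the caller, but remembers the path and retries it once the connection is back, independent of the foreground retry limit.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical model only; none of these types are Curator or ZooKeeper classes.
class GuaranteedDeleteSketch {
    // Stand-in for a ZooKeeper server that may be unreachable.
    static class FakeServer {
        final Set<String> nodes = new HashSet<>();
        boolean connected = true;

        void delete(String path) throws Exception {
            if (!connected) {
                throw new Exception("ConnectionLoss for " + path);
            }
            nodes.remove(path);
        }
    }

    private final FakeServer server;
    // Paths whose deletion failed and must be retried later.
    private final Queue<String> failedDeletes = new ArrayDeque<>();

    GuaranteedDeleteSketch(FakeServer server) {
        this.server = server;
    }

    // The caller still sees the exception, but the path is remembered.
    void deleteGuaranteed(String path) throws Exception {
        try {
            server.delete(path);
        } catch (Exception e) {
            failedDeletes.add(path);
            throw e;
        }
    }

    // Assumed to run after reconnect; returns how many pending paths were deleted.
    int retryFailedDeletes() {
        int deleted = 0;
        int pending = failedDeletes.size();
        for (int i = 0; i < pending; i++) {
            String path = failedDeletes.poll();
            try {
                server.delete(path);
                deleted++;
            } catch (Exception e) {
                failedDeletes.add(path); // still failing, keep it for the next round
            }
        }
        return deleted;
    }

    int pendingCount() {
        return failedDeletes.size();
    }

    public static void main(String[] args) {
        String path = "/locks/lock-0000000000"; // hypothetical path
        FakeServer server = new FakeServer();
        server.nodes.add(path);
        GuaranteedDeleteSketch client = new GuaranteedDeleteSketch(server);

        server.connected = false;
        try {
            client.deleteGuaranteed(path);
        } catch (Exception e) {
            System.out.println("release failed: " + e.getMessage());
        }

        server.connected = true;
        int deleted = client.retryFailedDeletes();
        System.out.println("deleted after reconnect: " + deleted);
        System.out.println("node still present: " + server.nodes.contains(path));
    }
}
```

In Curator itself the corresponding call is {{client.delete().guaranteed().forPath(path)}}. The sketch only models the bookkeeping and is not the library's implementation.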

{quote}
As a result, even when the zk connection recovered later, ALL following 
acquire() failed due to the inconsistency (not present in local `threadData` 
but the OLD zk node were still present).
{quote}

Is there any possibility of a reproducible test case?

{quote}a suggestion is to remove the local data ONLY after znode deletion is a 
success.{quote}
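A client-side "failure" could still be a success on the server side: the 
delete may have been applied even though the client saw a connection loss. 
Keeping the local data in that case would introduce a double leader.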
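
A rough, self-contained simulation of that scenario, with no Curator involved. {{DoubleHolderSketch}}, {{FakeLockServer}}, {{Client}} and the {{localHold}} flag (standing in for the {{threadData}} entry) are hypothetical names, and the "lost reply" is an assumed failure mode in which the server applies the delete but the client only sees a connection loss. For brevity the release methods swallow the exception instead of rethrowing it.

```java
// Hypothetical model only; none of these types are Curator or ZooKeeper classes.
class DoubleHolderSketch {
    // A single lock node; holder is null when the lock is free.
    static class FakeLockServer {
        String holder;
        boolean dropNextReply;

        boolean tryAcquire(String client) {
            if (holder != null) {
                return false;
            }
            holder = client;
            return true;
        }

        void delete(String client) throws Exception {
            if (client.equals(holder)) {
                holder = null; // the delete is applied on the server
            }
            if (dropNextReply) {
                dropNextReply = false;
                throw new Exception("ConnectionLoss"); // but the reply is lost
            }
        }
    }

    static class Client {
        final String name;
        final FakeLockServer server;
        boolean localHold; // stands in for the threadData entry

        Client(String name, FakeLockServer server) {
            this.name = name;
            this.server = server;
        }

        boolean acquire() {
            boolean ok = server.tryAcquire(name);
            if (ok) {
                localHold = true;
            }
            return ok;
        }

        // Suggested policy: clear local state only when the delete reports success.
        void releaseClearingOnlyOnSuccess() {
            try {
                server.delete(name);
                localHold = false;
            } catch (Exception e) {
                // local state is kept because the delete "failed"
            }
        }

        // Behavior described in the issue: local state is cleared regardless.
        void releaseAlwaysClearing() {
            try {
                server.delete(name);
            } catch (Exception e) {
                // ignored in this sketch
            } finally {
                localHold = false;
            }
        }
    }

    // Returns how many clients believe they hold the lock after a lost reply.
    static int holdersAfterLostReply(boolean clearOnlyOnSuccess) {
        FakeLockServer server = new FakeLockServer();
        Client a = new Client("A", server);
        Client b = new Client("B", server);
        a.acquire();
        server.dropNextReply = true;
        if (clearOnlyOnSuccess) {
            a.releaseClearingOnlyOnSuccess();
        } else {
            a.releaseAlwaysClearing();
        }
        b.acquire();
        return (a.localHold ? 1 : 0) + (b.localHold ? 1 : 0);
    }

    public static void main(String[] args) {
        System.out.println("clear only on success: " + holdersAfterLostReply(true) + " holder(s)");
        System.out.println("always clear: " + holdersAfterLostReply(false) + " holder(s)");
    }
}
```

Under these assumptions, clearing only on success leaves both A and B believing they hold the lock, while always clearing leaves only B.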

> InterProcessMutex#release caused inconsistency between zk node and local 
> cache if encountering zk connection lost
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: CURATOR-678
>                 URL: https://issues.apache.org/jira/browse/CURATOR-678
>             Project: Apache Curator
>          Issue Type: Bug
>    Affects Versions: 5.3.0
>            Reporter: Ken Huang
>            Assignee: Enrico Olivelli
>            Priority: Major
>
> We experienced a problem in which an InterProcessMutex participant acquired 
> the lock and then, while release() was running, hit a zk connection loss. This 
> left an inconsistency in the code at 
> [https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/locks/InterProcessMutex.java#L139]
> through line 143: the zk node deletion threw an exception for the connection 
> loss, but the entry was still removed from the locally cached `threadData`.
> As a result, even when the zk connection recovered later, ALL following 
> acquire() failed due to the inconsistency (not present in local `threadData` 
> but the OLD zk node were still present).
>  
> Please help confirm this behavior. I think it is a bug and curator should fix 
> the inconsistency; a suggestion is to remove the local data ONLY after znode 
> deletion is a success. Also, the same problematic code seems to appear in 
> many other similar recipes such as `InterProcessSemaphore`.
>  
> Stacktrace:
> ```
> Failed to release mutex for xxxxxxxxxxxxx
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /xxxx/_c_65fb02ef-9b1d-4c8c-b715-5c97f82ae0d3-lock-0000000000
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[zookeeper-3.6.3.jar:3.6.3]
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[zookeeper-3.6.3.jar:3.6.3]
>     at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001) ~[zookeeper-3.6.3.jar:3.6.3]
>     at org.apache.curator.framework.imps.DeleteBuilderImpl$6.call(DeleteBuilderImpl.java:313) ~[curator-framework-5.3.0.jar:5.3.0]
>     at org.apache.curator.framework.imps.DeleteBuilderImpl$6.call(DeleteBuilderImpl.java:301) ~[curator-framework-5.3.0.jar:5.3.0]
>     at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93) ~[curator-client-5.3.0.jar:?]
>     at org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:298) ~[curator-framework-5.3.0.jar:5.3.0]
>     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:282) ~[curator-framework-5.3.0.jar:5.3.0]
>     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35) ~[curator-framework-5.3.0.jar:5.3.0]
>     at org.apache.curator.framework.recipes.locks.LockInternals.deleteOurPath(LockInternals.java:347) ~[curator-recipes-5.3.0.jar:5.3.0]
>     at org.apache.curator.framework.recipes.locks.LockInternals.releaseLock(LockInternals.java:124) ~[curator-recipes-5.3.0.jar:5.3.0]
>     at org.apache.curator.framework.recipes.locks.InterProcessMutex.release(InterProcessMutex.java:154) ~[curator-recipes-5.3.0.jar:5.3.0]
>     at ... ...
> ```
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
