[ https://issues.apache.org/jira/browse/CURATOR-678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729593#comment-17729593 ]

Kezhu Wang commented on CURATOR-678:
------------------------------------

{quote}but we don't retry forever, we set a limit for number of retries{quote}

That is ok. {{guaranteed}} is supposed to [ignore retry limit|https://github.com/apache/curator/blob/apache-curator-5.3.0/curator-framework/src/main/java/org/apache/curator/framework/imps/DeleteBuilderImpl.java#L252-L263].

{quote}
As a result, even when the zk connection recovered later, ALL following acquire() failed due to the inconsistency (not present in local `threadData` but the OLD zk node were still present).
{quote}

Is there any possibility of a reproducible test case?

{quote}a suggestion is to remove the local data ONLY after znode deletion is a success.{quote}

A client-side "failure" can still be a success on the server side, so this change would introduce a double leader.

> InterProcessMutex#release caused inconsistency between zk node and local cache if encountering zk connection lost
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: CURATOR-678
>                 URL: https://issues.apache.org/jira/browse/CURATOR-678
>             Project: Apache Curator
>          Issue Type: Bug
>    Affects Versions: 5.3.0
>            Reporter: Ken Huang
>            Assignee: Enrico Olivelli
>            Priority: Major
>
> We experienced a problem: an InterProcessMutex participant acquired the lock,
> and while release() was running it encountered a zk connection loss. This left
> an inconsistency in the code at
> [https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/locks/InterProcessMutex.java#L139]
> to line 143: the zk node deletion threw an exception for the connection loss,
> but the locally cached `threadData` entry was still removed.
> As a result, even when the zk connection recovered later, ALL following
> acquire() failed due to the inconsistency (not present in local `threadData`
> but the OLD zk node were still present).
>
> Please help confirm this behavior.
> I think it is a bug and Curator should fix the inconsistency; a suggestion is
> to remove the local data ONLY after znode deletion is a success. Also, the same
> problematic code seems to appear in many other similar recipes such as
> `InterProcessSemaphore`.
>
> Stacktrace:
> ```
> Failed to release mutex for xxxxxxxxxxxxx
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /xxxx/_c_65fb02ef-9b1d-4c8c-b715-5c97f82ae0d3-lock-0000000000
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[zookeeper-3.6.3.jar:3.6.3]
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[zookeeper-3.6.3.jar:3.6.3]
>     at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001) ~[zookeeper-3.6.3.jar:3.6.3]
>     at org.apache.curator.framework.imps.DeleteBuilderImpl$6.call(DeleteBuilderImpl.java:313) ~[curator-framework-5.3.0.jar:5.3.0]
>     at org.apache.curator.framework.imps.DeleteBuilderImpl$6.call(DeleteBuilderImpl.java:301) ~[curator-framework-5.3.0.jar:5.3.0]
>     at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93) ~[curator-client-5.3.0.jar:?]
>     at org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:298) ~[curator-framework-5.3.0.jar:5.3.0]
>     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:282) ~[curator-framework-5.3.0.jar:5.3.0]
>     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35) ~[curator-framework-5.3.0.jar:5.3.0]
>     at org.apache.curator.framework.recipes.locks.LockInternals.deleteOurPath(LockInternals.java:347) ~[curator-recipes-5.3.0.jar:5.3.0]
>     at org.apache.curator.framework.recipes.locks.LockInternals.releaseLock(LockInternals.java:124) ~[curator-recipes-5.3.0.jar:5.3.0]
>     at org.apache.curator.framework.recipes.locks.InterProcessMutex.release(InterProcessMutex.java:154) ~[curator-recipes-5.3.0.jar:5.3.0]
>     at ... ...
> ```
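
For illustration only, the sequence described in the report (delete fails on connection loss, local entry is removed anyway, later acquire() fails) can be modelled without ZooKeeper. The sketch below is a toy stand-in, not Curator's code: `ToyMutex`, `serverNodes`, `connected` and `deleteNode` are hypothetical names, and only `threadData` borrows its name from the report. It also assumes no background retry of the failed delete, so it does not model what {{guaranteed}} is supposed to do.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

/** Toy model of the reported sequence. NOT Curator code; all names are hypothetical. */
class ToyMutex {
    // Stand-in for the lock znodes on the server, identified by sequence number.
    final TreeSet<Integer> serverNodes = new TreeSet<>();
    // Stand-in for the local per-thread cache that the report calls threadData.
    final Map<Thread, Integer> threadData = new HashMap<>();
    // When false, deletes fail as if the connection were lost.
    boolean connected = true;
    private int nextSeq = 0;

    /** Returns true if the lock was obtained. Never blocks in this toy model. */
    boolean acquire() {
        Thread t = Thread.currentThread();
        if (threadData.containsKey(t)) {
            return true; // already held by this thread
        }
        int ours = nextSeq++;
        serverNodes.add(ours);
        if (serverNodes.first() == ours) { // lowest sequence number owns the lock
            threadData.put(t, ours);
            return true;
        }
        serverNodes.remove(ours); // give up, as a timed-out acquire would
        return false;
    }

    /** Mirrors the reported pattern: the local entry is dropped even if the delete throws. */
    void release() {
        Thread t = Thread.currentThread();
        Integer ours = threadData.get(t);
        if (ours == null) {
            throw new IllegalStateException("lock not held by this thread");
        }
        try {
            deleteNode(ours);
        } finally {
            threadData.remove(t);
        }
    }

    /** Simulated server-side delete; fails while the connection is down. */
    void deleteNode(int seq) {
        if (!connected) {
            throw new IllegalStateException("connection loss (simulated)");
        }
        serverNodes.remove(seq);
    }
}
```

In this model, calling acquire(), setting `connected = false`, calling release() (which throws), and then restoring `connected = true` leaves the old node in `serverNodes` with no entry in `threadData`, so the next acquire() returns false. The toy leaves out the case raised in the comment above, where a delete reported as failed has in fact succeeded on the server.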