Viswanathan Rajagopal created CURATOR-604:
---------------------------------------------
Summary: Double locking issue while using InterProcessMutex lock APIs
Key: CURATOR-604
URL: https://issues.apache.org/jira/browse/CURATOR-604
Project: Apache Curator
Issue Type: Bug
Components: Recipes
Affects Versions: 2.5.0
Reporter: Viswanathan Rajagopal
Recently came across a “double locking issue” (i.e. two clients acquiring the same
lock) using Curator code (the InterProcessMutex lock APIs) in our application
*+Our use case:+*
* Two clients attempt to acquire a ZooKeeper lock using Curator's
InterProcessMutex, and whichever client owns the lock releases it once it sees the
connection drop (on receiving a ConnectionState.SUSPENDED / ConnectionState.LOST
event from its Curator connection listener)
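A minimal sketch of this release-on-disconnect pattern, using a plain-Java stand-in for the listener so it runs without a ZooKeeper ensemble (the real application wires this up with Curator's ConnectionStateListener and InterProcessMutex.isAcquiredInThisProcess()/release(); all names below are illustrative):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Stdlib-only stand-in for the release-on-disconnect pattern described above.
// In the real application the callback is a Curator ConnectionStateListener and
// `held` corresponds to InterProcessMutex.isAcquiredInThisProcess().
public class ReleaseOnDisconnect {
    enum ConnectionState { CONNECTED, SUSPENDED, LOST, RECONNECTED }

    private final AtomicBoolean held = new AtomicBoolean(false);

    void acquire()   { held.set(true); }   // stands in for InterProcessMutex.acquire()
    void release()   { held.set(false); }  // stands in for InterProcessMutex.release()
    boolean isHeld() { return held.get(); }

    // What our connection listener does: whichever client owns the lock
    // gives it up as soon as the connection is SUSPENDED or LOST.
    void stateChanged(ConnectionState newState) {
        if ((newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST)
                && isHeld()) {
            release();
        }
    }
}
```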
*+Issue we noticed:+*
* After the session expired and the clients reconnected with a new session, both
clients appeared to have acquired the lock. The interesting thing we found is that
one of the clients still held the lock while its (ephemeral) lock node was gone
*+Things we found:+*
* Based on our initial analysis and a few test runs, we saw that *_Curator's
acquire() method can acquire the lock based on an "about to be deleted" lock node
from the previous session._* *Explanation:* an ephemeral node created by the
previous session is still visible to the client that reconnected with a new
session id, until the server cleans it up. If this happens, Curator's acquire()
will hold the lock.
* We could clearly see a *_race condition (in the ZooKeeper code) between 1) the
client reconnecting to the server with a new session id and 2) the server
deleting the ephemeral nodes of the client's previous session_*.
* Given this race condition, if the client manages to reconnect to the server
with a new session id before the server cleans up the ephemeral nodes of the
client's previous session, the Curator acquire() that is trying to acquire the
lock will hold it, as it still sees the stale lock node in the ZooKeeper
directory. The server eventually cleans up the ephemeral nodes, leaving the
Curator local lock thread data stale and giving the illusion that the client
still holds the lock while its ephemeral node is gone
* Categorizing the timeline of events below to aid understanding:
[ CONNECTED ] :
** Client A and Client B call acquire(), creating nodes N1 and N2 respectively
** Client A's acquire() holds the lock, as its node N1 is the first node in the
sequence
** Client B's acquire() created node N2 and watches the previous node in the
sequence (blocked in wait())
[ SUSPENDED CONNECTION OCCURS ] :
** A network partition happens
** Curator fires SUSPENDED
** Client A, on receiving the SUSPENDED event, attempts to release the lock
** Client B, on receiving the SUSPENDED event, does nothing, as it was not
holding any lock (blocked in acquire() -> wait())
[ SESSION EXPIRED ] :
** Client A attempts to release the lock (with retries)
** Client B does nothing, as it was not holding any lock (and is still blocked in
acquire() -> wait())
** The server prepares to clean up the previous client session, deleting the lock
nodes created as part of it
[ RECONNECTED ] :
** Client A and Client B reconnect to another server with a new session id
(note: this happens before the server manages to clean up / delete the ephemeral
nodes of the previous client session)
** Client A releases the lock successfully (i.e. it deletes its lock node N1) and
attempts to acquire the lock again by creating lock node N3 and watching the
previous node in the sequence (N2)
** Client B, which was blocked in acquire() -> wait(), is notified of the
deletion of the previous node in the sequence (N1) -> it calls getChildren, sorts
the results, and checks whether its lock node is the first node in the sequence.
Client B still sees its lock node N2 (which I call the node "about to be deleted"
by the server) and thereby acquires the lock
[ AFTER A FEW SECONDS ] :
** The server manages to delete the ephemeral node N2 as part of the previous
client session's cleanup
** Client A, which was blocked in acquire() -> wait(), is notified of the
deletion of the previous node in the sequence (N2) -> it calls getChildren, sorts
the results, sees that its lock node is the first node in the sequence, and
thereby acquires the lock
** Client B's local lock thread data is now stale (as its lock node N2 has now
been deleted by the server), so both clients believe they hold the lock
* *Created a test case simulator*
[https://github.com/ViswaNXplore/curator/blob/curator-2.5.0/patch/double_locking_test_simulator/curator-recipes/src/test/java/org/apache/curator/framework/recipes/CuratorRecipeTest.java]
+Approach #1 (artificial way of reproducing the issue)+
## Run the test case with SIMULATE_ISSUE_BY_EXPLICIT_DELETE set to true.
This deletes the lock node after a pause, simulating the state where the zkClient
has reconnected and still happens to see the ephemeral node just before the
server deletes it (its session having expired); the node is then deleted by the
server afterwards.
+Approach #2 (a little manual intervention needed to reproduce the issue)+
## Run the test case in debug mode (SIMULATE_ISSUE_BY_EXPLICIT_DELETE set to
false)
## Artificially delay / pause the deletion of the ephemeral lock nodes during the
session cleanup process in the server code (the ZooKeeperServer.close() method)
## After a pause (say 5s) to let one of the instances acquire the lock,
artificially break the socket connection between client and server for 30s (by
keeping a breakpoint in the ClientCnxnSocketNIO.doIO() method). After 30s, the
session-closing logs appear on the server side
## After 1 min, remove the breakpoint in ClientCnxnSocketNIO.doIO() and resume
both Thread 2 and Thread 3
## After that, resume the server thread (thread name: "SessionTracker")
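Independently of the simulator above, the core of the timeline can be condensed into a small stdlib-only model (no ZooKeeper required; a sorted set stands in for getChildren() plus sorting, and the node names mirror N1-N3 from the timeline — this is an illustration of the mechanism, not real Curator code):

```java
import java.util.TreeSet;

// In-memory model of the sorted-children ownership check that the timeline
// above walks through: a client "holds the lock" when its node sorts first.
public class DoubleLockModel {
    // Returns { clientBThinksItHoldsLock, clientAThinksItHoldsLock }.
    static boolean[] simulate() {
        TreeSet<String> children = new TreeSet<>();  // znodes, sorted by sequence
        String n1 = "lock-0000000001";               // Client A, previous session
        String n2 = "lock-0000000002";               // Client B, previous session
        children.add(n1);
        children.add(n2);

        // [ RECONNECTED ] Client A releases N1 and re-acquires with new node N3,
        // before the server has cleaned up the previous session's ephemerals.
        children.remove(n1);
        String n3 = "lock-0000000003";
        children.add(n3);

        // Client B's watch on N1 fires; B re-checks and still sees its own node
        // N2 (the "about to be deleted" node) first in sequence -> B acquires.
        boolean bHolds = children.first().equals(n2);

        // [ AFTER A FEW SECONDS ] the server deletes N2 during session cleanup;
        // Client A's watch on N2 fires, A sees N3 first -> A acquires too.
        children.remove(n2);
        boolean aHolds = children.first().equals(n3);

        // B is never told its own node vanished, so both now "hold" the lock.
        return new boolean[] { bHolds, aHolds };
    }

    public static void main(String[] args) {
        boolean[] r = simulate();
        System.out.println("B holds: " + r[0] + ", A holds: " + r[1]);
    }
}
```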
*+Note:+*
* We are not certain whether this race condition we noticed in the ZooKeeper code
is intentional by design.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)