Adam Yi created ZOOKEEPER-4689:
----------------------------------

             Summary: Node may not be accessible due to the inconsistent ACL 
reference map after SNAP sync (again)
                 Key: ZOOKEEPER-4689
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4689
             Project: ZooKeeper
          Issue Type: Bug
    Affects Versions: 3.8.0, 3.7.0, 3.6.0
            Reporter: Adam Yi


In ZooKeeper, we take a "fuzzy snapshot": we don't make a copy of the DataTree 
or grab a lock when serializing it. Instead, we note down the zxid at which 
serialization starts, serialize the DataTree while it is still being mutated, 
and, after deserializing it, replay the transactions committed after the 
snapshot started. The idea is that those transactions should be idempotent.

ZooKeeper also implements its own ACL interning. It keeps a [long -> ACL] map 
and stores the `long` in each node, since nodes tend to share the same ACL.
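The interning scheme can be sketched like this (a toy stand-in for ZooKeeper's ReferenceCountedACLCache; the class and method names here are illustrative, and ACLs are modeled as plain strings):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of an interning ACL cache: identical ACLs share one long id.
public class AclCacheSketch {
    private final Map<Long, String> longToAcl = new HashMap<>();
    private final Map<String, Long> aclToLong = new HashMap<>();
    private long aclIndex = 0; // highest id handed out so far

    // Intern an ACL (modeled as a String here); reuse the id if already present.
    public long convertAcls(String acl) {
        Long existing = aclToLong.get(acl);
        if (existing != null) {
            return existing;
        }
        long id = ++aclIndex;
        longToAcl.put(id, acl);
        aclToLong.put(acl, id);
        return id;
    }

    // Resolve an id back to its ACL; null means a dangling reference.
    public String convertLong(long id) {
        return longToAcl.get(id);
    }
}
```

Each node then stores only the `long` returned by `convertAcls`, so nodes sharing an ACL share one cache entry.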

When serializing the DataTree, we first serialize the ACL cache and then the 
nodes. With the following sequence, a node can end up pointing to an ACL entry 
that is missing from the serialized cache:

1. Serialize ACL
2. Create node with new ACL
3. Serialize node
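Under the assumption that ACLs are plain strings and the serialized ACL map is a plain map copy, the three steps above can be replayed as a toy sequence (hypothetical names, not the real serializer):

```java
import java.util.HashMap;
import java.util.Map;

// Toy demonstration of the fuzzy-snapshot race: the ACL map is serialized
// before a node with a brand-new ACL is created and serialized.
public class FuzzySnapshotRace {
    public static boolean snapshotHasDanglingAcl() {
        Map<Long, String> aclCache = new HashMap<>();
        aclCache.put(10L, "world:anyone:r");

        // 1. Serialize the ACL cache (the copy is what lands in the snapshot).
        Map<Long, String> snapshotAcls = new HashMap<>(aclCache);

        // 2. A concurrent txn creates a node with an unseen ACL -> new entry 11.
        long newId = 11L;
        aclCache.put(newId, "digest:user:rw");

        // 3. Serialize the node: it carries ACL id 11, absent from snapshotAcls.
        long serializedNodeAclId = newId;
        return !snapshotAcls.containsKey(serializedNodeAclId);
    }
}
```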

ZOOKEEPER-3306 fixes this by inserting the ACL into the cache when 
`DataTree.createNode` is called during transaction replay and the node already 
exists. However, it only inserts the entry into the cache; it does not update 
the node's interned ACL to point to the new entry.

It's possible for the longval we get for an ACL to be inconsistent, even though 
we follow the same zxid ordering of events. Specifically, we keep an aclIndex 
pointing to the highest entry that currently exists and increment it whenever 
we need to intern an ACL we've never seen before.

With ZOOKEEPER-2214, we started doing reference counting in the ACL cache and 
removing no-longer-used entries from it.

Say the current aclIndex is 10. If we create a node with a previously unseen 
ACL and then delete that node, aclIndex increments to 11. However, when we 
deserialize the tree, we set aclIndex to the highest existing ACL cache entry, 
so it reverts to 10. The aclIndex inconsistency on its own is fine, but it 
breaks the ZOOKEEPER-3306 patch.
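A minimal sketch of that divergence, assuming a TreeMap stands in for the ACL cache (illustrative names only):

```java
import java.util.TreeMap;

// Sketch of how aclIndex diverges: after an entry is created and removed,
// a server that rebuilds the cache from a snapshot restarts the counter at
// the highest surviving key, so its next intern reuses an id the leader
// has already moved past.
public class AclIndexRevert {
    public static long nextIdAfterRebuild() {
        TreeMap<Long, String> cache = new TreeMap<>();
        for (long i = 1; i <= 10; i++) {
            cache.put(i, "acl-" + i);
        }

        long aclIndex = 10;
        cache.put(++aclIndex, "ephemeral-acl"); // leader interns id 11
        cache.remove(11L);                      // refcount hits zero, entry removed
        // The leader keeps aclIndex = 11, so its next new ACL gets id 12.

        // A server deserializing the tree sets aclIndex := max surviving key.
        long rebuiltIndex = cache.lastKey();    // reverted to 10
        return rebuiltIndex + 1;                // its next new ACL gets 11, not 12
    }
}
```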

Now if we follow the same scenario mentioned in ZOOKEEPER-3306:
 # The leader creates ACL entry 11 and then deletes it due to a node deletion
 # Server A starts a SNAP sync with the leader
 # After the ACL map has been serialized to Server A, a txn T1 creates a node 
N1 with a new ACL_1 that did not exist in the ACL map
 # On the leader, after this txn, the ACL map contains 12 -> (ACL_1, COUNT: 1), 
and the data tree has N1 -> 12
 # On Server A, the fuzzy snapshot contains an ACL map with max ID 10, and 
N1 -> 12
 # When replaying txn T1, Server A adds 11 -> (ACL_1, COUNT: 1) to the ACL 
cache, but node N1 still points to 12

So N1 points to an invalid ACL entry.
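The whole scenario can be replayed end to end with a toy model (hypothetical names; ACLs modeled as strings):

```java
import java.util.HashMap;
import java.util.Map;

// End-to-end toy replay of the scenario above: the snapshotted node keeps
// ACL id 12 while the replayed txn re-interns the ACL under id 11, leaving
// the node pointing at a missing entry.
public class SnapSyncDivergence {
    public static String lookupNodeAclAfterReplay() {
        // Server A's state after deserializing the fuzzy snapshot:
        Map<Long, String> aclCache = new HashMap<>();
        for (long i = 1; i <= 10; i++) {
            aclCache.put(i, "acl-" + i);
        }
        long aclIndex = 10;       // reset to the max surviving key
        long nodeN1AclId = 12L;   // N1 was serialized pointing at 12

        // Replaying txn T1: the ZOOKEEPER-3306 patch inserts the ACL into
        // the cache, but the node's stored id is left untouched.
        aclCache.put(++aclIndex, "ACL_1"); // lands at 11, not 12

        return aclCache.get(nodeN1AclId);  // null: dangling reference
    }
}
```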

There are two ways to fix this:
 # Make aclIndex consistent across deserialization (by either serializing it 
in the snapshot or carefully decrementing it when removing cache entries)
 # Fix the ZOOKEEPER-3306 patch so that we also repoint the node's ACL to the 
new key when the previous entry does not exist in the ACL table.
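Solution 2 might be sketched roughly as follows — a hypothetical helper, not the actual DataTree code — where replay repoints a node whose stored ACL id is missing from the cache:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of fix 2: on replay, if the node already exists but its ACL id is
// no longer in the cache, re-intern the ACL and point the node at the new id.
public class Fix2Sketch {
    static final Map<Long, String> aclCache = new HashMap<>();
    static final Map<String, Long> nodeAcl = new HashMap<>(); // path -> acl id
    static long aclIndex = 0;

    // Intern an ACL (modeled as a String), reusing an existing id if present.
    static long intern(String acl) {
        for (Map.Entry<Long, String> e : aclCache.entrySet()) {
            if (e.getValue().equals(acl)) {
                return e.getKey();
            }
        }
        aclCache.put(++aclIndex, acl);
        return aclIndex;
    }

    // Replay of a create txn for a node that already exists in the tree.
    static void replayCreate(String path, String acl) {
        long id = intern(acl);
        Long current = nodeAcl.get(path);
        if (current == null || !aclCache.containsKey(current)) {
            nodeAcl.put(path, id); // the fix: repoint the node's ACL id
        }
    }
}
```

With the state from the scenario above (aclIndex reverted to 10, N1 dangling at 12), replaying T1 would both insert the ACL at 11 and repoint N1 to 11.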

 

I think solution 2 is nicer, as the aclIndex inconsistency itself is not a 
problem. With solution 1, we would still be implicitly depending on aclIndex 
consistency and event ordering, which is harder to reason about and seems more 
fragile than solution 2.

I'm going to send a patch for solution 2, but please let me know if you 
disagree and I'm happy to go with solution 1 instead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
