[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ZOOKEEPER-4689:
--------------------------------------
    Labels: pull-request-available  (was: )

> Node may not be accessible due to the inconsistent ACL reference map after 
> SNAP sync (again)
> ------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4689
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4689
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.6.0, 3.7.0, 3.8.0
>            Reporter: Adam Yi
>            Priority: Critical
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In ZooKeeper, we take a "fuzzy snapshot": we don't copy the DataTree or grab
> a lock when serializing it. Instead, we note down the zxid at which we start
> serializing the DataTree, serialize it while it is still being mutated, and
> replay the transactions committed after that point once the DataTree has
> been deserialized. The idea is that those transactions should be idempotent.
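> As a rough sketch (illustrative only; `committedTxnsAfter` is a hypothetical
> helper, not the real ZooKeeper code path):
> ```java
> // Fuzzy snapshot sketch: serialize without locking, then replay later txns.
> long snapZxid = dataTree.lastProcessedZxid;  // zxid noted before serializing
> dataTree.serialize(oa, "tree");              // the tree keeps mutating meanwhile
>
> // Receiving side: after deserializing the (possibly inconsistent) tree,
> // replay every committed txn with zxid > snapZxid. Replays must be idempotent
> // because some of their effects may already be captured in the snapshot.
> for (Request r : committedTxnsAfter(snapZxid)) {   // hypothetical helper
>     dataTree.processTxn(r.getHdr(), r.getTxn());
> }
> ```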
> ZooKeeper also implements its own ACL interning. It keeps a [long -> ACL]
> map and stores the `long` in each node, since nodes tend to share the same
> ACL.
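> Conceptually (a simplified sketch; the real cache is ReferenceCountedACLCache
> and the interned key lives in each DataNode):
> ```java
> // Nodes don't carry ACL lists, only an interned long key into a shared map.
> Map<Long, List<ACL>> aclCache = new HashMap<>(); // long -> ACL list
> Map<String, Long> nodeAclKey = new HashMap<>();  // path -> interned ACL key
>
> nodeAclKey.put("/app/a", 7L);                    // many nodes share one key
> nodeAclKey.put("/app/b", 7L);
> List<ACL> acls = aclCache.get(nodeAclKey.get("/app/a")); // resolved on access;
>                                                          // null = dangling key
> ```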
> When serializing the DataTree, we first serialize the ACL cache and then
> serialize the nodes. With the following sequence, a node can end up pointing
> to an invalid ACL entry:
> 1. Serialize ACL
> 2. Create node with new ACL
> 3. Serialize node
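> In code terms, the DataTree serialization order is roughly the following, and
> the window between the two steps is where the stale reference slips in:
> ```java
> // Rough shape of DataTree.serialize (sketch): the ACL cache is written before
> // the nodes, and no lock is held across the two steps.
> aclCache.serialize(oa);   // (1) ACL map as of "now"
>                           // (2) a create with a brand-new ACL can run here
> serializeNodes(oa);       // (3) that node is written with an ACL key that
>                           //     step (1) never saw
> ```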
> ZOOKEEPER-3306 fixes this by making sure the ACL is inserted into the cache
> when `DataTree.createNode` is called while replaying transactions and the
> node already exists. However, we only insert it into the cache; we do not
> update the node's interned ACL to point to the new entry.
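> The patched replay path is roughly the following (a sketch based on the
> description above, not the literal diff):
> ```java
> // Sketch of DataTree.createNode during txn replay after ZOOKEEPER-3306:
> Long newKey = aclCache.convertAcls(acl);   // intern the ACL carried by the txn
> DataNode existing = nodes.get(path);
> if (existing != null) {
>     // The node was already captured by the fuzzy snapshot. The ACL is now in
>     // the cache under newKey, but `existing` keeps whatever key it was
>     // serialized with, which is exactly the gap this issue describes.
>     throw new KeeperException.NodeExistsException();
> }
> ```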
> It's possible that the longval we get for an ACL is inconsistent across
> servers, even though they follow the same zxid ordering of events.
> Specifically, we keep an [aclIndex] pointing to the highest entry that
> currently exists and increment it whenever we need to intern an ACL we've
> never seen before.
> With ZOOKEEPER-2214, we started doing reference counting in the ACL cache
> and removing entries that are no longer used.
> Say the current aclIndex is 10. If we create a node with a previously unseen
> ACL and then delete that node, aclIndex increments to 11 while the new cache
> entry is removed again once its reference count drops to zero. However, when
> a server deserializes the tree, it sets aclIndex to the highest ACL cache
> entry that still exists, so it reverts to 10. The aclIndex inconsistency on
> its own is fine, but it breaks the ZOOKEEPER-3306 patch.
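> Sketched against the simplified cache above (names are assumed; the real
> logic lives in ReferenceCountedACLCache):
> ```java
> // Leader: aclIndex is 10; a create with an unseen ACL hands out key 11 ...
> long key = cache.convertAcls(newAcl);   // key == 11, aclIndex == 11
> // ... and the matching delete drops the entry once its refcount reaches 0.
> cache.removeUsage(key);                 // entry 11 is gone, aclIndex stays 11
>
> // Follower: deserializing the snapshot rebuilds aclIndex from the keys that
> // still exist in the serialized map, so it comes back as 10, not 11.
> aclIndex = longKeyMap.isEmpty() ? 0 : Collections.max(longKeyMap.keySet());
> ```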
> Now if we follow the same scenario mentioned in ZOOKEEPER-3306:
>  # Leader creates ACL entry 11 and then removes it again when the node that
> used it is deleted
>  # Server A starts a SNAP sync with the leader
>  # After the ACL map has been serialized to Server A, a txn T1 creates a node
> N1 with a new ACL_1 that did not exist in the ACL map
>  # On the leader, after this txn, the ACL map has 12 -> (ACL_1, COUNT: 1),
> and the data tree has N1 -> 12
>  # On server A, the fuzzy snapshot contains an ACL map whose max ID is 10,
> and N1 -> 12 in the data tree
>  # When replaying txn T1, server A adds 11 -> (ACL_1, COUNT: 1) to the ACL
> cache, but the node N1 still points to 12.
> N1 therefore still points to an invalid ACL entry.
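> Written out, the end state is:
> ```java
> // Leader after T1:
> //   ACL cache: { ..., 12 -> (ACL_1, COUNT: 1) }   N1.acl == 12  -> resolves
> // Server A after SNAP sync + replay of T1:
> //   ACL cache: { ..., 11 -> (ACL_1, COUNT: 1) }   N1.acl == 12  -> dangling
> List<ACL> acls = aclCache.get(12L);   // null on server A, so getACL on N1
>                                       // fails and the node is inaccessible
> ```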
> There are two ways to fix this:
>  # Make aclIndex consistent upon deserialization (by either serializing it
> in the snapshot or carefully decrementing it when cache entries are removed)
>  # Fix the ZOOKEEPER-3306 patch so that we also repoint the node's ACL to the
> new key if its previous entry no longer exists in the ACL table.
>  
> I think solution 2 is nicer, as the aclIndex inconsistency itself is not a
> problem. With solution 1, we would still be implicitly depending on aclIndex
> consistency and the ordering of events; that is harder to reason about and
> seems more fragile than solution 2.
> I'm going to send a patch for solution 2, but please let me know if you
> disagree; I'm happy to go with solution 1 instead.
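> For reference, solution 2 would look roughly like this inside the
> existing-node branch of `DataTree.createNode` (a sketch of the intent, not the
> actual patch; it assumes convertLong returns null for a missing key):
> ```java
> Long newKey = aclCache.convertAcls(acl);      // intern the ACL from the txn
> DataNode existing = nodes.get(path);
> if (existing != null) {
>     synchronized (existing) {
>         if (aclCache.convertLong(existing.acl) == null) {
>             // The key captured by the fuzzy snapshot no longer resolves:
>             // repoint the node at the freshly interned entry.
>             existing.acl = newKey;
>         }
>     }
>     throw new KeeperException.NodeExistsException();
> }
> ```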



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
