[ https://issues.apache.org/jira/browse/ZOOKEEPER-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ZOOKEEPER-4689:
--------------------------------------
    Labels: pull-request-available  (was: )

> Node may not be accessible due to the inconsistent ACL reference map after SNAP sync (again)
> --------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4689
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4689
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.6.0, 3.7.0, 3.8.0
>            Reporter: Adam Yi
>            Priority: Critical
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In ZooKeeper, we take a "fuzzy snapshot": we do not copy the DataTree or grab a lock when serializing it. Instead, we note the zxid at which serialization starts, serialize the DataTree while it is being mutated, and replay the transactions issued after the snapshot started once the DataTree has been deserialized. The idea is that those transactions should be idempotent.
> ZooKeeper also implements its own interned ACLs. It keeps a [long -> ACL] map and stores the `long` in each node, since nodes tend to share the same ACL.
> When serializing the DataTree, we first serialize the ACL cache and then serialize the nodes. With the following sequence, a node can end up pointing to an ACL entry that is missing from the serialized cache:
> 1. Serialize the ACL cache
> 2. Create a node with a new ACL
> 3. Serialize the node
> ZOOKEEPER-3306 fixes this by making sure to insert the ACL into the cache when `DataTree.createNode` is called during transaction replay and the node already exists. However, it only inserts the ACL into the cache; it does not update the interned ACL reference stored in the node to point to the new entry.
> The long value we get for the ACL can be inconsistent, even though we follow the same zxid ordering of events.
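The race in steps 1-3 above can be sketched in a few lines. This is a simplified, hypothetical model: `TinyAclCache`, `convertAcls`, and the other names are illustrative, not ZooKeeper's real `ReferenceCountedACLCache` API.

```java
import java.util.HashMap;
import java.util.Map;

public class FuzzySnapshotRace {

    // Minimal stand-in for the interned [long -> ACL] cache.
    static class TinyAclCache {
        long aclIndex = 0;                          // highest key minted so far
        final Map<Long, String> longToAcl = new HashMap<>();
        final Map<String, Long> aclToLong = new HashMap<>();

        // Intern an ACL: reuse an existing key, or mint a new one.
        long convertAcls(String acl) {
            Long key = aclToLong.get(acl);
            if (key != null) {
                return key;
            }
            aclIndex++;
            longToAcl.put(aclIndex, acl);
            aclToLong.put(acl, aclIndex);
            return aclIndex;
        }
    }

    // Replays steps 1-3 and reports whether the snapshotted node's
    // ACL key is present in the snapshotted cache.
    static boolean nodeKeyPresentInSnapshot() {
        TinyAclCache cache = new TinyAclCache();
        Map<String, Long> tree = new HashMap<>();   // node path -> interned ACL key

        // 1. "Serialize" the ACL cache: the snapshot captures a copy taken now.
        Map<Long, String> snapshotCache = new HashMap<>(cache.longToAcl);

        // 2. A node is created with a brand-new ACL *after* step 1.
        tree.put("/n1", cache.convertAcls("world:anyone:r"));

        // 3. "Serialize" the node: the snapshot records its ACL key.
        long snapshotKeyForN1 = tree.get("/n1");

        return snapshotCache.containsKey(snapshotKeyForN1);
    }

    public static void main(String[] args) {
        System.out.println("node's ACL key present in snapshot cache: "
                + nodeKeyPresentInSnapshot());   // prints "false"
    }
}
```

The snapshotted node references a key that was minted after the cache was serialized, so a server restoring from this snapshot alone would see a dangling ACL reference; the replay path patched in ZOOKEEPER-3306 is supposed to repair this.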
> Specifically, we keep an index, aclIndex, that points to the highest entry currently in the cache, and we increment it whenever we need to intern an ACL we have never seen before.
> With ZOOKEEPER-2214, we started reference counting entries in the ACL cache and removing entries that are no longer used.
> Say the current aclIndex is 10. If we create a node with a previously unseen ACL and then delete that node, aclIndex is incremented to 11. However, when we deserialize the tree, we set aclIndex to the highest existing ACL cache entry, so it reverts to 10. The aclIndex inconsistency is harmless on its own, but it breaks the ZOOKEEPER-3306 patch.
> Now follow the same scenario described in ZOOKEEPER-3306:
> # The leader creates ACL entry 11 and then deletes it due to a node deletion
> # Server A starts a SNAP sync with the leader
> # After the ACL map has been serialized to server A, a txn T1 creates a node N1 with a new ACL_1 that did not exist in the ACL map
> # On the leader, after this txn, the ACL map contains 12 -> (ACL_1, COUNT: 1), and the data tree has N1 -> 12
> # On server A, the fuzzy snapshot has an ACL map with max ID 10, and N1 -> 12
> # When replaying txn T1, server A adds 11 -> (ACL_1, COUNT: 1) to the ACL cache, but node N1 still points to 12
> N1 therefore still points to an invalid ACL entry.
> There are two ways to fix this:
> # Make aclIndex consistent across deserialization (by either serializing it in the snapshot or taking care to decrement it when removing cache entries)
> # Extend the ZOOKEEPER-3306 patch so that we also overwrite the node's ACL reference with the new key when the previous entry does not exist in the ACL table
>
> I think solution 2 is nicer, since the aclIndex inconsistency is not itself a problem. With solution 1, we would still be implicitly depending on aclIndex consistency and the ordering of events; that is harder to reason about and seems more fragile than solution 2.
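The aclIndex divergence and the proposed fix can be sketched as follows. Again this is a hypothetical model: `ReplayCache` and its members are illustrative names, not ZooKeeper's actual classes; the "solution 2" step at the end mirrors the proposed patch, not committed code.

```java
import java.util.HashMap;
import java.util.Map;

public class AclIndexDivergence {

    // Minimal stand-in for the ACL cache as rebuilt from a deserialized snapshot.
    static class ReplayCache {
        long aclIndex;
        final Map<Long, String> longToAcl = new HashMap<>();
        final Map<String, Long> aclToLong = new HashMap<>();

        ReplayCache(Map<Long, String> deserialized) {
            longToAcl.putAll(deserialized);
            for (Map.Entry<Long, String> e : deserialized.entrySet()) {
                aclToLong.put(e.getValue(), e.getKey());
            }
            // On deserialization, aclIndex is reset to the max existing key.
            // Entries minted and already removed on the leader are forgotten.
            aclIndex = deserialized.keySet().stream()
                    .mapToLong(Long::longValue).max().orElse(0);
        }

        long convertAcls(String acl) {
            Long key = aclToLong.get(acl);
            if (key != null) {
                return key;
            }
            longToAcl.put(++aclIndex, acl);
            aclToLong.put(acl, aclIndex);
            return aclIndex;
        }
    }

    public static void main(String[] args) {
        // Server A's snapshot: cache entries 1..10 survived, but node N1 was
        // fuzzily serialized pointing at key 12 (minted later on the leader).
        Map<Long, String> snapshot = new HashMap<>();
        for (long i = 1; i <= 10; i++) {
            snapshot.put(i, "acl-" + i);
        }
        ReplayCache cache = new ReplayCache(snapshot);
        Map<String, Long> tree = new HashMap<>();
        tree.put("/n1", 12L);

        // Replaying T1: the ZOOKEEPER-3306 patch inserts ACL_1 into the cache,
        // but it is minted as key 11, not 12...
        long newKey = cache.convertAcls("ACL_1");

        // ...so without solution 2 the node keeps a dangling reference:
        System.out.println("node -> " + tree.get("/n1") + ", valid: "
                + cache.longToAcl.containsKey(tree.get("/n1")));   // 12, false

        // Solution 2: during replay, also overwrite the node's ACL reference
        // when its previous entry is missing from the table.
        if (!cache.longToAcl.containsKey(tree.get("/n1"))) {
            tree.put("/n1", newKey);
        }
        System.out.println("after fix: node -> " + tree.get("/n1") + ", valid: "
                + cache.longToAcl.containsKey(tree.get("/n1")));   // 11, true
    }
}
```

Note that solution 2 leaves aclIndex itself divergent between leader and follower; it only guarantees that every node's reference resolves to a live cache entry, which is the invariant that actually matters for access checks.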
> I'm going to send a patch for solution 2, but please let me know if you disagree and I'm happy to go with solution 1 instead.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)