Adam Yi created ZOOKEEPER-4689:
----------------------------------
Summary: Node may not be accessible due to the inconsistent ACL
reference map after SNAP sync (again)
Key: ZOOKEEPER-4689
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4689
Project: ZooKeeper
Issue Type: Bug
Affects Versions: 3.8.0, 3.7.0, 3.6.0
Reporter: Adam Yi
In ZooKeeper, we take a "fuzzy snapshot". That means we don't make a copy of
the DataTree or grab a lock when serializing it. Instead, we note the zxid at
which serialization starts, serialize the DataTree while it is still being
mutated, and, after deserializing it, replay the transactions that arrived
after the snapshot began. The idea is that those transactions are idempotent.
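As a minimal, self-contained sketch of that idea (plain Java maps standing in for the real DataTree and txn log; none of these names are ZooKeeper's):

{code:java}
import java.util.HashMap;
import java.util.Map;

public class FuzzySnapshotSketch {
    public static void main(String[] args) {
        Map<String, String> tree = new HashMap<>();
        tree.put("/a", "v1");
        long lastProcessedZxid = 1;

        // 1. Note the zxid at which serialization starts -- no lock, no copy.
        long snapZxid = lastProcessedZxid;

        // 2. Serialize the tree while writes keep landing (here: a shallow copy).
        Map<String, String> snapshot = new HashMap<>(tree);
        tree.put("/b", "v2");                 // txn 2 races with serialization
        lastProcessedZxid = 2;

        // 3. On restore, load the snapshot and replay every txn newer than snapZxid.
        Map<String, String> restored = new HashMap<>(snapshot);
        restored.put("/b", "v2");             // replaying txn 2; must be idempotent
        System.out.println("snapshot zxid " + snapZxid + ": " + restored);  // {/a=v1, /b=v2}
    }
}
{code}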
ZooKeeper also interns ACLs: it keeps a [long -> ACL] map and stores the
`long` in each node, since many nodes tend to share the same ACL.
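Roughly, the interning looks like this standalone sketch (the real implementation is `ReferenceCountedACLCache`; plain strings stand in for ACL lists here):

{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AclInterningSketch {
    private final Map<Long, List<String>> longKeyMap = new HashMap<>(); // long -> ACL
    private final Map<List<String>, Long> aclKeyMap = new HashMap<>();  // ACL -> long
    private long aclIndex = 0;  // highest key handed out so far

    // Intern an ACL list: reuse the existing key, or mint aclIndex + 1.
    synchronized long convertAcls(List<String> acls) {
        Long key = aclKeyMap.get(acls);
        if (key != null) {
            return key;
        }
        long newKey = ++aclIndex;
        longKeyMap.put(newKey, acls);
        aclKeyMap.put(acls, newKey);
        return newKey;
    }

    // Resolve the long stored in a node back to its ACL list.
    synchronized List<String> convertLong(long key) {
        return longKeyMap.get(key);
    }

    public static void main(String[] args) {
        AclInterningSketch cache = new AclInterningSketch();
        long k1 = cache.convertAcls(List.of("world:anyone:cdrwa"));
        long k2 = cache.convertAcls(List.of("world:anyone:cdrwa"));
        System.out.println(k1 == k2);               // true: nodes share one entry
        System.out.println(cache.convertLong(k1));  // [world:anyone:cdrwa]
    }
}
{code}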
When serializing the DataTree, we first serialize the ACL cache and then the
nodes. With the following interleaving, a node in the snapshot can point to an
ACL entry the snapshot never recorded (see the sketch after this list):
1. Serialize the ACL cache
2. Create a node with a new ACL
3. Serialize the nodes
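The window can be sketched with two plain maps (hypothetical stand-ins, not the real DataTree or cache code): the ACL section of the snapshot is written before the nodes, so a create in between leaves a dangling reference.

{code:java}
import java.util.HashMap;
import java.util.Map;

public class SnapshotOrderSketch {
    public static void main(String[] args) {
        Map<Long, String> aclCache = new HashMap<>();
        Map<String, Long> nodes = new HashMap<>();
        aclCache.put(10L, "acl-old");
        nodes.put("/old", 10L);

        // 1. Serialize the ACL cache first.
        Map<Long, String> snapAcls = new HashMap<>(aclCache);

        // 2. A create with a never-seen ACL lands while the snapshot is in flight.
        aclCache.put(11L, "acl-new");
        nodes.put("/new", 11L);

        // 3. Serialize the nodes afterwards.
        Map<String, Long> snapNodes = new HashMap<>(nodes);

        // In the snapshot, /new points at key 11, which its ACL section never saw.
        System.out.println(snapNodes.get("/new"));    // 11
        System.out.println(snapAcls.get(11L));        // null -> dangling reference
    }
}
{code}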
ZOOKEEPER-3306 fixes this by inserting the ACL into the cache when
`DataTree.createNode` is called during transaction replay and the node already
exists. However, it only inserts the entry into the cache; it does not repoint
the node's interned ACL at the new entry.
On top of that, the long key we get for a given ACL can differ between servers
even though they process the same events in the same zxid order. Specifically,
we keep an `aclIndex` that tracks the highest entry handed out so far and
increment it whenever we intern an ACL we have never seen before.
Since ZOOKEEPER-2214, the ACL cache is reference counted and entries that are
no longer used are removed from it.
Say the current aclIndex is 10. If we create a node with a previously unseen
ACL and then delete that node, aclIndex is incremented to 11 while the new
cache entry is removed again. However, when we deserialize the tree, aclIndex
is set to the highest ACL cache entry that still exists, so it is reverted to
10. The aclIndex inconsistency is harmless on its own, but it breaks the
ZOOKEEPER-3306 patch.
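A small sketch of that drift, again with stand-in maps rather than the real cache:

{code:java}
import java.util.HashMap;
import java.util.Map;

public class AclIndexDriftSketch {
    public static void main(String[] args) {
        Map<Long, String> aclCache = new HashMap<>();
        long aclIndex = 10;                       // 10 entries already interned
        aclCache.put(aclIndex, "acl-10");

        // Create a node with a brand-new ACL, then delete the node again.
        aclIndex++;                               // leader's aclIndex is now 11
        aclCache.put(aclIndex, "acl-11");
        aclCache.remove(11L);                     // refcount hit zero -> entry purged

        // Deserializing a snapshot rebuilds aclIndex from the surviving entries.
        long restoredIndex =
            aclCache.keySet().stream().mapToLong(Long::longValue).max().orElse(0);
        System.out.println(aclIndex);       // 11 on the leader: its next new ACL gets 12
        System.out.println(restoredIndex);  // 10 after restore: the next new ACL gets 11
    }
}
{code}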
Now follow the same scenario as in ZOOKEEPER-3306:
# The leader creates ACL entry 11 and then removes it again because the node that used it is deleted
# Server A starts a SNAP sync with the leader
# After the ACL map has been serialized to Server A, a txn T1 creates a node N1 with a new ACL_1 that did not exist in the ACL map
# On the leader, after this txn, the ACL map contains 12 -> (ACL_1, COUNT: 1) and the data tree has N1 -> 12
# On Server A, the fuzzy snapshot contains an ACL map whose max ID is 10, and N1 -> 12
# When replaying T1, Server A adds 11 -> (ACL_1, COUNT: 1) to the ACL cache, but N1 still points to 12
So N1 still points to a nonexistent ACL entry.
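Putting the steps together, the follower ends up in roughly this state (stand-in maps again, mirroring steps 5 and 6 above):

{code:java}
import java.util.HashMap;
import java.util.Map;

public class SnapSyncReplaySketch {
    public static void main(String[] args) {
        // Server A right after loading the fuzzy snapshot (step 5):
        Map<Long, String> aclCache = new HashMap<>();
        aclCache.put(10L, "acl-10");              // max cache key, so aclIndex = 10
        long aclIndex = 10;
        Map<String, Long> nodes = new HashMap<>();
        nodes.put("/N1", 12L);                    // N1 was serialized pointing at 12

        // Replaying T1 (step 6): ACL_1 is re-interned and gets key 11,
        // but N1's stored key is left untouched.
        aclIndex++;
        aclCache.put(aclIndex, "ACL_1");          // 11 -> ACL_1

        // Any later ACL lookup for N1 dereferences key 12 and finds nothing.
        System.out.println(aclCache.get(nodes.get("/N1")));   // null
    }
}
{code}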
There are two ways to fix this:
# Make aclIndex consistent across deserialization (either by serializing it in the snapshot or by carefully decrementing it when cache entries are removed)
# Extend the ZOOKEEPER-3306 patch so that we also repoint the node's ACL to the new key when its previous entry no longer exists in the ACL table (sketched below)
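A sketch of what solution 2 could look like on the replay path (simplified, hypothetical names; the real code lives in `DataTree.createNode` and the ACL cache, and this is only a sketch of the proposed change, not the actual patch):

{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Solution2Sketch {
    static final Map<Long, List<String>> aclCache = new HashMap<>();
    static final Map<String, Long> nodeAclKey = new HashMap<>();  // path -> interned key
    static long aclIndex = 10;

    // Called when a replayed create finds the node already present.
    static void onReplayCreateExisting(String path, List<String> acl) {
        // Re-intern the ACL (simplified: always mint a new key), as ZOOKEEPER-3306 does.
        long newKey = ++aclIndex;
        aclCache.put(newKey, acl);
        // Proposed addition: repoint the node if its old key no longer resolves.
        Long currentKey = nodeAclKey.get(path);
        if (currentKey == null || !aclCache.containsKey(currentKey)) {
            nodeAclKey.put(path, newKey);
        }
    }

    public static void main(String[] args) {
        nodeAclKey.put("/N1", 12L);                      // dangling key from the snapshot
        onReplayCreateExisting("/N1", List.of("ACL_1"));
        System.out.println(nodeAclKey.get("/N1"));                // 11
        System.out.println(aclCache.get(nodeAclKey.get("/N1")));  // [ACL_1]
    }
}
{code}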
I think solution 2 is nicer, since the aclIndex inconsistency itself is not a
problem. With solution 1 we would still be implicitly depending on aclIndex
consistency and on the ordering of events, which is harder to reason about and
seems more fragile than solution 2.
I'm going to send a patch for solution 2, but please let me know if you
disagree and I'm happy to go with solution 1 instead.