[
https://issues.apache.org/jira/browse/ZOOKEEPER-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andor Molnar updated ZOOKEEPER-4846:
------------------------------------
Priority: Blocker (was: Major)
> Failure to reload database due to missing ACL
> ---------------------------------------------
>
> Key: ZOOKEEPER-4846
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4846
> Project: ZooKeeper
> Issue Type: Bug
> Reporter: Damien Diederen
> Assignee: Damien Diederen
> Priority: Blocker
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> ZooKeeper snapshots are _fuzzy_, as the server does not stop processing
> requests while ACLs and nodes are being streamed to disk.
> ACLs, notably, are streamed _first_, as a mapping between the full
> serialized ACL and an "ACL ID" referenced by the node.
> Consequently, a snapshot can very well contain ACL IDs which do not exist in
> the mapping. Prior to ZOOKEEPER-4799, such situations would produce harmless
> (if annoying) "Ignoring acl XYZ as it does not exist in the cache" INFO
> entries in the server logs.
> With ZOOKEEPER-4799, we started "eagerly" fetching the referenced ACLs in
> {{DataTree}} operations such as {{createNode}}, {{deleteNode}},
> etc., as opposed to just fetching them from request processors.
> This can result in fatal errors during the {{fastForwardFromEdits}} phase of
> restoring a database, when transactions are processed on top of an
> inconsistent data tree, and these errors prevent the server from starting.
> The errors are thrown in this code path:
> {code:java}
> // ReferenceCountedACLCache.java:90
> List<ACL> acls = longKeyMap.get(longVal);
> if (acls == null) {
>     LOG.error("ERROR: ACL not available for long {}", longVal);
>     throw new RuntimeException("Failed to fetch acls for " + longVal);
> }
> {code}
> Here is a scenario leading to such a failure:
> * An existing node {{/foo}}, sporting a unique ACL, is deleted. This is
> recorded in transaction log {{$SNAP-1}}; said ACL is also deallocated;
> * Snapshot {{$SNAP}} is started;
> * The ACL map is serialized to {{$SNAP}};
> * A new node {{/foo}} sporting the same unique ACL is created in a portion
> of the data tree which still has to be serialized;
> * Node {{/foo}} is serialized to {{$SNAP}}, but its ACL isn't;
> * The server is restarted;
> * The {{DataTree}} is initialized from {{$SNAP}}, including node
> {{/foo}} with a dangling ACL reference;
> * Transaction log {{$SNAP-1}} is replayed, leading to a
> {{deleteNode("/foo")}};
> * {{getACL(node)}} throws, preventing a successful restart.
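
The scenario above can be walked through with a small, self-contained toy model. This is only a sketch under stated assumptions: `FuzzySnapshotSketch`, `AclCache`, `runScenario`, and the ACL string `digest:alice:rw` are hypothetical names invented for illustration, not ZooKeeper classes or APIs. The maps merely mimic the ordering problem (ACL map captured first, nodes captured later) and the eager lookup that throws during replay.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these types are ZooKeeper classes.
class FuzzySnapshotSketch {

    /** Hypothetical stand-in for a reference-counted ACL cache. */
    static class AclCache {
        final Map<Long, String> idToAcl = new HashMap<>();
        final Map<String, Long> aclToId = new HashMap<>();
        final Map<Long, Integer> refCounts = new HashMap<>();
        long nextId = 1;

        /** Returns the ID for an ACL, allocating one on first use. */
        long acquire(String acl) {
            Long id = aclToId.get(acl);
            if (id == null) {
                id = nextId++;
                aclToId.put(acl, id);
                idToAcl.put(id, acl);
            }
            refCounts.merge(id, 1, Integer::sum);
            return id;
        }

        /** Drops one reference; deallocates the ACL when none remain. */
        void release(long id) {
            int remaining = refCounts.merge(id, -1, Integer::sum);
            if (remaining <= 0) {
                refCounts.remove(id);
                aclToId.remove(idToAcl.remove(id));
            }
        }

        /** Eager lookup: throws when the ID is unknown. */
        String fetch(long id) {
            String acl = idToAcl.get(id);
            if (acl == null) {
                throw new RuntimeException("Failed to fetch acls for " + id);
            }
            return acl;
        }
    }

    static String runScenario() {
        AclCache live = new AclCache();
        Map<String, Long> liveNodes = new HashMap<>();

        // /foo holds the only reference to its ACL.
        liveNodes.put("/foo", live.acquire("digest:alice:rw"));

        // Logged in $SNAP-1: /foo is deleted and its ACL is deallocated.
        live.release(liveNodes.remove("/foo"));

        // $SNAP starts: the ACL map is captured first (empty at this point).
        Map<Long, String> snapAcls = new HashMap<>(live.idToAcl);

        // Meanwhile, /foo is recreated with the same ACL.
        liveNodes.put("/foo", live.acquire("digest:alice:rw"));

        // The nodes are captured afterwards, including the new /foo.
        Map<String, Long> snapNodes = new HashMap<>(liveNodes);

        // Restart: rebuild the state from the snapshot contents only.
        AclCache restored = new AclCache();
        restored.idToAcl.putAll(snapAcls);
        Map<String, Long> restoredNodes = new HashMap<>(snapNodes);

        // Replay of the logged delete, with an eager ACL fetch.
        try {
            restored.fetch(restoredNodes.get("/foo"));
            restoredNodes.remove("/foo");
            return "restart ok";
        } catch (RuntimeException e) {
            return "restart failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(runScenario());
    }
}
```

In this toy model the snapshot's ACL map is empty while the snapshotted `/foo` refers to ID 2, so the replayed delete fails with "restart failed: Failed to fetch acls for 2". The ID allocation scheme here is an assumption of the sketch and may not match how the server assigns ACL IDs.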