[
https://issues.apache.org/jira/browse/ZOOKEEPER-5011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated ZOOKEEPER-5011:
--------------------------------------
Labels: pull-request-available (was: )
> FileTxnSnapLog throws uninformative NPE when accessed via stale reference
> after server restart
> ----------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-5011
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-5011
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.9.4
> Reporter: ConfX
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In scenarios involving server restarts or failovers, application code may
> inadvertently hold a stale reference to a `ZooKeeperServer` instance whose
> `FileTxnSnapLog` has been closed. When methods like `save()` or `restore()`
> are called through such stale references, the current implementation throws a
> raw `NullPointerException` with no message, making it extremely difficult to
> diagnose the root cause.
>
> The current behavior forces developers to trace through the code to
> understand that the NPE is caused by the resource being closed, rather than
> immediately indicating the problem.
> h2. To reproduce:
> 1. Application code obtains a reference to `ZooKeeperServer` (e.g., `server =
> factory.getZooKeeperServer()`)
> 2. Server undergoes a restart (graceful shutdown + new instance creation)
> 3. The old server's `FileTxnSnapLog.close()` is called during shutdown,
> setting internal fields to `null`
> 4. Application code still holds the stale reference to the old server
> 5. When `server.takeSnapshot()` is called, it invokes `FileTxnSnapLog.save()`
> on the closed instance
> 6. Result: Uninformative `NullPointerException` with no indication of the
> actual problem
> {code:java}
> java.lang.NullPointerException
> at
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:482)
> at
> org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:575)
> ... {code}
> When `close()` is called, it sets internal fields to `null` without any
> tracking:
> {code:java}
> // FileTxnSnapLog.java:623-634
> public void close() throws IOException {
> TxnLog txnLogToClose = txnLog;
> if (txnLogToClose != null) {
> txnLogToClose.close();
> }
> txnLog = null; // Set to null, no closed flag
> SnapShot snapSlogToClose = snapLog;
> if (snapSlogToClose != null) {
> snapSlogToClose.close();
> }
> snapLog = null; // Set to null, no closed flag
> }{code}
>
> Subsequently, methods like `save()` access these fields without null checks:
> {code:java}
> // FileTxnSnapLog.java:482
> snapLog.serialize(dataTree, sessionsWithTimeouts, snapshotFile, syncSnap);
> // NPE thrown here if snapLog is null {code}
> h2. Proposed Apporach:
> When a closed `FileTxnSnapLog` is accessed (regardless of the reason), it
> should throw an `IllegalStateException` with a clear message:
> {code:java}
> java.lang.IllegalStateException: FileTxnSnapLog has been closed
> at
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:XXX)
> ... {code}
> So we can add a `closed` flag and a helper method to provide informative
> error messages:
> {code:java}
> // ... existing fields ...
> private volatile boolean closed = false; /**
> * Throws IllegalStateException if this FileTxnSnapLog has been closed.
> * This helps diagnose issues with stale references after server restarts.
> */
> private void checkNotClosed() {
> if (closed) {
> throw new IllegalStateException(
> "FileTxnSnapLog has been closed. " +
> "This may indicate a stale reference to a stopped server
> instance.");
> }
> }
> public void close() throws IOException {
> if (closed) {
> return; // Already closed, make idempotent
> }
> closed = true; TxnLog txnLogToClose = txnLog;
> if (txnLogToClose != null) {
> txnLogToClose.close();
> }
> txnLog = null;
> SnapShot snapSlogToClose = snapLog;
> if (snapSlogToClose != null) {
> snapSlogToClose.close();
> }
> snapLog = null;
> }
> public File save(
> DataTree dataTree,
> ConcurrentHashMap<Long, Integer> sessionsWithTimeouts,
> boolean syncSnap) throws IOException {
> checkNotClosed(); // Add this check
> long lastZxid = dataTree.lastProcessedZxid;
> // ... rest of method unchanged
> } // Similar checkNotClosed() calls added to other public methods
> {code}
> I'm happy to send a PR for this issue.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)