ConfX created ZOOKEEPER-5011:
--------------------------------

             Summary: FileTxnSnapLog throws uninformative NPE when accessed via 
stale reference after server restart
                 Key: ZOOKEEPER-5011
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-5011
             Project: ZooKeeper
          Issue Type: Bug
          Components: server
    Affects Versions: 3.9.4
            Reporter: ConfX


In scenarios involving server restarts or failovers, application code may 
inadvertently hold a stale reference to a `ZooKeeperServer` instance whose 
`FileTxnSnapLog` has been closed. When methods like `save()` or `restore()` are 
called through such stale references, the current implementation throws a raw 
`NullPointerException` with no message, making it extremely difficult to 
diagnose the root cause.
 
The current behavior forces developers to trace through the code to understand 
that the NPE is caused by the resource being closed, rather than immediately 
indicating the problem.
h2. To reproduce:
1. Application code obtains a reference to `ZooKeeperServer` (e.g., `server = 
factory.getZooKeeperServer()`)
2. Server undergoes a restart (graceful shutdown + new instance creation)
3. The old server's `FileTxnSnapLog.close()` is called during shutdown, setting 
internal fields to `null`
4. Application code still holds the stale reference to the old server
5. When `server.takeSnapshot()` is called, it invokes `FileTxnSnapLog.save()` 
on the closed instance
6. Result: Uninformative `NullPointerException` with no indication of the 
actual problem
{code:java}
java.lang.NullPointerException
    at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:482)
    at 
org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:575)
    ... {code}
When `close()` is called, it sets internal fields to `null` without any 
tracking:
{code:java}
// FileTxnSnapLog.java:623-634
public void close() throws IOException {
    TxnLog txnLogToClose = txnLog;
    if (txnLogToClose != null) {
        txnLogToClose.close();
    }
    txnLog = null;  // Set to null, no closed flag
    SnapShot snapSlogToClose = snapLog;
    if (snapSlogToClose != null) {
        snapSlogToClose.close();
    }
    snapLog = null;  // Set to null, no closed flag
}{code}
 
Subsequently, methods like `save()` access these fields without null checks:
{code:java}
// FileTxnSnapLog.java:482
snapLog.serialize(dataTree, sessionsWithTimeouts, snapshotFile, syncSnap);
// NPE thrown here if snapLog is null {code}
h2. Proposed Apporach:
When a closed `FileTxnSnapLog` is accessed (regardless of the reason), it 
should throw an `IllegalStateException` with a clear message:
{code:java}
java.lang.IllegalStateException: FileTxnSnapLog has been closed
    at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:XXX)
    ... {code}
So we can add a `closed` flag and a helper method to provide informative error 
messages:
{code:java}
    // ... existing fields ...
    private volatile boolean closed = false;    /**
     * Throws IllegalStateException if this FileTxnSnapLog has been closed.
     * This helps diagnose issues with stale references after server restarts.
     */
    private void checkNotClosed() {
        if (closed) {
            throw new IllegalStateException(
                "FileTxnSnapLog has been closed. " +
                "This may indicate a stale reference to a stopped server 
instance.");
        }
    }    public void close() throws IOException {
        if (closed) {
            return;  // Already closed, make idempotent
        }
        closed = true;        TxnLog txnLogToClose = txnLog;
        if (txnLogToClose != null) {
            txnLogToClose.close();
        }
        txnLog = null;
        SnapShot snapSlogToClose = snapLog;
        if (snapSlogToClose != null) {
            snapSlogToClose.close();
        }
        snapLog = null;
    }    public File save(
        DataTree dataTree,
        ConcurrentHashMap<Long, Integer> sessionsWithTimeouts,
        boolean syncSnap) throws IOException {
        checkNotClosed();  // Add this check
        long lastZxid = dataTree.lastProcessedZxid;
        // ... rest of method unchanged
    }    // Similar checkNotClosed() calls added to other public methods {code}
I'm happy to send a PR for this issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to