ConfX created ZOOKEEPER-5011:
--------------------------------
Summary: FileTxnSnapLog throws uninformative NPE when accessed via
stale reference after server restart
Key: ZOOKEEPER-5011
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-5011
Project: ZooKeeper
Issue Type: Bug
Components: server
Affects Versions: 3.9.4
Reporter: ConfX
In scenarios involving server restarts or failovers, application code may
inadvertently hold a stale reference to a `ZooKeeperServer` instance whose
`FileTxnSnapLog` has been closed. When methods like `save()` or `restore()` are
called through such stale references, the current implementation throws a raw
`NullPointerException` with no message, making it extremely difficult to
diagnose the root cause.
The current behavior forces developers to trace through the code to understand
that the NPE is caused by the resource being closed, rather than immediately
indicating the problem.
h2. To reproduce:
1. Application code obtains a reference to `ZooKeeperServer` (e.g., `server =
factory.getZooKeeperServer()`)
2. Server undergoes a restart (graceful shutdown + new instance creation)
3. The old server's `FileTxnSnapLog.close()` is called during shutdown, setting
internal fields to `null`
4. Application code still holds the stale reference to the old server
5. When `server.takeSnapshot()` is called, it invokes `FileTxnSnapLog.save()`
on the closed instance
6. Result: Uninformative `NullPointerException` with no indication of the
actual problem
{code:java}
java.lang.NullPointerException
at
org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:482)
at
org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:575)
... {code}
When `close()` is called, it sets internal fields to `null` without any
tracking:
{code:java}
// FileTxnSnapLog.java:623-634
public void close() throws IOException {
TxnLog txnLogToClose = txnLog;
if (txnLogToClose != null) {
txnLogToClose.close();
}
txnLog = null; // Set to null, no closed flag
SnapShot snapSlogToClose = snapLog;
if (snapSlogToClose != null) {
snapSlogToClose.close();
}
snapLog = null; // Set to null, no closed flag
}{code}
Subsequently, methods like `save()` access these fields without null checks:
{code:java}
// FileTxnSnapLog.java:482
snapLog.serialize(dataTree, sessionsWithTimeouts, snapshotFile, syncSnap);
// NPE thrown here if snapLog is null {code}
h2. Proposed Apporach:
When a closed `FileTxnSnapLog` is accessed (regardless of the reason), it
should throw an `IllegalStateException` with a clear message:
{code:java}
java.lang.IllegalStateException: FileTxnSnapLog has been closed
at
org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:XXX)
... {code}
So we can add a `closed` flag and a helper method to provide informative error
messages:
{code:java}
// ... existing fields ...
private volatile boolean closed = false; /**
* Throws IllegalStateException if this FileTxnSnapLog has been closed.
* This helps diagnose issues with stale references after server restarts.
*/
private void checkNotClosed() {
if (closed) {
throw new IllegalStateException(
"FileTxnSnapLog has been closed. " +
"This may indicate a stale reference to a stopped server
instance.");
}
} public void close() throws IOException {
if (closed) {
return; // Already closed, make idempotent
}
closed = true; TxnLog txnLogToClose = txnLog;
if (txnLogToClose != null) {
txnLogToClose.close();
}
txnLog = null;
SnapShot snapSlogToClose = snapLog;
if (snapSlogToClose != null) {
snapSlogToClose.close();
}
snapLog = null;
} public File save(
DataTree dataTree,
ConcurrentHashMap<Long, Integer> sessionsWithTimeouts,
boolean syncSnap) throws IOException {
checkNotClosed(); // Add this check
long lastZxid = dataTree.lastProcessedZxid;
// ... rest of method unchanged
} // Similar checkNotClosed() calls added to other public methods {code}
I'm happy to send a PR for this issue.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)