chihsuan opened a new pull request, #10556:
URL: https://github.com/apache/ozone/pull/10556
## What changes were proposed in this pull request?
`TestDeadNodeHandler.testOnMessage` fails intermittently with either
`NullPointerException: Parent == null` (from `HealthyReadOnlyNodeHandler`) or
`AssertionFailedError: expected: <false> but was: <true>` (when asserting the
dead node was removed from the cluster network topology).
Root cause: the test uses a real SCM whose `NodeStateManager` runs a periodic
health check (`ozone.scm.heartbeat.thread.interval`, default 3s). The test
drives node health transitions manually via `setNodeHealthState`, which forces
a node to `DEAD` but does not age its last heartbeat. When the background
health check runs, it sees a `DEAD` node with a fresh heartbeat and resurrects
it (`DEAD -> HEALTHY_READONLY`), concurrently mutating the `NetworkTopology`.
This races with the handlers under test:
- `DeadNodeHandler` re-reads the node status before removing it from the
topology and skips removal when the node is no longer `DEAD`, so the node
stays in the topology and the `assertFalse(... contains ...)` fails.
- The concurrent topology add/remove trips the parent sanity check in
`HealthyReadOnlyNodeHandler.onMessage`, producing the NPE.
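
The first interleaving above can be replayed deterministically in a small standalone model. This is only a sketch and does not use the Ozone classes: `DeadNodeRaceSketch`, `Node`, `healthCheck`, `onDeadNode`, and the `DEAD_AFTER_MS` threshold are hypothetical names and values chosen for illustration, and the topology is reduced to a plain `Set`. It models only the skipped removal, not the parent sanity check that produces the NPE.

```java
import java.util.HashSet;
import java.util.Set;

public class DeadNodeRaceSketch {
  enum Health { HEALTHY, HEALTHY_READONLY, STALE, DEAD }

  // Hypothetical threshold: a heartbeat younger than this counts as fresh.
  static final long DEAD_AFTER_MS = 10_000;

  static class Node {
    Health health = Health.HEALTHY;
    final long lastHeartbeatMs;

    Node(long lastHeartbeatMs) {
      this.lastHeartbeatMs = lastHeartbeatMs;
    }
  }

  // Stand-in for the periodic health check: a DEAD node whose heartbeat is
  // still fresh is resurrected and put back into the topology.
  static void healthCheck(Node node, Set<Node> topology, long nowMs) {
    boolean fresh = nowMs - node.lastHeartbeatMs < DEAD_AFTER_MS;
    if (node.health == Health.DEAD && fresh) {
      node.health = Health.HEALTHY_READONLY;
      topology.add(node);
    }
  }

  // Stand-in for the dead-node handler: re-read the status and remove the
  // node from the topology only if it is still DEAD.
  static void onDeadNode(Node node, Set<Node> topology) {
    if (node.health == Health.DEAD) {
      topology.remove(node);
    }
  }

  // Replays the test scenario and reports whether the node is still in the
  // topology after the dead-node event has been handled.
  public static boolean inTopologyAfterDeadEvent(boolean healthCheckFires) {
    long now = 1_000;
    Node node = new Node(now);
    Set<Node> topology = new HashSet<>();
    topology.add(node);

    // Force the state without aging the heartbeat, as the test does.
    node.health = Health.DEAD;

    if (healthCheckFires) {
      // The background check runs 3s later, before the handler.
      healthCheck(node, topology, now + 3_000);
    }
    onDeadNode(node, topology);
    return topology.contains(node);
  }

  public static void main(String[] args) {
    System.out.println("check fires:    " + inTopologyAfterDeadEvent(true));
    System.out.println("check disabled: " + inTopologyAfterDeadEvent(false));
  }
}
```

In this model the node stays in the topology only when the health check runs between the forced `DEAD` transition and the handler, which is the interleaving that makes `assertFalse(... contains ...)` fail.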
The guards in the production handlers (introduced by HDDS-14834) are correct;
the problem is that the test does not isolate itself from the periodic health
check. The fix sets the heartbeat process interval high in `setup()` so the
background check does not fire during the test, matching the existing pattern
in this package of controlling the health check via configuration
(`TestSCMNodeManager`). The `@Flaky` tag is removed now that the root cause
is addressed.
## What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-14977
## How was this patch tested?
Ran `TestDeadNodeHandler#testOnMessage` 8 times in a row locally; all passed,
with the per-run time dropping from the previous 14-17s (under the race) to a
steady ~8s. The full `TestDeadNodeHandler` class also passes. Verified with
`checkstyle.sh`.