Re: [PR] HDDS-15514. DNS-refresh-on-failure for OM, SCM, DN RPC paths [ozone]

via GitHub Tue, 09 Jun 2026 09:40:45 -0700


kerneltime commented on PR #10470:
URL: https://github.com/apache/ozone/pull/10470#issuecomment-4661908169


   ## Self-review (Round 3): adversarial pass with a failure-injection lens, 
force-pushed at `774c8eb4`
   
   After Copilot caught a textbook resource-management bug that two of my prior 
adversarial review rounds had missed, I added a **failure-injection lens** to 
my review skill (every persona must walk every I/O seam and ask "what is the 
state if this throws right here?") and ran a Round 3 that included this lens. 
Round 3 found 3 real issues that all prior rounds plus Copilot missed, plus 3 
medium concerns. All addressed in this push.
   
   ### Round 3 findings → fixed in this push
   
   - **`OMProxyInfo.refreshAddressIfChanged` had a partial-mutation window 
inside the synchronized block.** The order was `rpcAddr = refreshed` then 
`dtService = SecurityUtil.buildTokenService(refreshed)` then `proxy = null`. If 
`buildTokenService` ever threw (rare but possible with a malformed-but-resolved 
address or a future strictness change), `rpcAddr` would already be on the new 
IP but `dtService` would still hold the old service identifier and `proxy` 
would still point at the old peer. Worse: subsequent refresh attempts would 
short-circuit at the equality check 
(`refreshed.getAddress().equals(rpcAddr.getAddress())`) because `rpcAddr` 
already matches — the entry would be permanently wedged in an inconsistent 
state until JVM restart. **Fix: compute `newDtService` outside the synchronized 
block; commit `rpcAddr`/`dtService`/`proxy=null` atomically inside.** Mirrors 
the build-first / swap-atomically discipline that the SCM-side path and 
Copilot's `refreshSCMServer` fix already en
 force.
   - **`OMFailoverProxyProviderBase.maybeRefreshCurrentOmAddress` performed DNS 
lookup under the provider monitor.** R1 moved the DNS call outside the 
per-instance `OMProxyInfo` monitor, but the *provider* monitor was still held 
across the lookup. A slow / dead resolver would freeze every concurrent 
`synchronized` provider operation (`performFailover`, `selectNextOmProxy`, 
`setNextOmProxy`, `getCurrentProxyOMNodeId`, etc.). Fix: read `nodeId` and 
`info` under the monitor, drop it for the DNS-doing 
`info.refreshAddressIfChanged()` call, re-acquire briefly to invoke 
`onAddressRefreshed`. Mirrors the SCM-side pattern in 
`SCMFailoverProxyProviderBase.refreshProxyAddressIfChanged` which got this 
right in R1.
   - **`testRefreshSuccessPinsCurrentNodeId` was vacuous.** With 
`currentProxyIndex` and `nextProxyIndex` both initialised to 0 from 
construction, `performFailover` was a no-op (`currentProxyIndex = 
nextProxyIndex = 0`), so the assertion `beforeNode == 
getCurrentProxyOMNodeId()` held regardless of whether the C2 pin code 
(`setNextOmProxy(omNodeId)` in `OMFailoverProxyProviderBase.shouldRetry`) ran. 
Fix: pre-advance `nextProxyIndex` by triggering a non-refresh `shouldRetry` 
first (with an `IOException` that is NOT a connection-class failure, so the 
refresh hook is not invoked); only then trigger the connection-class 
`shouldRetry`. The pin must now pull `nextProxyIndex` back to the original node 
— if the pin code is broken, this test fails.
   - **`EndpointStateMachine.close()` could leak its executor thread on close 
failure.** The original sequence was `endPoint.close(); 
executorService.shutdown();` — if `endPoint.close()` (which calls 
`RPC.stopProxy`) raised a `RuntimeException`, the executor was never shut down. 
The caller (`SCMConnectionManager.refreshSCMServer`) catches the 
RuntimeException, so the leak was silent. Fix: wrap in try/finally so 
`executorService.shutdown()` always runs.
   - **`HadoopRpcOMFollowerReadFailoverProxyProvider.getConnectionId` read 
`getCurrentProxy().proxy` directly,** bypassing the per-instance monitor. 
Pre-PR the inherited `proxy` field was effectively final (assigned only at 
construction). Post-PR, `OMProxyInfo.refreshAddressIfChanged` mutates it under 
`synchronized(this)`. A direct field read can observe a torn value. Fix: 
replace with `getCurrentProxy().getProxy()` (the synchronized accessor).
   - **`SCMConnectionManager.refreshSCMServer` had no guard against the 
freshly-resolved IP colliding with another already-registered SCM peer key.** 
Realistic scenario: kube-dns transiently maps SCM-A's hostname to SCM-B's 
current IP. Without a guard, `scmMachines.put(refreshed, replacement)` would 
silently overwrite SCM-B's `EndpointStateMachine`, leaking its executor thread 
and its task state. Fix: refuse the swap when `!refreshed.equals(oldAddress) && 
scmMachines.containsKey(refreshed)`; leave the stale endpoint in place and let 
the next heartbeat retry DNS.
   
   ### Why a Round 3 was needed
   
   After Copilot caught the `remove-then-add` resource bug at 
`SCMConnectionManager.refreshSCMServer:230` that two of my five-persona 
adversarial review rounds had missed, I diagnosed the gap: my personas were 
asking "is the success path correct?" and missing "what happens if the I/O call 
inside the critical section throws?" The lens was wrong, not the personas.
   
   I updated my review tooling (`/ritesh-view`'s persona prompts plus a new 
focused `/code-review-failure-injection` skill, plus my global CLAUDE.md 
`Resource Management and Failure Atomicity` and `Failure-injection lens for 
code review` sections) to make the failure-injection lens REQUIRED on every 
persona's review, with specific deliverables (failure-inventory tables, 
atomic-replace audit, I/O-under-lock audit, caller-surprise audit). Round 3 
then ran the same five personas plus the focused failure-injection sweep 
against the post-Copilot diff. The findings above are what the new lens caught.
   
   ### Test coverage
   
   All 86 tests in the touched-module sweep still pass:
   - `TestConnectionFailureUtils` (20)
   - `TestOMFailoverProxyProviderRefreshWired` (5, including the 
now-non-vacuous `testRefreshSuccessPinsCurrentNodeId`)
   - `TestHeartbeatEndpointTaskDnsRefresh` (6)
   - `TestSCMConnectionManager` (7, including the Copilot rollback regression)
   - `TestSCMConnectionManagerDnsRefreshE2E` (1)
   - `TestOMProxyInfoDnsRefresh` (4)
   - `TestSCMFailoverProxyProviderRefresh` (3)
   - `TestOzoneManagerRatisServer` (6)
   - Plus existing regressions: `TestEndPoint` (17), 
`TestOMFailoverProxyProvider` (8), `TestOMFailovers` (1), 
`TestHeartbeatEndpointTask` (8)
   
   ### Notes that remain in scope but unchanged
   
   - **Reverse-DNS attack-surface widening on IP-literal-configured Ratis 
peers.** R1 noted this; R3's Security persona confirmed the specific delta (the 
new `getRatisHostPortStr` call chains through `getHostName()` which can trigger 
PTR lookup). Defense-in-depth concern only — the existing mTLS layer with Ozone 
CA-signed certs constrains exploitability. Worth a follow-on JIRA to either 
prefer `getHostString()` (no PTR) or document IP-literal configs as unsupported 
with `resolve-needed=on`.
   - **Operator-facing documentation gap.** The new 
`ozone.client.failover.resolve-needed` and 
`ozone.datanode.scm.heartbeat.address.refresh.threshold` flags plus the 
`hadoop.security.token.service.use_ip=false` co-config requirement are 
documented inline in `ozone-default.xml` only — not in user-facing markdown 
under `hadoop-hdds/docs/` or in any deploy artifact. Worth a separate cleanup 
PR.
   - **Follower-read DNS refresh.** 
`HadoopRpcOMFollowerReadFailoverProxyProvider` invokes 
`refreshAddressIfChanged` only via the inherited shouldRetry → leader path. A 
follower OM that's been pod-rescheduled gets refreshed only when the failover 
provider's leader happens to be that same follower. Tactical — not a regression 
vs. pre-PR — but worth noting.
   
   ### Diff summary
   
   ```
   6 files changed, 104 insertions(+), 19 deletions(-)
   ```
   
   PR head: `774c8eb4d3991c5422a0fbfc315758036d3a1438` (was `642dd17a497`).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HDDS-15514. DNS-refresh-on-failure for OM, SCM, DN RPC paths [ozone]

Reply via email to