kerneltime commented on PR #10470: URL: https://github.com/apache/ozone/pull/10470#issuecomment-4661908169
## Self-review (Round 3): adversarial pass with a failure-injection lens, force-pushed at `774c8eb4` After Copilot caught a textbook resource-management bug that two of my prior adversarial review rounds had missed, I added a **failure-injection lens** to my review skill (every persona must walk every I/O seam and ask "what is the state if this throws right here?") and ran a Round 3 that included this lens. Round 3 found 3 real issues that all prior rounds plus Copilot missed, plus 3 medium concerns. All addressed in this push. ### Round 3 findings → fixed in this push - **`OMProxyInfo.refreshAddressIfChanged` had a partial-mutation window inside the synchronized block.** The order was `rpcAddr = refreshed` then `dtService = SecurityUtil.buildTokenService(refreshed)` then `proxy = null`. If `buildTokenService` ever threw (rare but possible with a malformed-but-resolved address or a future strictness change), `rpcAddr` would already be on the new IP but `dtService` would still hold the old service identifier and `proxy` would still point at the old peer. Worse: subsequent refresh attempts would short-circuit at the equality check (`refreshed.getAddress().equals(rpcAddr.getAddress())`) because `rpcAddr` already matches — the entry would be permanently wedged in an inconsistent state until JVM restart. **Fix: compute `newDtService` outside the synchronized block; commit `rpcAddr`/`dtService`/`proxy=null` atomically inside.** Mirrors the build-first / swap-atomically discipline that the SCM-side path and Copilot's `refreshSCMServer` fix already en force. - **`OMFailoverProxyProviderBase.maybeRefreshCurrentOmAddress` performed DNS lookup under the provider monitor.** R1 moved the DNS call outside the per-instance `OMProxyInfo` monitor, but the *provider* monitor was still held across the lookup. A slow / dead resolver would freeze every concurrent `synchronized` provider operation (`performFailover`, `selectNextOmProxy`, `setNextOmProxy`, `getCurrentProxyOMNodeId`, etc.). Fix: read `nodeId` and `info` under the monitor, drop it for the DNS-doing `info.refreshAddressIfChanged()` call, re-acquire briefly to invoke `onAddressRefreshed`. Mirrors the SCM-side pattern in `SCMFailoverProxyProviderBase.refreshProxyAddressIfChanged` which got this right in R1. - **`testRefreshSuccessPinsCurrentNodeId` was vacuous.** With `currentProxyIndex` and `nextProxyIndex` both initialised to 0 from construction, `performFailover` was a no-op (`currentProxyIndex = nextProxyIndex = 0`), so the assertion `beforeNode == getCurrentProxyOMNodeId()` held regardless of whether the C2 pin code (`setNextOmProxy(omNodeId)` in `OMFailoverProxyProviderBase.shouldRetry`) ran. Fix: pre-advance `nextProxyIndex` by triggering a non-refresh `shouldRetry` first (with an `IOException` that is NOT a connection-class failure, so the refresh hook is not invoked); only then trigger the connection-class `shouldRetry`. The pin must now pull `nextProxyIndex` back to the original node — if the pin code is broken, this test fails. - **`EndpointStateMachine.close()` could leak its executor thread on close failure.** The original sequence was `endPoint.close(); executorService.shutdown();` — if `endPoint.close()` (which calls `RPC.stopProxy`) raised a `RuntimeException`, the executor was never shut down. The caller (`SCMConnectionManager.refreshSCMServer`) catches the RuntimeException, so the leak was silent. Fix: wrap in try/finally so `executorService.shutdown()` always runs. - **`HadoopRpcOMFollowerReadFailoverProxyProvider.getConnectionId` read `getCurrentProxy().proxy` directly,** bypassing the per-instance monitor. Pre-PR the inherited `proxy` field was effectively final (assigned only at construction). Post-PR, `OMProxyInfo.refreshAddressIfChanged` mutates it under `synchronized(this)`. A direct field read can observe a torn value. Fix: replace with `getCurrentProxy().getProxy()` (the synchronized accessor). - **`SCMConnectionManager.refreshSCMServer` had no guard against the freshly-resolved IP colliding with another already-registered SCM peer key.** Realistic scenario: kube-dns transiently maps SCM-A's hostname to SCM-B's current IP. Without a guard, `scmMachines.put(refreshed, replacement)` would silently overwrite SCM-B's `EndpointStateMachine`, leaking its executor thread and its task state. Fix: refuse the swap when `!refreshed.equals(oldAddress) && scmMachines.containsKey(refreshed)`; leave the stale endpoint in place and let the next heartbeat retry DNS. ### Why a Round 3 was needed After Copilot caught the `remove-then-add` resource bug at `SCMConnectionManager.refreshSCMServer:230` that two of my five-persona adversarial review rounds had missed, I diagnosed the gap: my personas were asking "is the success path correct?" and missing "what happens if the I/O call inside the critical section throws?" The lens was wrong, not the personas. I updated my review tooling (`/ritesh-view`'s persona prompts plus a new focused `/code-review-failure-injection` skill, plus my global CLAUDE.md `Resource Management and Failure Atomicity` and `Failure-injection lens for code review` sections) to make the failure-injection lens REQUIRED on every persona's review, with specific deliverables (failure-inventory tables, atomic-replace audit, I/O-under-lock audit, caller-surprise audit). Round 3 then ran the same five personas plus the focused failure-injection sweep against the post-Copilot diff. The findings above are what the new lens caught. ### Test coverage All 86 tests in the touched-module sweep still pass: - `TestConnectionFailureUtils` (20) - `TestOMFailoverProxyProviderRefreshWired` (5, including the now-non-vacuous `testRefreshSuccessPinsCurrentNodeId`) - `TestHeartbeatEndpointTaskDnsRefresh` (6) - `TestSCMConnectionManager` (7, including the Copilot rollback regression) - `TestSCMConnectionManagerDnsRefreshE2E` (1) - `TestOMProxyInfoDnsRefresh` (4) - `TestSCMFailoverProxyProviderRefresh` (3) - `TestOzoneManagerRatisServer` (6) - Plus existing regressions: `TestEndPoint` (17), `TestOMFailoverProxyProvider` (8), `TestOMFailovers` (1), `TestHeartbeatEndpointTask` (8) ### Notes that remain in scope but unchanged - **Reverse-DNS attack-surface widening on IP-literal-configured Ratis peers.** R1 noted this; R3's Security persona confirmed the specific delta (the new `getRatisHostPortStr` call chains through `getHostName()` which can trigger PTR lookup). Defense-in-depth concern only — the existing mTLS layer with Ozone CA-signed certs constrains exploitability. Worth a follow-on JIRA to either prefer `getHostString()` (no PTR) or document IP-literal configs as unsupported with `resolve-needed=on`. - **Operator-facing documentation gap.** The new `ozone.client.failover.resolve-needed` and `ozone.datanode.scm.heartbeat.address.refresh.threshold` flags plus the `hadoop.security.token.service.use_ip=false` co-config requirement are documented inline in `ozone-default.xml` only — not in user-facing markdown under `hadoop-hdds/docs/` or in any deploy artifact. Worth a separate cleanup PR. - **Follower-read DNS refresh.** `HadoopRpcOMFollowerReadFailoverProxyProvider` invokes `refreshAddressIfChanged` only via the inherited shouldRetry → leader path. A follower OM that's been pod-rescheduled gets refreshed only when the failover provider's leader happens to be that same follower. Tactical — not a regression vs. pre-PR — but worth noting. ### Diff summary ``` 6 files changed, 104 insertions(+), 19 deletions(-) ``` PR head: `774c8eb4d3991c5422a0fbfc315758036d3a1438` (was `642dd17a497`). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
