Re: [PR] HDDS-15514. DNS-refresh-on-failure for OM, SCM, DN RPC paths [ozone]

via GitHub Tue, 09 Jun 2026 08:00:39 -0700


kerneltime commented on PR #10470:
URL: https://github.com/apache/ozone/pull/10470#issuecomment-4661084173


   ## Self-review (Round 2): adversarial pass over my own draft, force-pushed 
at `d7aa14e066`
   
   I ran a 5-persona adversarial review (architect / security / correctness / 
testing / integration) on the original `4770df45` draft, fixed everything they 
surfaced, then ran the same 5 personas again against the fixes. Round 2 found 3 
real bugs in my Round 1 fixes plus 1 documentation gap. All addressed in this 
push.
   
   ### Round 1 findings → addressed in earlier passes
   
   - **`SCMConnectionManager.refreshSCMServer` left `StateContext` per-endpoint 
queues keyed by the stale address.** Incremental container reports / container 
actions / pipeline actions queued under the old key were silently undeliverable 
after a swap. Fix: new `StateContext.migrateEndpoint(old, new)` re-keys all 
five maps; called from `HeartbeatEndpointTask.maybeRefreshScmAddress` after a 
successful swap.
   - **Connection-class filter excluded `SocketTimeoutException`** — the AWS 
EC2 / EKS silent-drop case the JIRA description names as load-bearing. Fix: 
extracted to a shared `org.apache.hadoop.hdds.utils.ConnectionFailureUtils` 
that also matches `EOFException` and `SocketException`, with a depth-bounded 
(16) cause-chain walk to defend against pathological cycles.
   - **`OMProxyInfo.refreshAddressIfChanged` leaked the stale RPC proxy.** 
Sister SCM path stopped it; OM path did not. Fix: capture old proxy reference, 
swap fields under monitor, call `RPC.stopProxy` outside the monitor (mirrors 
`SCMFailoverProxyProviderBase`).
   - **`HadoopRpcOMFailoverProxyProvider.delegationTokenService` was `final`,** 
computed once at construction. Fix: `volatile`, recomputed in a new 
`onAddressRefreshed(nodeId)` hook called from 
`OMFailoverProxyProviderBase.maybeRefreshCurrentOmAddress` on swap.
   - DNS lookup is now performed **outside** the entry monitor (a slow / dead 
resolver could otherwise freeze every concurrent reader); swap is committed 
under a re-check that defends against lost updates.
   
   ### Round 2 findings → fixed in this push
   
   - **`StateContext.migrateEndpoint` raced on the unsynchronized `endpoints` 
HashSet and `isFullReportReadyToBeSent` HashMap.** Producers 
(`addContainerAction`, `addIncrementalReport`, `addPipelineActionIfAbsent`) 
iterate `endpoints` under different monitors than the migration, and 
`addPipelineActionIfAbsent` iterates with no monitor at all. Concurrent 
migration could throw CME or NPE under K8s pod churn — turning a recovery 
feature into a stability regression. Fix: `endpoints` is now a 
`CopyOnWriteArraySet`, `isFullReportReadyToBeSent` a `ConcurrentHashMap`, and 
producers null-skip queues whose entries the migration removed. Migration 
ordering (producers' queues first, `endpoints` set last) ensures any thread 
that observes the new endpoint membership has already been able to observe its 
queues populated. Collision case (newEndpoint already keyed) does not merge 
foreign queues.
   - **OM-side "retry-same-proxy" fix didn't actually pin `nextProxyIndex`.** 
The Round 1 code returned `FAILOVER_AND_RETRY` after a successful refresh, but 
`RetryInvocationHandler.performFailover()` reads `nextProxyIndex` (which prior 
retries may have already advanced) and lands on the wrong OM. Fix: explicitly 
call `setNextOmProxy(currentOmNodeId)` before returning, mirroring the existing 
`OMLeaderNotReady` "retry same OM" pattern. New 
`TestOMFailoverProxyProviderRefreshWired.testRefreshSuccessPinsCurrentNodeId` 
proves the fix end-to-end.
   - **The migration test was vacuous** — `STALE_ADDR.equals(refreshed)` made 
`migrateEndpoint` short-circuit on the early-return, so the assertion that the 
new key existed in the metrics map passed regardless of whether migration ran. 
Fix: build two RESOLVED-but-distinct InetSocketAddress instances via 
`InetAddress.getByAddress(byte[])` (127.0.0.99 vs 127.0.0.1) and assert that 
the OLD key is GONE and the NEW key is PRESENT in both 
`incrementalReportsQueue` and `containerActions`.
   - **Delegation-token service identifier is broken across refresh** when 
`hadoop.security.token.service.use_ip=true` (Hadoop default). The token's 
frozen aggregated string contains old IPs; the post-refresh per-OM service 
query is built from the new IP, and the substring-match selector fails. This is 
the same prerequisite HADOOP-17068 carries. Documented loudly in 
`ozone-default.xml` on the `ozone.client.failover.resolve-needed` entry: secure 
clusters MUST also set `hadoop.security.token.service.use_ip=false` in 
`core-site.xml`. Comment in `HadoopRpcOMFailoverProxyProvider` corrected (the 
original claim that substring-match handled refresh was inverted).
   - **`HeartbeatEndpointTask` catch block now filters by connection class** 
before deciding to refresh. Previously it would refresh on ANY `IOException` 
once the threshold was met, including `AccessControlException` (peer reachable, 
application-layer rejection). Now uses 
`ConnectionFailureUtils.isConnectionFailure` — symmetric with the OM/SCM 
provider gates. New negative test asserts `AccessControlException` at threshold 
does NOT trigger refresh.
   - **`refreshSCMServer` now refuses to operate on passive (Recon) 
endpoints.** Today's protection is accidental (Recon's `EndpointStateMachine` 
ctor leaves `hostAndPort=null`); explicit `if (existing.isPassive()) return 
null;` guard prevents a future maintainer from accidentally downgrading Recon 
to SCM-protocol on swap.
   
   ### Test coverage (~85 tests, all green locally)
   
   | File | Count | Purpose |
   |---|---|---|
   | `TestConnectionFailureUtils` (new) | 20 | Filter classification: bare + 
IOException-wrapped, deep nesting, application-level rejects, length-2 cycle, 
1024-deep chain |
   | `TestOMFailoverProxyProviderRefreshWired` (new) | 5 | Wired retry path: 
`SocketTimeoutException` triggers refresh, `ConnectException` triggers refresh, 
`OMException` does NOT, flag-off does NOT, refresh-success pins current 
`nodeId` |
   | `TestHeartbeatEndpointTaskDnsRefresh` (new) | 6 | DN catch-block: flag-on 
+ threshold + connection-class fires; flag-off does NOT; below-threshold does 
NOT; null host:port does NOT; `AccessControlException` does NOT; StateContext 
queues actually rekey |
   | `TestSCMConnectionManagerDnsRefreshE2E` (this PR) | 1 | Real-RPC swap 
mechanism with `@Timeout(30)` |
   | `TestOMProxyInfoDnsRefresh` (this PR, expanded) | 4 | Per-instance 
refresh: no-op preserves proxy, swap on IP change, rebuilt proxy uses new 
address, dtService updates |
   | `TestSCMFailoverProxyProviderRefresh` (this PR) | 3 | SCM swap mechanism |
   | `TestSCMConnectionManager` (this PR) | 6 (1 prior + 5 new) | 
resolveLatestAddress edge cases, refreshSCMServer happy-path |
   | `TestOzoneManagerRatisServer` (this PR) | 6 (5 prior + 1 new) | Ratis 
address is hostname:port, never IP:port |
   | Existing regressions verified non-regressed | 34 | TestEndPoint (17), 
TestOMFailoverProxyProvider (8), TestOMFailovers (1), TestHeartbeatEndpointTask 
(8) |
   
   ### What's still out of scope for this PR
   
   - HDFS-14118-style construction-time DNS fan-out (a different problem; 
round-robin DNS for HA — worth a follow-on JIRA if Ozone deployments need it).
   - The Ratis quorum-loss exit-0 issue in `SCMStateMachine.close()` (filed 
separately).
   - Ratis hostname-only behavioural change for IP-literal `ozone.om.address` 
configs — kept conditional on operator config rather than gated by 
`resolve-needed`. Today's `NodeDetails.getRatisHostPortStr()` chains through 
`getHostName()` which can trigger reverse DNS for IP-literal-configured peers 
across some JDK vendors. If anyone with deeper Apache Ozone background thinks 
this needs a defensive IP-detection branch, happy to add one — flagging it 
explicitly because Round 2 surfaced it as a remaining concern.
   
   ### Diff summary
   
   ```
   25 files changed, 2260 insertions(+), 56 deletions(-)
   ```
   
   PR head: `d7aa14e066778a97d37a7171b582326083ab6218` (was `4770df45515`).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HDDS-15514. DNS-refresh-on-failure for OM, SCM, DN RPC paths [ozone]

Reply via email to