kerneltime commented on PR #10470: URL: https://github.com/apache/ozone/pull/10470#issuecomment-4661084173
## Self-review (Round 2): adversarial pass over my own draft, force-pushed at `d7aa14e066` I ran a 5-persona adversarial review (architect / security / correctness / testing / integration) on the original `4770df45` draft, fixed everything they surfaced, then ran the same 5 personas again against the fixes. Round 2 found 3 real bugs in my Round 1 fixes plus 1 documentation gap. All addressed in this push. ### Round 1 findings → addressed in earlier passes - **`SCMConnectionManager.refreshSCMServer` left `StateContext` per-endpoint queues keyed by the stale address.** Incremental container reports / container actions / pipeline actions queued under the old key were silently undeliverable after a swap. Fix: new `StateContext.migrateEndpoint(old, new)` re-keys all five maps; called from `HeartbeatEndpointTask.maybeRefreshScmAddress` after a successful swap. - **Connection-class filter excluded `SocketTimeoutException`** — the AWS EC2 / EKS silent-drop case the JIRA description names as load-bearing. Fix: extracted to a shared `org.apache.hadoop.hdds.utils.ConnectionFailureUtils` that also matches `EOFException` and `SocketException`, with a depth-bounded (16) cause-chain walk to defend against pathological cycles. - **`OMProxyInfo.refreshAddressIfChanged` leaked the stale RPC proxy.** Sister SCM path stopped it; OM path did not. Fix: capture old proxy reference, swap fields under monitor, call `RPC.stopProxy` outside the monitor (mirrors `SCMFailoverProxyProviderBase`). - **`HadoopRpcOMFailoverProxyProvider.delegationTokenService` was `final`,** computed once at construction. Fix: `volatile`, recomputed in a new `onAddressRefreshed(nodeId)` hook called from `OMFailoverProxyProviderBase.maybeRefreshCurrentOmAddress` on swap. - DNS lookup is now performed **outside** the entry monitor (a slow / dead resolver could otherwise freeze every concurrent reader); swap is committed under a re-check that defends against lost updates. ### Round 2 findings → fixed in this push - **`StateContext.migrateEndpoint` raced on the unsynchronized `endpoints` HashSet and `isFullReportReadyToBeSent` HashMap.** Producers (`addContainerAction`, `addIncrementalReport`, `addPipelineActionIfAbsent`) iterate `endpoints` under different monitors than the migration, and `addPipelineActionIfAbsent` iterates with no monitor at all. Concurrent migration could throw CME or NPE under K8s pod churn — turning a recovery feature into a stability regression. Fix: `endpoints` is now a `CopyOnWriteArraySet`, `isFullReportReadyToBeSent` a `ConcurrentHashMap`, and producers null-skip queues whose entries the migration removed. Migration ordering (producers' queues first, `endpoints` set last) ensures any thread that observes the new endpoint membership has already been able to observe its queues populated. Collision case (newEndpoint already keyed) does not merge foreign queues. - **OM-side "retry-same-proxy" fix didn't actually pin `nextProxyIndex`.** The Round 1 code returned `FAILOVER_AND_RETRY` after a successful refresh, but `RetryInvocationHandler.performFailover()` reads `nextProxyIndex` (which prior retries may have already advanced) and lands on the wrong OM. Fix: explicitly call `setNextOmProxy(currentOmNodeId)` before returning, mirroring the existing `OMLeaderNotReady` "retry same OM" pattern. New `TestOMFailoverProxyProviderRefreshWired.testRefreshSuccessPinsCurrentNodeId` proves the fix end-to-end. - **The migration test was vacuous** — `STALE_ADDR.equals(refreshed)` made `migrateEndpoint` short-circuit on the early-return, so the assertion that the new key existed in the metrics map passed regardless of whether migration ran. Fix: build two RESOLVED-but-distinct InetSocketAddress instances via `InetAddress.getByAddress(byte[])` (127.0.0.99 vs 127.0.0.1) and assert that the OLD key is GONE and the NEW key is PRESENT in both `incrementalReportsQueue` and `containerActions`. - **Delegation-token service identifier is broken across refresh** when `hadoop.security.token.service.use_ip=true` (Hadoop default). The token's frozen aggregated string contains old IPs; the post-refresh per-OM service query is built from the new IP, and the substring-match selector fails. This is the same prerequisite HADOOP-17068 carries. Documented loudly in `ozone-default.xml` on the `ozone.client.failover.resolve-needed` entry: secure clusters MUST also set `hadoop.security.token.service.use_ip=false` in `core-site.xml`. Comment in `HadoopRpcOMFailoverProxyProvider` corrected (the original claim that substring-match handled refresh was inverted). - **`HeartbeatEndpointTask` catch block now filters by connection class** before deciding to refresh. Previously it would refresh on ANY `IOException` once the threshold was met, including `AccessControlException` (peer reachable, application-layer rejection). Now uses `ConnectionFailureUtils.isConnectionFailure` — symmetric with the OM/SCM provider gates. New negative test asserts `AccessControlException` at threshold does NOT trigger refresh. - **`refreshSCMServer` now refuses to operate on passive (Recon) endpoints.** Today's protection is accidental (Recon's `EndpointStateMachine` ctor leaves `hostAndPort=null`); explicit `if (existing.isPassive()) return null;` guard prevents a future maintainer from accidentally downgrading Recon to SCM-protocol on swap. ### Test coverage (~85 tests, all green locally) | File | Count | Purpose | |---|---|---| | `TestConnectionFailureUtils` (new) | 20 | Filter classification: bare + IOException-wrapped, deep nesting, application-level rejects, length-2 cycle, 1024-deep chain | | `TestOMFailoverProxyProviderRefreshWired` (new) | 5 | Wired retry path: `SocketTimeoutException` triggers refresh, `ConnectException` triggers refresh, `OMException` does NOT, flag-off does NOT, refresh-success pins current `nodeId` | | `TestHeartbeatEndpointTaskDnsRefresh` (new) | 6 | DN catch-block: flag-on + threshold + connection-class fires; flag-off does NOT; below-threshold does NOT; null host:port does NOT; `AccessControlException` does NOT; StateContext queues actually rekey | | `TestSCMConnectionManagerDnsRefreshE2E` (this PR) | 1 | Real-RPC swap mechanism with `@Timeout(30)` | | `TestOMProxyInfoDnsRefresh` (this PR, expanded) | 4 | Per-instance refresh: no-op preserves proxy, swap on IP change, rebuilt proxy uses new address, dtService updates | | `TestSCMFailoverProxyProviderRefresh` (this PR) | 3 | SCM swap mechanism | | `TestSCMConnectionManager` (this PR) | 6 (1 prior + 5 new) | resolveLatestAddress edge cases, refreshSCMServer happy-path | | `TestOzoneManagerRatisServer` (this PR) | 6 (5 prior + 1 new) | Ratis address is hostname:port, never IP:port | | Existing regressions verified non-regressed | 34 | TestEndPoint (17), TestOMFailoverProxyProvider (8), TestOMFailovers (1), TestHeartbeatEndpointTask (8) | ### What's still out of scope for this PR - HDFS-14118-style construction-time DNS fan-out (a different problem; round-robin DNS for HA — worth a follow-on JIRA if Ozone deployments need it). - The Ratis quorum-loss exit-0 issue in `SCMStateMachine.close()` (filed separately). - Ratis hostname-only behavioural change for IP-literal `ozone.om.address` configs — kept conditional on operator config rather than gated by `resolve-needed`. Today's `NodeDetails.getRatisHostPortStr()` chains through `getHostName()` which can trigger reverse DNS for IP-literal-configured peers across some JDK vendors. If anyone with deeper Apache Ozone background thinks this needs a defensive IP-detection branch, happy to add one — flagging it explicitly because Round 2 surfaced it as a remaining concern. ### Diff summary ``` 25 files changed, 2260 insertions(+), 56 deletions(-) ``` PR head: `d7aa14e066778a97d37a7171b582326083ab6218` (was `4770df45515`). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
