[PR] HDDS-15514. DNS refresh on connection failure for Client to OM [ozone]

via GitHub Wed, 10 Jun 2026 22:06:47 -0700


kerneltime opened a new pull request, #10486:
URL: https://github.com/apache/ozone/pull/10486


   ## What changes were proposed in this pull request?
   
   This is **PR 2 of 4** splitting 
[HDDS-15514](https://issues.apache.org/jira/browse/HDDS-15514) (originally 
proposed as a single ~160KB patch in #10473, split per @szetszwo's review 
feedback).
   
   This PR fixes the **Client → OM Hadoop-RPC path** and introduces the shared 
infrastructure (connection-failure classifier, opt-in flag) that the remaining 
proxy-provider PRs in this series will reuse.
   
   > **Stacked on [#10485](https://github.com/apache/ozone/pull/10485)** (PR-1 
of 4 — Ratis peer addresses as hostnames). Please review PR-1 first; that 
branch is the merge base for this PR. The diff against `master` includes PR-1's 
hunks until PR-1 lands.
   
   ## Why this matters
   
   `OMProxyInfo` — the per-OM peer record inside `OMFailoverProxyProvider` — 
constructs an `InetSocketAddress` from the configured `host:port` once at 
process start, then re-uses it for the proxy's lifetime. `InetSocketAddress` 
performs a one-shot DNS lookup at construction and **freezes** the resolved IP. 
When an OM pod is rescheduled in Kubernetes (or any environment where pod IPs 
change while DNS names remain stable), every subsequent client RPC dials the 
now-defunct IP forever. The failover ring walks past the broken peer to the 
next OM, but the broken peer never recovers without a process restart.
   
   Note: the gRPC OM-client variant (`GrpcOMFailoverProxyProvider`) already 
passes a placeholder `InetSocketAddress(0)` and lets gRPC's `NameResolver` 
re-resolve hostnames on its own schedule — so it is unaffected by this bug and 
unchanged in this PR.
   
   ## What this PR does
   
   ### 1. Shared infrastructure
   
   | Artifact | Purpose |
   |---|---|
   | `org.apache.hadoop.hdds.utils.ConnectionFailureUtils` | Classifies a 
`Throwable` (with cause-chain walk bounded to depth 16) as a connection-class 
failure or not. Connection-class types: `ConnectException`, 
`SocketTimeoutException`, `NoRouteToHostException`, `UnknownHostException`, 
`EOFException`, `SocketException`. Application-level errors (`OMException`, 
`OMNotLeaderException`, `AccessControlException`, `RetryAction`-coded 
responses) are NOT classified as connection failures, so DNS load is not 
amplified by logical failures. |
   | `ozone.client.failover.resolve-needed` | Single opt-in flag for the entire 
DNS-refresh-on-failure feature, defaulting **`false`** so existing non-K8s 
deployments see zero behaviour change. Mirrors 
`dfs.client.failover.resolve-needed` (HDFS-14118) and 
`hbase.resolve.hostnames.on.failure`. |
   
   The flag and classifier ship with this PR but are also used (in subsequent 
PRs) by the SCM proxy provider and the DN heartbeat task. Landing them with the 
OM PR follows @szetszwo's "OM first" guidance.
   
   ### 2. `OMProxyInfo` becomes refreshable
   
   | Change | Why |
   |---|---|
   | Preserves the original `host:port` string (`rpcAddrStr`) alongside the 
resolved `InetSocketAddress` | The string is the source of truth for 
re-resolution; the `InetSocketAddress` is now a derived cache. |
   | `refreshAddressIfChanged()` re-resolves `rpcAddrStr` outside the entry 
monitor, then atomically swaps the cached address, the derived `dtService`, and 
nulls the cached `proxy` under the monitor | DNS lookup must not happen under 
any lock (a slow/dead resolver would freeze concurrent readers). Building 
`dtService` first means a `SecurityUtil.buildTokenService` throw (rare but 
possible) aborts cleanly with no half-swap. |
   | Old proxy is stopped via `RPC.stopProxy(staleProxy)` **outside** the 
monitor | Avoid blocking concurrent readers behind socket teardown. |
   | `setCachedAddressForTest` (`@VisibleForTesting`) | Test seam to inject a 
deliberately stale cached address without a real OM peer. |
   
   ### 3. `OMFailoverProxyProviderBase.shouldRetry` calls the refresh on 
connection-class exceptions
   
   After the existing `shouldFailover` filter and before the failover-index 
advance, when the flag is enabled and 
`ConnectionFailureUtils.isConnectionFailure(exception)` matches, the provider:
   
   1. Calls `OMProxyInfo.refreshAddressIfChanged()` for the current peer.
   2. On a successful refresh, returns `FAILOVER_AND_RETRY` but **pins 
`nextProxyIndex` to the current node** so 
`RetryInvocationHandler.performFailover()` does NOT advance past the 
just-refreshed peer. Without this pin, an N-peer HA cluster would skip the 
now-fixed peer for up to N-1 attempts.
   
   ### 4. `HadoopRpcOMFailoverProxyProvider` wiring
   
   Pass the preserved hostname string into `OMProxyInfo` at construction. 
`HadoopRpcOMFollowerReadFailoverProxyProvider` adjusts to match.
   
   ## Atomic-replace pattern
   
   Every refresh path in this series follows the same ordering: **build the new 
resource → atomically install it → close the old resource**. Never 
remove-then-build (a build failure would leave the system without a peer). 
Never close-then-build (same). For `OMProxyInfo.refreshAddressIfChanged`:
   
   1. Re-resolve `rpcAddrStr` (no lock).
   2. Build the new `dtService` (no lock).
   3. Capture the stale proxy reference, swap the address/dtService/proxy=null 
fields under the entry monitor.
   4. Stop the stale proxy outside the monitor.
   
   ## Secure-cluster prerequisite
   
   When `ozone.client.failover.resolve-needed=true` is enabled on a 
Kerberos-secured cluster, operators **must** also set 
`hadoop.security.token.service.use_ip=false` in `core-site.xml`. The Hadoop 
delegation-token service identifier defaults to an `IP:port` string; after a 
refresh, the per-OM service identifier built from the new IP no longer matches 
the IP-based service captured on long-lived tokens, and token selection 
silently fails for the refreshed peer. With `use_ip=false` the service 
identifier is the stable `hostname:port`, which survives any IP change. Same 
prerequisite HADOOP-17068 carries. Documented inline in `ozone-default.xml`.
   
   ## How was this patch tested?
   
   | Test class | Count | Coverage |
   |---|---|---|
   | `TestConnectionFailureUtils` (new) | 20 | Filter classification: bare 
types, `IOException`-wrapped, deeply nested cause chains (3 levels), 
application-level negative cases, length-2 cause-chain cycles (terminates), 
1024-deep non-matching chains (cost bound). |
   | `TestOMProxyInfoDnsRefresh` (new) | 4 | Per-instance refresh: no-op 
preserves cached proxy, swap on IP change, rebuilt proxy is dialed with the 
freshly-resolved address (not a sentinel), `dtService` updates. Uses the 
`@VisibleForTesting` setter rather than reflection. |
   | `TestOMFailoverProxyProviderRefreshWired` (new) | 5 | End-to-end retry 
path. `SocketTimeoutException` triggers `maybeRefreshCurrentOmAddress` (the AWS 
EC2 silent-drop case end-to-end). `ConnectException` triggers refresh. 
`OMException` does NOT. Flag-off does NOT. After a successful refresh, 
`performFailover` stays on the same `nodeId`. |
   
   Existing regression suites confirmed non-regressed: 
`TestOMFailoverProxyProvider` (8), `TestOMFailovers` (1).
   
   ## Scope of this PR
   
   - **Includes:** Client → OM Hadoop RPC. Includes the shared 
`ConnectionFailureUtils` and the `ozone.client.failover.resolve-needed` flag.
   - **Free transitive benefit:** OM ↔ OM control-plane RPC 
(`OMInterServiceProtocol`) rides on the same `OMFailoverProxyProvider`, so 
OM-to-OM Hadoop-RPC recovery is also fixed by this PR.
   - **Out of scope (later PRs):**
     - **PR-3:** OM → SCM (`SCMFailoverProxyProviderBase`, `SCMProxyInfo`). 
Same shape, different provider.
     - **PR-4:** DN → SCM heartbeat (`EndpointStateMachine`, 
`SCMConnectionManager`, `StateContext`). Brings the 
`ozone.datanode.scm.heartbeat.address.refresh.threshold` knob.
   
   ## What is the link to the Apache JIRA?
   
   https://issues.apache.org/jira/browse/HDDS-15514
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] HDDS-15514. DNS refresh on connection failure for Client to OM [ozone]

Reply via email to