[PR] HDDS-15514. DNS-refresh-on-failure for OM, SCM, DN RPC paths [ozone]

via GitHub Tue, 09 Jun 2026 12:12:07 -0700


kerneltime opened a new pull request, #10473:
URL: https://github.com/apache/ozone/pull/10473


   ## What changes were proposed in this pull request?
   
   This PR addresses 
[HDDS-15514](https://issues.apache.org/jira/browse/HDDS-15514): Ozone 
DataNodes, OzoneManagers, and clients cannot recover from a peer's IP changing 
under a stable hostname. The cached `InetSocketAddress` is frozen for the 
lifetime of the process, so when an SCM/OM pod is rescheduled in Kubernetes (or 
any environment where pod IPs change while DNS names remain stable), every 
subsequent RPC dials the now-defunct IP forever. Recovery today requires a 
process restart or an external operator that watches pod IPs and force-restarts 
dependent components.
   
   This is the same class of bug HADOOP-17068 fixed for the HDFS NameNode HA 
client. This PR mirrors that pattern at the `FailoverProxyProvider` / 
`EndpointStateMachine` layer in Ozone — one tier above where Hadoop applied the 
fix, because Ozone's RPC seams live there — for the four inter-component 
Hadoop-RPC paths, and removes the IP-baking from the two Ratis paths so gRPC's 
`DnsNameResolver` can re-resolve hostnames on its own.
   
   The new behaviour is **opt-in**, default `false`, gated by 
`ozone.client.failover.resolve-needed`. Existing non-K8s deployments see zero 
behaviour change. This matches the precedent set by:
   - `dfs.client.failover.resolve-needed` (HADOOP-17068 / HDFS-14118)
   - `hbase.resolve.hostnames.on.failure` (HBase `ConnectionImplementation`)
   
   ## Solution overview
   
   **The root cause is `InetSocketAddress`.** Every Ozone RPC path constructs 
an `InetSocketAddress` once (typically at process start) and reuses it across 
the lifetime of the proxy. `InetSocketAddress` performs a one-shot DNS lookup 
at construction and freezes the resolved IP. When the IP behind the hostname 
changes, no amount of retrying helps — the cached object always dials the old 
IP.
   
   **The fix has two ingredients:**
   
   1. **Preserve the original `host:port` string.** Every site that constructs 
an `InetSocketAddress` from a configured value now also stores the configured 
string. The string is the source of truth for re-resolution; the 
`InetSocketAddress` is a derived cache.
   
   2. **Re-resolve and rebuild on connection-class failures.** When an RPC 
fails with an exception that suggests the cached IP is unreachable 
(`ConnectException`, `SocketTimeoutException`, `NoRouteToHostException`, 
`UnknownHostException`, `EOFException`, `SocketException`), the failover layer 
re-resolves the preserved hostname. If DNS now returns a different IP, the 
cached `InetSocketAddress` is replaced atomically and the underlying RPC proxy 
is rebuilt against the new IP.
   
   The Ratis paths get a different fix: instead of refresh-on-failure, they 
pass hostname strings (not resolved `InetSocketAddress` instances) to 
`RaftPeer.setAddress`. Ratis hands these strings to gRPC, whose default 
`NameResolver` re-resolves on its own schedule.
   
   ## Per-path mechanism
   
   | Path | Mechanism |
   |---|---|
   | **DN → SCM heartbeat** | `EndpointStateMachine` preserves a `hostAndPort` 
string. After a connection-class `IOException` from `sendHeartbeat`, if the 
DN's missed-heartbeat counter has reached 
`ozone.datanode.scm.heartbeat.address.refresh.threshold` (default 3), 
`HeartbeatEndpointTask` asks `SCMConnectionManager.refreshSCMServer` to 
re-resolve. On a successful swap, the connection manager builds a fresh 
`EndpointStateMachine` bound to the new IP **before** removing the stale entry 
(atomic-replace pattern), commits the swap under the write lock, releases the 
lock, and closes the stale endpoint outside the lock. 
`StateContext.migrateEndpoint` then re-keys the per-endpoint queues 
(incremental reports, container actions, pipeline actions) so in-flight reports 
survive the swap. |
   | **OM → SCM** | `SCMProxyInfo` retains the config-time `host:port`. 
`SCMFailoverProxyProviderBase.refreshProxyAddressIfChanged(nodeId)` runs in 
`shouldRetry` when the exception chain contains a connection-class type. The 
DNS lookup happens **outside** the provider monitor; the swap is committed 
under the lock with a re-check that defends against lost updates. The stale 
proxy is stopped via `RPC.stopProxy` outside the lock. |
   | **Client → OM (Hadoop RPC)** | `OMProxyInfo.refreshAddressIfChanged()` 
re-resolves outside the entry monitor, swaps `rpcAddr` / `dtService` / 
`proxy=null` atomically inside, and stops the stale proxy outside. 
`OMFailoverProxyProviderBase.shouldRetry` invokes the refresh on 
connection-class exceptions; on successful refresh, it pins `nextProxyIndex` to 
the current node so `RetryInvocationHandler.performFailover()` does NOT advance 
the failover ring past the just-refreshed peer. |
   | **Client → OM (gRPC)** | No code change. `GrpcOMFailoverProxyProvider` 
passes a placeholder `InetSocketAddress(0)` and lets gRPC's `NameResolver` 
re-resolve hostnames on its own schedule. |
   | **OM ↔ OM control plane** | Uses Hadoop RPC via `OMInterServiceProtocol`. 
Recovers transitively via the Client → OM machinery above. |
   | **OM ↔ OM Ratis replication** | `OzoneManagerRatisServer.createRaftPeer` 
always passes `omNode.getRatisHostPortStr()` (a hostname string) to 
`RaftPeer.setAddress` — never a resolved IP. Two of three pre-PR 
`createRaftPeer` branches called `new 
InetSocketAddress(omNode.getInetAddress(), ratisPort)`, which baked the 
resolved IP into Ratis's peer state. With hostname-only addresses, gRPC's 
default `DnsNameResolver` (used by Ratis under the hood) re-resolves on 
connection failure. **No Ratis upstream change required.** |
   | **SCM ↔ SCM Ratis replication** | Already used hostname strings; PR 
removes a misleading `// TODO : Should we use IP instead of hostname??` comment 
in `SCMRatisServerImpl.buildRaftGroup` and `SCMHAManagerImpl` and replaces with 
explanatory comments. |
   
   ### Connection-class exception filter
   
   Re-resolution is gated on exception types where DNS could plausibly help, 
classified by 
`org.apache.hadoop.hdds.utils.ConnectionFailureUtils.isConnectionFailure`:
   
   - `java.net.ConnectException` — connection refused / unreachable (OpenStack 
fast-RST)
   - `java.net.SocketTimeoutException` — silent packet drop (the dominant 
failure shape on AWS EC2 / EKS)
   - `java.net.NoRouteToHostException` — host route gone
   - `java.net.UnknownHostException` — DNS lookup failed downstream
   - `java.io.EOFException` — load balancer or iptables RST closed the 
connection cleanly
   - `java.net.SocketException` — RST mid-stream
   
   The classifier walks the cause chain bounded to depth 16 to defend against 
pathological cycles. Application-level errors (`OMException`, 
`OMNotLeaderException`, `AccessControlException`, `RetryAction`-coded 
responses) are explicitly excluded so DNS load is not amplified by logical 
failures.
   
   ## Design choices and rationale
   
   ### Atomic-replace pattern at every swap site
   
   Every refresh path follows the same ordering: **build the new resource → 
atomically install it → close the old resource**. Never remove-then-build (a 
build failure would leave the registry empty for that key). Never 
close-then-build (a build failure would leave the system with neither old nor 
new resource). The pattern is visible in:
   
   - `SCMConnectionManager.refreshSCMServer`: build replacement endpoint before 
swapping `scmMachines`.
   - `OMProxyInfo.refreshAddressIfChanged`: capture stale proxy reference, swap 
fields under monitor, call `RPC.stopProxy(stale)` outside the monitor.
   - `SCMFailoverProxyProviderBase.refreshProxyAddressIfChanged`: build new 
`SCMProxyInfo` outside lock, re-check under lock, swap, stop stale outside lock.
   
   ### No I/O inside critical sections
   
   DNS lookups, RPC proxy construction, and proxy teardown are all blocking. 
None happen inside `synchronized` monitors or `writeLock()` regions in the new 
code. `SCMConnectionManager.refreshSCMServer` is split into 4 phases:
   
   1. Read lock: snapshot the endpoint reference and preserved `hostAndPort`.
   2. No lock: DNS lookup via `resolveLatestAddress()`.
   3. Write lock: re-check the snapshot is current (defends against concurrent 
removeSCMServer/refresh races), enforce collision invariant, build replacement, 
commit swap.
   4. No lock: close stale endpoint.
   
   This pattern prevents a slow / dead resolver from stalling unrelated 
heartbeats and reconfiguration paths.
   
   ### Failover ring pinning after refresh-success
   
   When the failover provider successfully refreshes the current peer's IP, 
`shouldRetry` returns `FAILOVER_AND_RETRY` but explicitly pins `nextProxyIndex` 
(OM) / `updatedLeaderNodeID` (SCM) to the current peer. Without this, the 
surrounding `RetryInvocationHandler.performFailover` would walk to the next 
peer in the ring, bypassing the just-fixed peer for up to N-1 attempts in an 
N-peer HA cluster.
   
   ### Endpoint queue migration preserves the "endpoint ⇒ queue" invariant
   
   `StateContext.migrateEndpoint` re-keys per-endpoint queues (incremental 
reports, container actions, pipeline actions, full-report flags) when an 
endpoint is swapped. The migration ordering preserves the invariant that every 
endpoint in the `endpoints` set has a corresponding queue at every observable 
point:
   
   1. **PUBLISH**: install new-key queues alongside the old-key queues.
   2. **SWITCH**: add `newEndpoint` to the endpoints set; remove `oldEndpoint` 
from the endpoints set.
   3. **RETIRE**: drop the old-key queues (no producer can reach them after 
step 2).
   
   Without this ordering, a producer iterating the `endpoints` set could see 
the old endpoint key, look up its queue, find it removed, and silently drop the 
report. `endpoints` is now a `CopyOnWriteArraySet` (was `HashSet`); 
`incrementalReportsQueue`, `containerActions`, `pipelineActions`, and 
`isFullReportReadyToBeSent` are now `ConcurrentHashMap` (or already were). 
Producers also null-skip queue lookups as defense-in-depth.
   
   ### Collision handling
   
   If the freshly-resolved IP collides with another already-registered SCM peer 
key (e.g., transient kube-dns returning peer-B's IP for peer-A's hostname), 
`refreshSCMServer` refuses the swap rather than overwriting the colliding peer. 
The stale endpoint stays registered and the next heartbeat retries DNS.
   
   ### Secure-cluster prerequisite
   
   When `ozone.client.failover.resolve-needed=true` is enabled on a 
Kerberos-secured cluster, operators MUST also set 
`hadoop.security.token.service.use_ip=false` in `core-site.xml`. The Hadoop 
delegation-token service identifier defaults to an `IP:port` string; after a 
refresh, the per-OM service identifier built from the new IP no longer matches 
the IP-based service captured on long-lived tokens, and token selection 
silently fails for the refreshed peer. With `use_ip=false` the service 
identifier is the stable `hostname:port`, which survives any IP change. This is 
the same prerequisite HADOOP-17068 carries. Documented inline in 
`ozone-default.xml` on the `ozone.client.failover.resolve-needed` entry.
   
   ## How was this patch tested?
   
   86 tests across the touched modules, all passing under `mvn clean test`:
   
   | Test class | Count | Coverage |
   |---|---|---|
   | `TestConnectionFailureUtils` (new) | 20 | Filter classification: bare + 
`IOException`-wrapped, deeply nested chains (3 levels), application-level 
negative cases, length-2 cause-chain cycles, 1024-deep non-matching chains 
(cost bound). |
   | `TestOMFailoverProxyProviderRefreshWired` (new) | 5 | Wired retry path. 
`SocketTimeoutException` triggers `maybeRefreshCurrentOmAddress` (the EC2 
silent-drop case end-to-end, not just helper-in-isolation). `ConnectException` 
triggers refresh. `OMException` does NOT. Flag-off does NOT. After a successful 
refresh, `performFailover` stays on the same `nodeId`. |
   | `TestHeartbeatEndpointTaskDnsRefresh` (new) | 6 | Production trigger 
chain. `HeartbeatEndpointTask.call()` catch block fires `refreshSCMServer` only 
when (a) flag enabled, (b) threshold met, (c) cause is connection-class, (d) 
`host:port` preserved. `AccessControlException` at threshold does NOT trigger. 
After a successful swap, `StateContext`'s incremental-reports map has the new 
key and not the old key. |
   | `TestSCMConnectionManagerDnsRefreshE2E` (new) | 1 | Real-RPC swap 
mechanism with `@Timeout(30)`. Stands up a real `ScmTestMock` RPC server, 
primes the connection manager with a stale `127.0.0.99` cache + preserved 
hostname, calls `refreshSCMServer`, asserts a real `sendHeartbeat` round-trips 
through the swapped endpoint. |
   | `TestOMProxyInfoDnsRefresh` (new, expanded) | 4 | Per-instance refresh: 
no-op preserves the cached proxy, swap on IP change, rebuilt proxy is dialed 
with the freshly-resolved address (not a sentinel), `dtService` updates. |
   | `TestSCMFailoverProxyProviderRefresh` (new) | 3 | SCM swap mechanism: swap 
on IP change, no-op when unchanged, no-op without preserved `host:port`. |
   | `TestSCMConnectionManager` (extended) | 7 (1 prior + 6 new) | 
`resolveLatestAddress` edge cases. `refreshSCMServer` happy-path swap. No-op 
when `host:port` not preserved. Rollback regression: when `buildScmEndpoint` 
throws, the stale endpoint remains registered (uses a `@VisibleForTesting` 
overridable hook to inject the failure). |
   | `TestOzoneManagerRatisServer` (extended) | 6 (5 prior + 1 new) | 
`RaftPeer.getAddress()` is a `hostname:port` string, never an IP-format string. 
|
   
   Existing regression suites verified non-regressed: `TestEndPoint` (17), 
`TestOMFailoverProxyProvider` (8), `TestOMFailovers` (1), 
`TestHeartbeatEndpointTask` (8).
   
   ## Scope and known limitations
   
   - **DN initial bringup with stale DNS**: the refresh fires from the 
`HEARTBEAT` phase via `HeartbeatEndpointTask`. If a DataNode starts up with the 
SCM peer already at a stale IP and never reaches `HEARTBEAT`, the recovery path 
does not engage. Initial-bringup DNS staleness is the existing concern of 
[HDDS-5919](https://issues.apache.org/jira/browse/HDDS-5919)'s 
`ozone.network.jvm.address.cache.enabled=false`. `InitDatanodeState.java` 
already postpones initialization on initial-resolution failure.
   - **HDFS-14118-style construction-time DNS fan-out** (one hostname → 
multiple persistent IPs, for round-robin DNS HA) is a different problem and out 
of scope. Worth a follow-on JIRA if needed.
   - **Ratis quorum-loss exit-0** in `SCMStateMachine.close()` (calls 
`ExitUtils.terminate(0, ...)` when leader election fails to converge, leading 
to Kubernetes CrashLoopBackOff death spirals) is a separate concern worth its 
own JIRA.
   - **Reverse-DNS edge case**: when `ozone.om.address` is configured as an IP 
literal (rather than a hostname), `NodeDetails.getRatisHostPortStr()` chains 
through `getHostName()` which can trigger PTR lookup on some JDKs. Operators 
are expected to configure SCM/OM addresses as DNS names in K8s deployments 
where this PR's behaviour is enabled.
   - **Operator-facing markdown documentation** for the new flags lives only in 
`ozone-default.xml`. Worth a separate docs PR.
   
   ## What is the link to the Apache JIRA?
   
   https://issues.apache.org/jira/browse/HDDS-15514
   
   ## References
   
   - [HADOOP-17068](https://issues.apache.org/jira/browse/HADOOP-17068): client 
fails forever when namenode ipaddr changed (Hadoop 3.4.0). The model for this 
PR.
   - [HDFS-14118](https://issues.apache.org/jira/browse/HDFS-14118): introduces 
`dfs.client.failover.resolve-needed` and the 
`AbstractNNFailoverProxyProvider.getResolvedAddressesIfNecessary` hook.
   - HBase: `hbase.resolve.hostnames.on.failure` 
(`ConnectionImplementation.RESOLVE_HOSTNAME_ON_FAIL_KEY`).
   - ZOOKEEPER-1506, ZOOKEEPER-2982: ZK `StaticHostProvider` re-resolves on 
each `next()` call.
   - [HDDS-5919](https://issues.apache.org/jira/browse/HDDS-5919): introduces 
`ozone.network.jvm.address.cache.enabled` (default `true`). JVM-level DNS cache 
TTL — necessary but not sufficient for the long-lived `InetSocketAddress` 
problem this PR fixes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] HDDS-15514. DNS-refresh-on-failure for OM, SCM, DN RPC paths [ozone]

Reply via email to