kerneltime opened a new pull request, #10473: URL: https://github.com/apache/ozone/pull/10473
## What changes were proposed in this pull request? This PR addresses [HDDS-15514](https://issues.apache.org/jira/browse/HDDS-15514): Ozone DataNodes, OzoneManagers, and clients cannot recover from a peer's IP changing under a stable hostname. The cached `InetSocketAddress` is frozen for the lifetime of the process, so when an SCM/OM pod is rescheduled in Kubernetes (or any environment where pod IPs change while DNS names remain stable), every subsequent RPC dials the now-defunct IP forever. Recovery today requires a process restart or an external operator that watches pod IPs and force-restarts dependent components. This is the same class of bug HADOOP-17068 fixed for the HDFS NameNode HA client. This PR mirrors that pattern at the `FailoverProxyProvider` / `EndpointStateMachine` layer in Ozone — one tier above where Hadoop applied the fix, because Ozone's RPC seams live there — for the four inter-component Hadoop-RPC paths, and removes the IP-baking from the two Ratis paths so gRPC's `DnsNameResolver` can re-resolve hostnames on its own. The new behaviour is **opt-in**, default `false`, gated by `ozone.client.failover.resolve-needed`. Existing non-K8s deployments see zero behaviour change. This matches the precedent set by: - `dfs.client.failover.resolve-needed` (HADOOP-17068 / HDFS-14118) - `hbase.resolve.hostnames.on.failure` (HBase `ConnectionImplementation`) ## Solution overview **The root cause is `InetSocketAddress`.** Every Ozone RPC path constructs an `InetSocketAddress` once (typically at process start) and reuses it across the lifetime of the proxy. `InetSocketAddress` performs a one-shot DNS lookup at construction and freezes the resolved IP. When the IP behind the hostname changes, no amount of retrying helps — the cached object always dials the old IP. **The fix has two ingredients:** 1. **Preserve the original `host:port` string.** Every site that constructs an `InetSocketAddress` from a configured value now also stores the configured string. The string is the source of truth for re-resolution; the `InetSocketAddress` is a derived cache. 2. **Re-resolve and rebuild on connection-class failures.** When an RPC fails with an exception that suggests the cached IP is unreachable (`ConnectException`, `SocketTimeoutException`, `NoRouteToHostException`, `UnknownHostException`, `EOFException`, `SocketException`), the failover layer re-resolves the preserved hostname. If DNS now returns a different IP, the cached `InetSocketAddress` is replaced atomically and the underlying RPC proxy is rebuilt against the new IP. The Ratis paths get a different fix: instead of refresh-on-failure, they pass hostname strings (not resolved `InetSocketAddress` instances) to `RaftPeer.setAddress`. Ratis hands these strings to gRPC, whose default `NameResolver` re-resolves on its own schedule. ## Per-path mechanism | Path | Mechanism | |---|---| | **DN → SCM heartbeat** | `EndpointStateMachine` preserves a `hostAndPort` string. After a connection-class `IOException` from `sendHeartbeat`, if the DN's missed-heartbeat counter has reached `ozone.datanode.scm.heartbeat.address.refresh.threshold` (default 3), `HeartbeatEndpointTask` asks `SCMConnectionManager.refreshSCMServer` to re-resolve. On a successful swap, the connection manager builds a fresh `EndpointStateMachine` bound to the new IP **before** removing the stale entry (atomic-replace pattern), commits the swap under the write lock, releases the lock, and closes the stale endpoint outside the lock. `StateContext.migrateEndpoint` then re-keys the per-endpoint queues (incremental reports, container actions, pipeline actions) so in-flight reports survive the swap. | | **OM → SCM** | `SCMProxyInfo` retains the config-time `host:port`. `SCMFailoverProxyProviderBase.refreshProxyAddressIfChanged(nodeId)` runs in `shouldRetry` when the exception chain contains a connection-class type. The DNS lookup happens **outside** the provider monitor; the swap is committed under the lock with a re-check that defends against lost updates. The stale proxy is stopped via `RPC.stopProxy` outside the lock. | | **Client → OM (Hadoop RPC)** | `OMProxyInfo.refreshAddressIfChanged()` re-resolves outside the entry monitor, swaps `rpcAddr` / `dtService` / `proxy=null` atomically inside, and stops the stale proxy outside. `OMFailoverProxyProviderBase.shouldRetry` invokes the refresh on connection-class exceptions; on successful refresh, it pins `nextProxyIndex` to the current node so `RetryInvocationHandler.performFailover()` does NOT advance the failover ring past the just-refreshed peer. | | **Client → OM (gRPC)** | No code change. `GrpcOMFailoverProxyProvider` passes a placeholder `InetSocketAddress(0)` and lets gRPC's `NameResolver` re-resolve hostnames on its own schedule. | | **OM ↔ OM control plane** | Uses Hadoop RPC via `OMInterServiceProtocol`. Recovers transitively via the Client → OM machinery above. | | **OM ↔ OM Ratis replication** | `OzoneManagerRatisServer.createRaftPeer` always passes `omNode.getRatisHostPortStr()` (a hostname string) to `RaftPeer.setAddress` — never a resolved IP. Two of three pre-PR `createRaftPeer` branches called `new InetSocketAddress(omNode.getInetAddress(), ratisPort)`, which baked the resolved IP into Ratis's peer state. With hostname-only addresses, gRPC's default `DnsNameResolver` (used by Ratis under the hood) re-resolves on connection failure. **No Ratis upstream change required.** | | **SCM ↔ SCM Ratis replication** | Already used hostname strings; PR removes a misleading `// TODO : Should we use IP instead of hostname??` comment in `SCMRatisServerImpl.buildRaftGroup` and `SCMHAManagerImpl` and replaces with explanatory comments. | ### Connection-class exception filter Re-resolution is gated on exception types where DNS could plausibly help, classified by `org.apache.hadoop.hdds.utils.ConnectionFailureUtils.isConnectionFailure`: - `java.net.ConnectException` — connection refused / unreachable (OpenStack fast-RST) - `java.net.SocketTimeoutException` — silent packet drop (the dominant failure shape on AWS EC2 / EKS) - `java.net.NoRouteToHostException` — host route gone - `java.net.UnknownHostException` — DNS lookup failed downstream - `java.io.EOFException` — load balancer or iptables RST closed the connection cleanly - `java.net.SocketException` — RST mid-stream The classifier walks the cause chain bounded to depth 16 to defend against pathological cycles. Application-level errors (`OMException`, `OMNotLeaderException`, `AccessControlException`, `RetryAction`-coded responses) are explicitly excluded so DNS load is not amplified by logical failures. ## Design choices and rationale ### Atomic-replace pattern at every swap site Every refresh path follows the same ordering: **build the new resource → atomically install it → close the old resource**. Never remove-then-build (a build failure would leave the registry empty for that key). Never close-then-build (a build failure would leave the system with neither old nor new resource). The pattern is visible in: - `SCMConnectionManager.refreshSCMServer`: build replacement endpoint before swapping `scmMachines`. - `OMProxyInfo.refreshAddressIfChanged`: capture stale proxy reference, swap fields under monitor, call `RPC.stopProxy(stale)` outside the monitor. - `SCMFailoverProxyProviderBase.refreshProxyAddressIfChanged`: build new `SCMProxyInfo` outside lock, re-check under lock, swap, stop stale outside lock. ### No I/O inside critical sections DNS lookups, RPC proxy construction, and proxy teardown are all blocking. None happen inside `synchronized` monitors or `writeLock()` regions in the new code. `SCMConnectionManager.refreshSCMServer` is split into 4 phases: 1. Read lock: snapshot the endpoint reference and preserved `hostAndPort`. 2. No lock: DNS lookup via `resolveLatestAddress()`. 3. Write lock: re-check the snapshot is current (defends against concurrent removeSCMServer/refresh races), enforce collision invariant, build replacement, commit swap. 4. No lock: close stale endpoint. This pattern prevents a slow / dead resolver from stalling unrelated heartbeats and reconfiguration paths. ### Failover ring pinning after refresh-success When the failover provider successfully refreshes the current peer's IP, `shouldRetry` returns `FAILOVER_AND_RETRY` but explicitly pins `nextProxyIndex` (OM) / `updatedLeaderNodeID` (SCM) to the current peer. Without this, the surrounding `RetryInvocationHandler.performFailover` would walk to the next peer in the ring, bypassing the just-fixed peer for up to N-1 attempts in an N-peer HA cluster. ### Endpoint queue migration preserves the "endpoint ⇒ queue" invariant `StateContext.migrateEndpoint` re-keys per-endpoint queues (incremental reports, container actions, pipeline actions, full-report flags) when an endpoint is swapped. The migration ordering preserves the invariant that every endpoint in the `endpoints` set has a corresponding queue at every observable point: 1. **PUBLISH**: install new-key queues alongside the old-key queues. 2. **SWITCH**: add `newEndpoint` to the endpoints set; remove `oldEndpoint` from the endpoints set. 3. **RETIRE**: drop the old-key queues (no producer can reach them after step 2). Without this ordering, a producer iterating the `endpoints` set could see the old endpoint key, look up its queue, find it removed, and silently drop the report. `endpoints` is now a `CopyOnWriteArraySet` (was `HashSet`); `incrementalReportsQueue`, `containerActions`, `pipelineActions`, and `isFullReportReadyToBeSent` are now `ConcurrentHashMap` (or already were). Producers also null-skip queue lookups as defense-in-depth. ### Collision handling If the freshly-resolved IP collides with another already-registered SCM peer key (e.g., transient kube-dns returning peer-B's IP for peer-A's hostname), `refreshSCMServer` refuses the swap rather than overwriting the colliding peer. The stale endpoint stays registered and the next heartbeat retries DNS. ### Secure-cluster prerequisite When `ozone.client.failover.resolve-needed=true` is enabled on a Kerberos-secured cluster, operators MUST also set `hadoop.security.token.service.use_ip=false` in `core-site.xml`. The Hadoop delegation-token service identifier defaults to an `IP:port` string; after a refresh, the per-OM service identifier built from the new IP no longer matches the IP-based service captured on long-lived tokens, and token selection silently fails for the refreshed peer. With `use_ip=false` the service identifier is the stable `hostname:port`, which survives any IP change. This is the same prerequisite HADOOP-17068 carries. Documented inline in `ozone-default.xml` on the `ozone.client.failover.resolve-needed` entry. ## How was this patch tested? 86 tests across the touched modules, all passing under `mvn clean test`: | Test class | Count | Coverage | |---|---|---| | `TestConnectionFailureUtils` (new) | 20 | Filter classification: bare + `IOException`-wrapped, deeply nested chains (3 levels), application-level negative cases, length-2 cause-chain cycles, 1024-deep non-matching chains (cost bound). | | `TestOMFailoverProxyProviderRefreshWired` (new) | 5 | Wired retry path. `SocketTimeoutException` triggers `maybeRefreshCurrentOmAddress` (the EC2 silent-drop case end-to-end, not just helper-in-isolation). `ConnectException` triggers refresh. `OMException` does NOT. Flag-off does NOT. After a successful refresh, `performFailover` stays on the same `nodeId`. | | `TestHeartbeatEndpointTaskDnsRefresh` (new) | 6 | Production trigger chain. `HeartbeatEndpointTask.call()` catch block fires `refreshSCMServer` only when (a) flag enabled, (b) threshold met, (c) cause is connection-class, (d) `host:port` preserved. `AccessControlException` at threshold does NOT trigger. After a successful swap, `StateContext`'s incremental-reports map has the new key and not the old key. | | `TestSCMConnectionManagerDnsRefreshE2E` (new) | 1 | Real-RPC swap mechanism with `@Timeout(30)`. Stands up a real `ScmTestMock` RPC server, primes the connection manager with a stale `127.0.0.99` cache + preserved hostname, calls `refreshSCMServer`, asserts a real `sendHeartbeat` round-trips through the swapped endpoint. | | `TestOMProxyInfoDnsRefresh` (new, expanded) | 4 | Per-instance refresh: no-op preserves the cached proxy, swap on IP change, rebuilt proxy is dialed with the freshly-resolved address (not a sentinel), `dtService` updates. | | `TestSCMFailoverProxyProviderRefresh` (new) | 3 | SCM swap mechanism: swap on IP change, no-op when unchanged, no-op without preserved `host:port`. | | `TestSCMConnectionManager` (extended) | 7 (1 prior + 6 new) | `resolveLatestAddress` edge cases. `refreshSCMServer` happy-path swap. No-op when `host:port` not preserved. Rollback regression: when `buildScmEndpoint` throws, the stale endpoint remains registered (uses a `@VisibleForTesting` overridable hook to inject the failure). | | `TestOzoneManagerRatisServer` (extended) | 6 (5 prior + 1 new) | `RaftPeer.getAddress()` is a `hostname:port` string, never an IP-format string. | Existing regression suites verified non-regressed: `TestEndPoint` (17), `TestOMFailoverProxyProvider` (8), `TestOMFailovers` (1), `TestHeartbeatEndpointTask` (8). ## Scope and known limitations - **DN initial bringup with stale DNS**: the refresh fires from the `HEARTBEAT` phase via `HeartbeatEndpointTask`. If a DataNode starts up with the SCM peer already at a stale IP and never reaches `HEARTBEAT`, the recovery path does not engage. Initial-bringup DNS staleness is the existing concern of [HDDS-5919](https://issues.apache.org/jira/browse/HDDS-5919)'s `ozone.network.jvm.address.cache.enabled=false`. `InitDatanodeState.java` already postpones initialization on initial-resolution failure. - **HDFS-14118-style construction-time DNS fan-out** (one hostname → multiple persistent IPs, for round-robin DNS HA) is a different problem and out of scope. Worth a follow-on JIRA if needed. - **Ratis quorum-loss exit-0** in `SCMStateMachine.close()` (calls `ExitUtils.terminate(0, ...)` when leader election fails to converge, leading to Kubernetes CrashLoopBackOff death spirals) is a separate concern worth its own JIRA. - **Reverse-DNS edge case**: when `ozone.om.address` is configured as an IP literal (rather than a hostname), `NodeDetails.getRatisHostPortStr()` chains through `getHostName()` which can trigger PTR lookup on some JDKs. Operators are expected to configure SCM/OM addresses as DNS names in K8s deployments where this PR's behaviour is enabled. - **Operator-facing markdown documentation** for the new flags lives only in `ozone-default.xml`. Worth a separate docs PR. ## What is the link to the Apache JIRA? https://issues.apache.org/jira/browse/HDDS-15514 ## References - [HADOOP-17068](https://issues.apache.org/jira/browse/HADOOP-17068): client fails forever when namenode ipaddr changed (Hadoop 3.4.0). The model for this PR. - [HDFS-14118](https://issues.apache.org/jira/browse/HDFS-14118): introduces `dfs.client.failover.resolve-needed` and the `AbstractNNFailoverProxyProvider.getResolvedAddressesIfNecessary` hook. - HBase: `hbase.resolve.hostnames.on.failure` (`ConnectionImplementation.RESOLVE_HOSTNAME_ON_FAIL_KEY`). - ZOOKEEPER-1506, ZOOKEEPER-2982: ZK `StaticHostProvider` re-resolves on each `next()` call. - [HDDS-5919](https://issues.apache.org/jira/browse/HDDS-5919): introduces `ozone.network.jvm.address.cache.enabled` (default `true`). JVM-level DNS cache TTL — necessary but not sufficient for the long-lived `InetSocketAddress` problem this PR fixes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
