kerneltime opened a new pull request, #10486: URL: https://github.com/apache/ozone/pull/10486
## What changes were proposed in this pull request? This is **PR 2 of 4** splitting [HDDS-15514](https://issues.apache.org/jira/browse/HDDS-15514) (originally proposed as a single ~160KB patch in #10473, split per @szetszwo's review feedback). This PR fixes the **Client → OM Hadoop-RPC path** and introduces the shared infrastructure (connection-failure classifier, opt-in flag) that the remaining proxy-provider PRs in this series will reuse. > **Stacked on [#10485](https://github.com/apache/ozone/pull/10485)** (PR-1 of 4 — Ratis peer addresses as hostnames). Please review PR-1 first; that branch is the merge base for this PR. The diff against `master` includes PR-1's hunks until PR-1 lands. ## Why this matters `OMProxyInfo` — the per-OM peer record inside `OMFailoverProxyProvider` — constructs an `InetSocketAddress` from the configured `host:port` once at process start, then re-uses it for the proxy's lifetime. `InetSocketAddress` performs a one-shot DNS lookup at construction and **freezes** the resolved IP. When an OM pod is rescheduled in Kubernetes (or any environment where pod IPs change while DNS names remain stable), every subsequent client RPC dials the now-defunct IP forever. The failover ring walks past the broken peer to the next OM, but the broken peer never recovers without a process restart. Note: the gRPC OM-client variant (`GrpcOMFailoverProxyProvider`) already passes a placeholder `InetSocketAddress(0)` and lets gRPC's `NameResolver` re-resolve hostnames on its own schedule — so it is unaffected by this bug and unchanged in this PR. ## What this PR does ### 1. Shared infrastructure | Artifact | Purpose | |---|---| | `org.apache.hadoop.hdds.utils.ConnectionFailureUtils` | Classifies a `Throwable` (with cause-chain walk bounded to depth 16) as a connection-class failure or not. Connection-class types: `ConnectException`, `SocketTimeoutException`, `NoRouteToHostException`, `UnknownHostException`, `EOFException`, `SocketException`. Application-level errors (`OMException`, `OMNotLeaderException`, `AccessControlException`, `RetryAction`-coded responses) are NOT classified as connection failures, so DNS load is not amplified by logical failures. | | `ozone.client.failover.resolve-needed` | Single opt-in flag for the entire DNS-refresh-on-failure feature, defaulting **`false`** so existing non-K8s deployments see zero behaviour change. Mirrors `dfs.client.failover.resolve-needed` (HDFS-14118) and `hbase.resolve.hostnames.on.failure`. | The flag and classifier ship with this PR but are also used (in subsequent PRs) by the SCM proxy provider and the DN heartbeat task. Landing them with the OM PR follows @szetszwo's "OM first" guidance. ### 2. `OMProxyInfo` becomes refreshable | Change | Why | |---|---| | Preserves the original `host:port` string (`rpcAddrStr`) alongside the resolved `InetSocketAddress` | The string is the source of truth for re-resolution; the `InetSocketAddress` is now a derived cache. | | `refreshAddressIfChanged()` re-resolves `rpcAddrStr` outside the entry monitor, then atomically swaps the cached address, the derived `dtService`, and nulls the cached `proxy` under the monitor | DNS lookup must not happen under any lock (a slow/dead resolver would freeze concurrent readers). Building `dtService` first means a `SecurityUtil.buildTokenService` throw (rare but possible) aborts cleanly with no half-swap. | | Old proxy is stopped via `RPC.stopProxy(staleProxy)` **outside** the monitor | Avoid blocking concurrent readers behind socket teardown. | | `setCachedAddressForTest` (`@VisibleForTesting`) | Test seam to inject a deliberately stale cached address without a real OM peer. | ### 3. `OMFailoverProxyProviderBase.shouldRetry` calls the refresh on connection-class exceptions After the existing `shouldFailover` filter and before the failover-index advance, when the flag is enabled and `ConnectionFailureUtils.isConnectionFailure(exception)` matches, the provider: 1. Calls `OMProxyInfo.refreshAddressIfChanged()` for the current peer. 2. On a successful refresh, returns `FAILOVER_AND_RETRY` but **pins `nextProxyIndex` to the current node** so `RetryInvocationHandler.performFailover()` does NOT advance past the just-refreshed peer. Without this pin, an N-peer HA cluster would skip the now-fixed peer for up to N-1 attempts. ### 4. `HadoopRpcOMFailoverProxyProvider` wiring Pass the preserved hostname string into `OMProxyInfo` at construction. `HadoopRpcOMFollowerReadFailoverProxyProvider` adjusts to match. ## Atomic-replace pattern Every refresh path in this series follows the same ordering: **build the new resource → atomically install it → close the old resource**. Never remove-then-build (a build failure would leave the system without a peer). Never close-then-build (same). For `OMProxyInfo.refreshAddressIfChanged`: 1. Re-resolve `rpcAddrStr` (no lock). 2. Build the new `dtService` (no lock). 3. Capture the stale proxy reference, swap the address/dtService/proxy=null fields under the entry monitor. 4. Stop the stale proxy outside the monitor. ## Secure-cluster prerequisite When `ozone.client.failover.resolve-needed=true` is enabled on a Kerberos-secured cluster, operators **must** also set `hadoop.security.token.service.use_ip=false` in `core-site.xml`. The Hadoop delegation-token service identifier defaults to an `IP:port` string; after a refresh, the per-OM service identifier built from the new IP no longer matches the IP-based service captured on long-lived tokens, and token selection silently fails for the refreshed peer. With `use_ip=false` the service identifier is the stable `hostname:port`, which survives any IP change. Same prerequisite HADOOP-17068 carries. Documented inline in `ozone-default.xml`. ## How was this patch tested? | Test class | Count | Coverage | |---|---|---| | `TestConnectionFailureUtils` (new) | 20 | Filter classification: bare types, `IOException`-wrapped, deeply nested cause chains (3 levels), application-level negative cases, length-2 cause-chain cycles (terminates), 1024-deep non-matching chains (cost bound). | | `TestOMProxyInfoDnsRefresh` (new) | 4 | Per-instance refresh: no-op preserves cached proxy, swap on IP change, rebuilt proxy is dialed with the freshly-resolved address (not a sentinel), `dtService` updates. Uses the `@VisibleForTesting` setter rather than reflection. | | `TestOMFailoverProxyProviderRefreshWired` (new) | 5 | End-to-end retry path. `SocketTimeoutException` triggers `maybeRefreshCurrentOmAddress` (the AWS EC2 silent-drop case end-to-end). `ConnectException` triggers refresh. `OMException` does NOT. Flag-off does NOT. After a successful refresh, `performFailover` stays on the same `nodeId`. | Existing regression suites confirmed non-regressed: `TestOMFailoverProxyProvider` (8), `TestOMFailovers` (1). ## Scope of this PR - **Includes:** Client → OM Hadoop RPC. Includes the shared `ConnectionFailureUtils` and the `ozone.client.failover.resolve-needed` flag. - **Free transitive benefit:** OM ↔ OM control-plane RPC (`OMInterServiceProtocol`) rides on the same `OMFailoverProxyProvider`, so OM-to-OM Hadoop-RPC recovery is also fixed by this PR. - **Out of scope (later PRs):** - **PR-3:** OM → SCM (`SCMFailoverProxyProviderBase`, `SCMProxyInfo`). Same shape, different provider. - **PR-4:** DN → SCM heartbeat (`EndpointStateMachine`, `SCMConnectionManager`, `StateContext`). Brings the `ozone.datanode.scm.heartbeat.address.refresh.threshold` knob. ## What is the link to the Apache JIRA? https://issues.apache.org/jira/browse/HDDS-15514 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
