kerneltime opened a new pull request, #10485: URL: https://github.com/apache/ozone/pull/10485
## What changes were proposed in this pull request?

This is **PR 1 of 4** splitting [HDDS-15514](https://issues.apache.org/jira/browse/HDDS-15514) (originally proposed as a single ~160KB patch in #10473, split per @szetszwo's review feedback). This PR fixes the Ratis-replication paths only:

- **OM ↔ OM Ratis replication**: `OzoneManagerRatisServer.createRaftPeer` now always passes a `hostname:port` string to `RaftPeer.setAddress(...)`, never a resolved `InetSocketAddress`. Two of the three pre-existing `createRaftPeer` branches called `new InetSocketAddress(omNode.getInetAddress(), ratisPort)`, baking the resolved IP into `RaftPeer.address`. The two overloads collapse into one.
- **SCM ↔ SCM Ratis replication**: comment-only. Replaces the misleading `// TODO : Should we use IP instead of hostname??` markers in `SCMRatisServerImpl.buildRaftGroup` and `SCMHAManagerImpl#start` with explanatory comments citing HDDS-15514. Both call sites already passed `getRatisHostPortStr()`; the TODO falsely implied IPs would be a fine alternative.

## Why this matters

Ratis builds gRPC channels via `NettyChannelBuilder.forTarget(address)`. The default `DnsNameResolver` re-resolves hostnames on connection failure, so a peer-pod restart in Kubernetes (where the pod's IP changes but the DNS name stays stable) is recovered automatically, **as long as `RaftPeer.address` is a hostname**. If the address is a numeric IP (because Ozone resolved it at construction and passed the resolved form to Ratis), the channel stays bound to the now-defunct IP forever and the only recovery is a parent-process restart.

## How was this patch tested?

- Unit test: `TestOzoneManagerRatisServer.testCreateRaftPeerUsesHostnameAddress` asserts that `RaftPeer.getAddress()` for an OM peer is the literal `hostname:port` string and is not an IPv4 numeric form. This guards the invariant against future regressions that re-introduce `InetSocketAddress` at this seam.
- Existing `TestOzoneManagerRatisServer` tests (5 prior + 1 new) all pass.
- `mvn clean install` of `hadoop-ozone/ozone-manager` and `hadoop-hdds/server-scm` modules.

## Scope of this PR

- Pure hostname-string discipline. No new flags, no new exception classifier, no atomic-replace machinery, no `RPC.stopProxy` calls. Those arrive with the next three PRs.
- Zero behavior change in non-K8s deployments where DNS-to-IP is stable for the process lifetime.

## Follow-up PRs

Per the [4-PR split](https://github.com/apache/ozone/pull/10473#issuecomment-4677191510):

2. **HDDS-15514. DNS refresh on connection failure for Client → OM**: introduces `ConnectionFailureUtils`, the `ozone.client.failover.resolve-needed` flag, `OMProxyInfo.refreshAddressIfChanged`, and the `OMFailoverProxyProviderBase.shouldRetry` hook.
3. **HDDS-15514. DNS refresh on connection failure for OM → SCM**: same shape against `SCMFailoverProxyProviderBase` / `SCMProxyInfo`.
4. **HDDS-15514. DNS refresh on heartbeat failure for DN → SCM**: `EndpointStateMachine.resolveLatestAddress`, `SCMConnectionManager.refreshSCMServer` (4-phase atomic-replace), `StateContext.migrateEndpoint`, and the `ozone.datanode.scm.heartbeat.address.refresh.threshold` knob.

## What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-15514
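
For readers unfamiliar with the hostname-versus-IP distinction described under "Why this matters", the following is a minimal sketch using only `java.net`. It is not code from this patch and does not touch Ratis or Ozone classes: `PeerAddressSketch`, its method names, the hostname `om1.example.internal`, the IP `10.0.0.5`, and the port `9872` are all assumptions made up for illustration.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

/** Hypothetical illustration only; not Ozone or Ratis code. */
final class PeerAddressSketch {

  private PeerAddressSketch() { }

  /** Pitfall: formats the already-resolved IP, so the DNS name is lost. */
  static String resolvedForm(InetAddress resolved, int port) {
    InetSocketAddress socketAddress = new InetSocketAddress(resolved, port);
    return socketAddress.getAddress().getHostAddress() + ":" + socketAddress.getPort();
  }

  /** Preferred: keep the DNS name so a resolver can look it up again later. */
  static String hostnameForm(String hostname, int port) {
    return hostname + ":" + port;
  }

  /** True if the host part of "host:port" is a dotted-quad IPv4 literal. */
  static boolean isIpv4Literal(String hostPort) {
    int colon = hostPort.lastIndexOf(':');
    String host = colon < 0 ? hostPort : hostPort.substring(0, colon);
    return host.matches("\\d{1,3}(\\.\\d{1,3}){3}");
  }

  public static void main(String[] args) throws UnknownHostException {
    // Made-up hostname and IP; getByAddress performs no DNS lookup.
    InetAddress resolved =
        InetAddress.getByAddress("om1.example.internal", new byte[] {10, 0, 0, 5});
    int port = 9872; // assumed port, for illustration

    String pinned = resolvedForm(resolved, port);
    String stable = hostnameForm("om1.example.internal", port);

    System.out.println(pinned + " ipv4Literal=" + isIpv4Literal(pinned));
    System.out.println(stable + " ipv4Literal=" + isIpv4Literal(stable));
  }
}
```

Running `main` prints `10.0.0.5:9872 ipv4Literal=true` followed by `om1.example.internal:9872 ipv4Literal=false`. Only the second string still carries a name that can be re-resolved after a pod restart, which is the property the new unit test is described as guarding.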
