kerneltime opened a new pull request, #10485:
URL: https://github.com/apache/ozone/pull/10485

   ## What changes were proposed in this pull request?
   
   This is **PR 1 of 4** splitting 
[HDDS-15514](https://issues.apache.org/jira/browse/HDDS-15514) (originally 
proposed as a single ~160KB patch in #10473, split per @szetszwo's review 
feedback).
   
   This PR fixes the Ratis-replication paths only:
   
   - **OM ↔ OM Ratis replication**: `OzoneManagerRatisServer.createRaftPeer` 
now always passes a `hostname:port` string to `RaftPeer.setAddress(...)`, never 
a resolved `InetSocketAddress`. Two of the three pre-existing `createRaftPeer` 
branches called `new InetSocketAddress(omNode.getInetAddress(), ratisPort)`, 
baking the resolved IP into `RaftPeer.address`. The two overloads collapse into 
one.
   - **SCM ↔ SCM Ratis replication**: comment-only — replaces the misleading 
`// TODO : Should we use IP instead of hostname??` markers in 
`SCMRatisServerImpl.buildRaftGroup` and `SCMHAManagerImpl#start` with 
explanatory comments citing HDDS-15514. Both call sites already passed 
`getRatisHostPortStr()`; the TODO falsely implied IPs would be a fine 
alternative.
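
To make the distinction above concrete, here is a minimal sketch of the two address shapes. It is not the Ozone code: the class name, helper names, host name, and IP are all hypothetical, and it only uses `java.net.InetSocketAddress`. How Ratis itself stringifies an `InetSocketAddress` is not shown here.

```java
import java.net.InetSocketAddress;

final class RaftPeerAddressSketch {

  /** Pre-fix shape: format an already-resolved socket address, which yields the IP. */
  static String fromResolved(InetSocketAddress resolved) {
    return resolved.getAddress().getHostAddress() + ":" + resolved.getPort();
  }

  /** Post-fix shape: carry the configured hostname through untouched. */
  static String fromHostname(String hostname, int port) {
    return hostname + ":" + port;
  }

  public static void main(String[] args) {
    // An IP literal stands in for whatever DNS returned at construction time
    // (hypothetical value; a literal is parsed without any DNS lookup).
    InetSocketAddress resolved = new InetSocketAddress("10.0.0.5", 9872);
    System.out.println(fromResolved(resolved));
    System.out.println(fromHostname("om-0.om.example", 9872));
  }
}
```

The first helper loses the name as soon as the address is resolved; the second keeps it, which is the property this PR relies on.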
   
   ## Why this matters
   
   Ratis builds gRPC channels via `NettyChannelBuilder.forTarget(address)`. The 
default `DnsNameResolver` re-resolves hostnames on connection failure, so a 
peer-pod restart in Kubernetes (where the pod's IP changes but the DNS name 
stays stable) is recovered automatically — **as long as `RaftPeer.address` is a 
hostname**. If the address is a numeric IP (because Ozone resolved it at 
construction and passed the resolved form to Ratis), there is no name left to 
re-resolve: the channel stays bound to the now-defunct IP, and the only 
recovery is restarting the process that owns the channel.
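
The failure mode can be simulated without gRPC or a real resolver. The sketch below is an illustration under stated assumptions: `FAKE_DNS` is a hypothetical in-memory stand-in for DNS, and the host name and IPs are made up. It only models "look the target up again on each connect" versus "reuse an IP captured earlier".

```java
import java.util.HashMap;
import java.util.Map;

final class ReResolveSketch {

  /** Hypothetical stand-in for DNS: hostname -> current pod IP. */
  static final Map<String, String> FAKE_DNS = new HashMap<>();

  /** Looks the target up on every connect; an IP literal has no entry and is used as-is. */
  static String connectTarget(String address) {
    return FAKE_DNS.getOrDefault(address, address);
  }

  /** Returns {target reached via hostname, target reached via pinned IP} after a pod restart. */
  static String[] targetsAfterRestart() {
    FAKE_DNS.put("om-1.om.example", "10.0.0.5");
    String pinnedIp = connectTarget("om-1.om.example"); // resolved once, at construction

    FAKE_DNS.put("om-1.om.example", "10.0.0.9");        // pod restarts with a new IP

    return new String[] {connectTarget("om-1.om.example"), connectTarget(pinnedIp)};
  }

  public static void main(String[] args) {
    String[] targets = targetsAfterRestart();
    System.out.println("hostname target  -> " + targets[0]);
    System.out.println("pinned IP target -> " + targets[1]);
  }
}
```

After the simulated restart, the hostname target follows the new IP while the pinned target still points at the old one.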
   
   ## How was this patch tested?
   
   - Unit test: 
`TestOzoneManagerRatisServer.testCreateRaftPeerUsesHostnameAddress` — asserts 
that `RaftPeer.getAddress()` for an OM peer is the literal `hostname:port` 
string and is not an IPv4 numeric form. This guards the invariant against 
future regressions that re-introduce `InetSocketAddress` at this seam.
   - Existing `TestOzoneManagerRatisServer` tests (5 prior + 1 new) all pass.
   - Ran `mvn clean install` on the `hadoop-ozone/ozone-manager` and 
`hadoop-hdds/server-scm` modules.
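
For readers who want a feel for the "not an IPv4 numeric form" check described above, the following is one possible sketch. It is not the assertion used in `TestOzoneManagerRatisServer`; the class name, method name, and regular expression are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

final class HostnameAddressCheck {

  // Matches strings shaped like "d.d.d.d:port"; it does not validate octet ranges.
  private static final Pattern IPV4_HOST_PORT =
      Pattern.compile("\\d{1,3}(\\.\\d{1,3}){3}:\\d+");

  /** True when the whole address looks like a dotted-quad IPv4 literal plus port. */
  static boolean looksLikeIpv4HostPort(String address) {
    return IPV4_HOST_PORT.matcher(address).matches();
  }

  public static void main(String[] args) {
    System.out.println(looksLikeIpv4HostPort("10.0.0.5:9872"));
    System.out.println(looksLikeIpv4HostPort("om-0.om.example:9872"));
  }
}
```

A test built on a check like this would pass for a `hostname:port` string and fail if a resolved IPv4 address reappeared at the same seam.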
   
   ## Scope of this PR
   
   - Pure hostname-string discipline. No new flags, no new exception 
classifier, no atomic-replace machinery, no `RPC.stopProxy` calls. Those arrive 
with the next three PRs.
   - No behavior change in non-Kubernetes deployments where the DNS-to-IP 
mapping is stable for the process lifetime.
   
   ## Follow-up PRs
   
   Per the [4-PR 
split](https://github.com/apache/ozone/pull/10473#issuecomment-4677191510):
   
   2. **HDDS-15514. DNS refresh on connection failure for Client → OM** — 
introduces `ConnectionFailureUtils`, the `ozone.client.failover.resolve-needed` 
flag, `OMProxyInfo.refreshAddressIfChanged`, and the 
`OMFailoverProxyProviderBase.shouldRetry` hook.
   3. **HDDS-15514. DNS refresh on connection failure for OM → SCM** — same 
shape against `SCMFailoverProxyProviderBase` / `SCMProxyInfo`.
   4. **HDDS-15514. DNS refresh on heartbeat failure for DN → SCM** — 
`EndpointStateMachine.resolveLatestAddress`, 
`SCMConnectionManager.refreshSCMServer` (4-phase atomic-replace), 
`StateContext.migrateEndpoint`, and the 
`ozone.datanode.scm.heartbeat.address.refresh.threshold` knob.
   
   ## What is the link to the Apache JIRA?
   
   https://issues.apache.org/jira/browse/HDDS-15514
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

