kerneltime commented on code in PR #10470:
URL: https://github.com/apache/ozone/pull/10470#discussion_r3381736016
##########
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/SCMConnectionManager.java:
##########
@@ -165,14 +180,61 @@ public void addSCMServer(InetSocketAddress address,
rpcProxy);
EndpointStateMachine endPoint = new EndpointStateMachine(address,
- rpcClient, this.conf, threadNamePrefix);
+ hostAndPort, rpcClient, this.conf, threadNamePrefix);
endPoint.setPassive(false);
scmMachines.put(address, endPoint);
} finally {
writeUnlock();
}
}
+ /**
+ * Re-resolve the SCM hostname for the endpoint at {@code oldAddress} and,
+ * if the resolved IP has changed, atomically replace the endpoint with a
+ * fresh one bound to the new address.
+ * <p>
+ * Returns the new {@link InetSocketAddress} on a successful swap, or null
+ * if no swap occurred (endpoint not found, hostname not preserved at
+ * construction, IP unchanged, or DNS lookup failed). Callers receive
+ * enough information to update any external maps keyed by the old
+ * address.
+ * <p>
+ * The replacement is built via the existing {@link #addSCMServer} path,
+ * so the new endpoint starts in {@code GETVERSION} state and re-walks
+ * version → register → heartbeat. This is the correct behavior: a peer
+ * that has been rescheduled is effectively a fresh process.
+ */
+ public InetSocketAddress refreshSCMServer(InetSocketAddress oldAddress,
+ String threadNamePrefix) throws IOException {
+ writeLock();
+ try {
+ EndpointStateMachine existing = scmMachines.get(oldAddress);
+ if (existing == null) {
+ return null;
+ }
+ InetSocketAddress refreshed = existing.resolveLatestAddress();
+ if (refreshed == null) {
+ return null;
+ }
+ // Tear down before adding to avoid two endpoints claiming the same
+ // logical SCM peer.
+ String hostAndPort = existing.getHostAndPort();
+ scmMachines.remove(oldAddress);
+ try {
+ existing.close();
+ } catch (RuntimeException closeEx) {
+ LOG.warn("Failed to close stale endpoint {}: {}", oldAddress,
+ closeEx.getMessage());
+ }
+ addSCMServer(refreshed, hostAndPort, threadNamePrefix);
+ LOG.info("DNS re-resolution: SCM endpoint {} -> {} (host {}).",
+ oldAddress, refreshed, hostAndPort);
+ return refreshed;
Review Comment:
Good catch — fixed in `642dd17a`.
Factored the proxy/endpoint construction out of `addSCMServer` into a
private `buildScmEndpoint(...)`. `refreshSCMServer` now calls
`buildScmEndpoint(refreshed, ...)` BEFORE removing the stale entry. If
construction throws (transient DNS during `RPC.getProtocolProxy`, `NetUtils`
refusing the new address, peer not yet accepting), the existing endpoint stays
registered in `scmMachines` and the IOException is propagated to the caller.
Heartbeats continue against the stale IP — better than the peer disappearing
entirely from the connection manager.
New regression test
`TestSCMConnectionManager.testRefreshSCMServerLeavesStaleEndpointOnBuildFailure`
injects a build failure via a `@VisibleForTesting`-overridable hook and
asserts (a) the IOException propagates, (b) `getNumOfConnections()` is still 1,
(c) the SAME `EndpointStateMachine` instance is still registered (not a
half-constructed replacement). 21/21 in `TestSCMConnectionManager` + 6/6 in
`TestHeartbeatEndpointTaskDnsRefresh` pass.
##########
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/OMProxyInfo.java:
##########
@@ -43,9 +43,26 @@ public final class OMProxyInfo<T> extends ProxyInfo<T> {
private static final Logger LOG = LoggerFactory.getLogger(OMProxyInfo.class);
private final String nodeId;
+ /**
+ * The original "host:port" config string. Stable for the lifetime of
+ * this OMProxyInfo; used as the source of truth for re-resolving DNS
+ * when the cached IP becomes stale (Kubernetes pod-IP-change recovery).
+ */
private final String rpcAddrStr;
- private final InetSocketAddress rpcAddr;
- private final Text dtService;
+ /**
+ * The currently-resolved address. Initialized at construction by
+ * resolving {@link #rpcAddrStr}, and may be replaced atomically by
+ * {@link #refreshAddressIfChanged()} when the failover provider
+ * detects that the SCM peer has been rescheduled to a new IP.
Review Comment:
Fixed in `642dd17a` — copy-paste artefact from the SCM-side cousin. Now
reads "OM node has been rescheduled to a new IP."
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]