Copilot commented on code in PR #10470:
URL: https://github.com/apache/ozone/pull/10470#discussion_r3383156893


##########
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/ozone/container/common/TestSCMConnectionManagerDnsRefreshE2E.java:
##########
@@ -0,0 +1,158 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.ozone.container.common;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+import java.net.InetAddress;
+import java.net.InetSocketAddress;
+import org.apache.hadoop.hdds.conf.OzoneConfiguration;
+import 
org.apache.hadoop.hdds.protocol.proto.StorageContainerDatanodeProtocolProtos.SCMHeartbeatRequestProto;
+import 
org.apache.hadoop.hdds.protocol.proto.StorageContainerDatanodeProtocolProtos.SCMHeartbeatResponseProto;
+import org.apache.hadoop.ipc_.RPC;
+import 
org.apache.hadoop.ozone.container.common.statemachine.EndpointStateMachine;
+import 
org.apache.hadoop.ozone.container.common.statemachine.SCMConnectionManager;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.Timeout;
+
+/**
+ * Verifies the swap-mechanism + real-network-handshake half of the
+ * DataNode → SCM DNS-refresh-on-failure recovery path.
+ * <p>
+ * Stands up a real SCM-side RPC server (via {@link ScmTestMock}) bound to a
+ * loopback address on an OS-assigned port, then primes
+ * {@link SCMConnectionManager} with a deliberately stale entry: cached at
+ * {@code 127.0.0.99:port} (no listener, unreachable) but with the preserved
+ * hostname {@code localhost:port}. This is the steady state an Ozone
+ * DataNode falls into when a Kubernetes SCM pod has been rescheduled to a
+ * new IP -- the cached InetSocketAddress points to the gone-away IP, but
+ * DNS for the hostname now resolves elsewhere.
+ * <p>
+ * The test calls {@link SCMConnectionManager#refreshSCMServer} directly
+ * and asserts that:
+ * <ol>
+ *   <li>the swap occurred,</li>
+ *   <li>the cached endpoint now points at the live SCM,</li>
+ *   <li>a real heartbeat through the swapped endpoint round-trips
+ *       successfully (mock SCM observes the count incrementing).</li>
+ * </ol>
+ * Together these prove the swap-mechanism path: address re-resolution,
+ * atomic endpoint swap, fresh RPC proxy construction, real network
+ * handshake, server-side handler invocation.
+ * <p>
+ * The complementary trigger-chain integration -- heartbeat IOException
+ * → catch block → {@code maybeRefreshScmAddress} → flag/threshold check
+ * → {@code refreshSCMServer} -- is covered by
+ * {@code TestHeartbeatEndpointTaskDnsRefresh}. Together the two tests
+ * prove the full production recovery flow without a single flaky
+ * timing-dependent assertion.
+ */
+public class TestSCMConnectionManagerDnsRefreshE2E {
+
+  private RPC.Server scmServer;
+  private SCMConnectionManager connectionManager;
+
+  @AfterEach
+  public void tearDown() throws Exception {
+    if (connectionManager != null) {
+      connectionManager.close();
+    }
+    if (scmServer != null) {
+      scmServer.stop();
+    }
+  }
+
+  @Test
+  @Timeout(value = 30, unit = java.util.concurrent.TimeUnit.SECONDS)
+  public void testRefreshSwapsEndpointAndHeartbeatSucceeds() throws Exception {
+    // Step 1: stand up a real mock SCM RPC server on a loopback address,
+    // OS-assigned port. This is the "live" SCM after the pod restart.
+    OzoneConfiguration conf = new OzoneConfiguration();
+    ScmTestMock scmServerImpl = new ScmTestMock();
+    scmServer = SCMTestUtils.startScmRpcServer(conf, scmServerImpl);
+    InetSocketAddress liveAddr = scmServer.getListenerAddress();
+    int port = liveAddr.getPort();
+
+    // The hostname we'll preserve. localhost reliably resolves to a
+    // loopback address in any test environment, and the server is
+    // bound to a loopback address, so a dial of localhost:port
+    // succeeds.
+    String hostAndPort = "localhost:" + port;

Review Comment:
   This E2E test can be flaky on systems where the SCM RPC server binds to an 
IPv6 loopback address (eg `::1`) but `localhost` re-resolves to an IPv4 
loopback (or vice versa). In that case the post-refresh heartbeat dials the 
wrong loopback interface and fails. Use the listener's actual bound host string 
for `hostAndPort` to keep the refresh target consistent with the server bind 
address.



##########
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/StateContext.java:
##########
@@ -44,6 +44,7 @@
 import java.util.Set;
 import java.util.UUID;
 import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.CopyOnWriteArraySet;

Review Comment:
   There is an unused `import java.util.HashSet;` left in this file (endpoints 
was switched to `CopyOnWriteArraySet`). Please remove the unused import to 
avoid compilation/checkstyle failures.



##########
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/common/statemachine/TestSCMConnectionManager.java:
##########
@@ -46,4 +48,178 @@ public void testRemoveSCMServerDoesNotMarkEndpointShutdown()
       Assertions.assertEquals(HEARTBEAT, endpoint.getState());
     }
   }
+
+  /**
+   * resolveLatestAddress() returns null when no preserved hostname is
+   * available -- legacy code path -- so re-resolution is a no-op for that
+   * endpoint. Operator must restart the DN to pick up a new IP.
+   */
+  @Test
+  public void testResolveLatestAddressReturnsNullWithoutHostAndPort() {
+    InetSocketAddress address = new InetSocketAddress("127.0.0.1", 9861);
+    EndpointStateMachine endpoint = new EndpointStateMachine(
+        address, /*hostAndPort=*/ null, /*endPoint=*/ null,
+        new OzoneConfiguration(), "");
+    Assertions.assertNull(endpoint.resolveLatestAddress());
+    Assertions.assertNull(endpoint.getHostAndPort());
+  }
+
+  /**
+   * When the cached IP matches what DNS currently returns for the
+   * preserved hostname, resolveLatestAddress() returns null (no swap
+   * needed). Uses "localhost" because it reliably resolves to a loopback
+   * address in any test environment.
+   */
+  @Test
+  public void testResolveLatestAddressReturnsNullWhenIpUnchanged()
+      throws Exception {
+    InetAddress loopback = InetAddress.getByName("localhost");
+    InetSocketAddress address = new InetSocketAddress(loopback, 9861);
+    EndpointStateMachine endpoint = new EndpointStateMachine(
+        address, "localhost:9861", null, new OzoneConfiguration(), "");
+    InetSocketAddress refreshed = endpoint.resolveLatestAddress();

Review Comment:
   This test can be flaky on systems where `localhost` resolves to a different 
address family (IPv4 vs IPv6) between `InetAddress.getByName("localhost")` and 
`NetUtils.createSocketAddr("localhost:9861")` (used by 
`resolveLatestAddress()`). Construct the cached address via `NetUtils` so the 
"IP unchanged" assertion is stable.



##########
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/common/statemachine/TestSCMConnectionManager.java:
##########
@@ -46,4 +48,178 @@ public void testRemoveSCMServerDoesNotMarkEndpointShutdown()
       Assertions.assertEquals(HEARTBEAT, endpoint.getState());
     }
   }
+
+  /**
+   * resolveLatestAddress() returns null when no preserved hostname is
+   * available -- legacy code path -- so re-resolution is a no-op for that
+   * endpoint. Operator must restart the DN to pick up a new IP.
+   */
+  @Test
+  public void testResolveLatestAddressReturnsNullWithoutHostAndPort() {
+    InetSocketAddress address = new InetSocketAddress("127.0.0.1", 9861);
+    EndpointStateMachine endpoint = new EndpointStateMachine(
+        address, /*hostAndPort=*/ null, /*endPoint=*/ null,
+        new OzoneConfiguration(), "");
+    Assertions.assertNull(endpoint.resolveLatestAddress());
+    Assertions.assertNull(endpoint.getHostAndPort());
+  }
+
+  /**
+   * When the cached IP matches what DNS currently returns for the
+   * preserved hostname, resolveLatestAddress() returns null (no swap
+   * needed). Uses "localhost" because it reliably resolves to a loopback
+   * address in any test environment.
+   */
+  @Test
+  public void testResolveLatestAddressReturnsNullWhenIpUnchanged()
+      throws Exception {
+    InetAddress loopback = InetAddress.getByName("localhost");
+    InetSocketAddress address = new InetSocketAddress(loopback, 9861);
+    EndpointStateMachine endpoint = new EndpointStateMachine(
+        address, "localhost:9861", null, new OzoneConfiguration(), "");
+    InetSocketAddress refreshed = endpoint.resolveLatestAddress();
+    Assertions.assertNull(refreshed,
+        "localhost re-resolves to the same loopback address; refresh "
+            + "must report no change so the endpoint is not torn down "
+            + "needlessly.");
+  }
+
+  /**
+   * When the cached IP differs from what DNS currently returns for the
+   * preserved hostname, resolveLatestAddress() returns the freshly-
+   * resolved address. Simulates the "SCM pod was rescheduled to a new
+   * IP" scenario by constructing the endpoint with a deliberately stale
+   * cached IP.
+   */
+  @Test
+  public void testResolveLatestAddressReturnsNewAddressOnIpChange()
+      throws Exception {
+    // Pretend localhost previously resolved to 127.0.0.99 (stale IP).
+    // In real Kubernetes this would be the now-defunct pod IP.
+    InetSocketAddress staleAddress = new InetSocketAddress(
+        InetAddress.getByAddress(new byte[] {127, 0, 0, 99}), 9861);
+    EndpointStateMachine endpoint = new EndpointStateMachine(
+        staleAddress, "localhost:9861", null,
+        new OzoneConfiguration(), "");
+    InetSocketAddress refreshed = endpoint.resolveLatestAddress();
+    Assertions.assertNotNull(refreshed,
+        "localhost re-resolves to loopback (typically 127.0.0.1), "
+            + "which differs from the stale 127.0.0.99 we cached; "
+            + "refresh must report the change so the endpoint can "
+            + "swap to the live address.");
+    Assertions.assertEquals(9861, refreshed.getPort());
+    Assertions.assertNotEquals(staleAddress.getAddress(),
+        refreshed.getAddress());
+  }
+
+  /**
+   * refreshSCMServer() swaps an endpoint's address atomically in the
+   * connection manager when the cached IP is stale. The replacement
+   * endpoint starts in GETVERSION state -- the version handshake must
+   * be re-run because the new SCM pod is effectively a fresh process.
+   */
+  @Test
+  public void testRefreshSCMServerSwapsEndpointOnIpChange() throws Exception {
+    try (SCMConnectionManager connectionManager =
+             new SCMConnectionManager(new OzoneConfiguration())) {
+      InetSocketAddress staleAddress = new InetSocketAddress(
+          InetAddress.getByAddress(new byte[] {127, 0, 0, 99}), 9861);
+      connectionManager.addSCMServer(staleAddress, "localhost:9861", "");
+
+      InetSocketAddress refreshed = connectionManager.refreshSCMServer(
+          staleAddress, "");
+
+      Assertions.assertNotNull(refreshed);
+      Assertions.assertEquals(1, connectionManager.getNumOfConnections());
+      EndpointStateMachine swapped =
+          connectionManager.getValues().iterator().next();
+      Assertions.assertEquals(refreshed, swapped.getAddress());
+      Assertions.assertEquals("localhost:9861", swapped.getHostAndPort());
+    }
+  }
+
+  /**
+   * Regression for the rollback gap Copilot flagged on
+   * {@code refreshSCMServer}: if building the replacement endpoint
+   * throws (transient DNS blip during proxy construction, peer not
+   * yet accepting on the new IP, NetUtils refusing the resolved
+   * address), the connection manager must NOT have already removed
+   * the stale endpoint -- otherwise the peer disappears from
+   * {@code scmMachines} and no future heartbeat has anywhere to dial.
+   * <p>
+   * The fix builds the replacement BEFORE removing the stale entry.
+   * This test injects a build that throws and asserts the stale
+   * endpoint is still registered after the call returns its
+   * IOException.
+   */
+  @Test
+  public void testRefreshSCMServerLeavesStaleEndpointOnBuildFailure()
+      throws Exception {
+    final IOException simulated =
+        new IOException("simulated transient build failure");
+    try (SCMConnectionManager connectionManager =
+        new SCMConnectionManager(new OzoneConfiguration()) {
+          private boolean firstBuildDone;
+
+          @Override
+          EndpointStateMachine buildScmEndpoint(InetSocketAddress address,
+              String hostAndPort, String threadNamePrefix)
+              throws IOException {
+            if (!firstBuildDone) {
+              firstBuildDone = true;
+              return super.buildScmEndpoint(address, hostAndPort,
+                  threadNamePrefix);
+            }
+            throw simulated;
+          }
+        }) {
+      InetSocketAddress staleAddress = new InetSocketAddress(
+          InetAddress.getByAddress(new byte[] {127, 0, 0, 99}), 9861);
+      connectionManager.addSCMServer(staleAddress, "localhost:9861", "");
+      EndpointStateMachine before =
+          connectionManager.getValues().iterator().next();
+
+      IOException thrown = Assertions.assertThrows(IOException.class,
+          () -> connectionManager.refreshSCMServer(staleAddress, ""));
+      Assertions.assertSame(simulated, thrown,
+          "the underlying build failure must be propagated, not "
+              + "swallowed by an enclosing recovery branch");
+
+      Assertions.assertEquals(1, connectionManager.getNumOfConnections(),
+          "stale endpoint must remain registered when buildScmEndpoint "
+              + "throws -- otherwise the peer disappears from the "
+              + "connection manager and no heartbeat can recover it");
+      Assertions.assertSame(before,
+          connectionManager.getValues().iterator().next(),
+          "the SAME EndpointStateMachine instance must still be "
+              + "registered (not a half-constructed replacement)");
+    }
+  }
+
+  /**
+   * refreshSCMServer() against an endpoint whose cached IP already matches
+   * DNS is a no-op -- the existing endpoint stays in place untouched. This
+   * prevents needless tearing-down of healthy connections when the
+   * heartbeat task asks to refresh after a transient blip.
+   */
+  @Test
+  public void testRefreshSCMServerNoopWhenIpUnchanged() throws Exception {
+    try (SCMConnectionManager connectionManager =
+             new SCMConnectionManager(new OzoneConfiguration())) {
+      InetAddress loopback = InetAddress.getByName("localhost");
+      InetSocketAddress address = new InetSocketAddress(loopback, 9861);
+      connectionManager.addSCMServer(address, "localhost:9861", "");

Review Comment:
   Same flakiness risk as the earlier resolveLatestAddress test: 
`InetAddress.getByName("localhost")` may pick a different IPv4/IPv6 loopback 
than `NetUtils.createSocketAddr("localhost:9861")` used by the refresh path, 
causing an unexpected swap. Build the cached address via `NetUtils` to keep 
address selection consistent.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to