sashapolo commented on code in PR #7248:
URL: https://github.com/apache/ignite-3/pull/7248#discussion_r2675350366


##########
modules/network-api/src/main/java/org/apache/ignite/internal/network/JoinedNodes.java:
##########
@@ -26,13 +26,15 @@ public interface JoinedNodes {
      * Called when the node joins logical topology.
      *
      * @param node Node.
+     * @param topologyVersion Logical topology version.
      */
-    void onJoined(InternalClusterNode node);
+    void onJoined(InternalClusterNode node, long topologyVersion);

Review Comment:
   Not related to this PR, but this class' name is awful, would be nice if we 
could rename it



##########
modules/network-api/src/main/java/org/apache/ignite/internal/network/AbstractTopologyService.java:
##########
@@ -28,6 +28,8 @@ public abstract class AbstractTopologyService implements 
TopologyService {
     /** Registered event handlers. */
     private final Collection<TopologyEventHandler> eventHandlers = new 
CopyOnWriteArrayList<>();
 
+    protected long topologyVersion;

Review Comment:
   How do you guarantee visibility of this field? All other fields in this and 
related classes allow concurrent access



##########
modules/network/src/main/java/org/apache/ignite/internal/network/recovery/message/StaleNodeHandlingParams.java:
##########
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.network.recovery.message;
+
+/**
+ * Parameters required for handling stale state of the node.
+ */
+public interface StaleNodeHandlingParams {

Review Comment:
   I personally prefer `Paratemers` instead of `Params`, but this is obviously 
a matter of taste



##########
modules/network-api/src/main/java/org/apache/ignite/internal/network/AbstractTopologyService.java:
##########
@@ -28,6 +28,8 @@ public abstract class AbstractTopologyService implements 
TopologyService {
     /** Registered event handlers. */
     private final Collection<TopologyEventHandler> eventHandlers = new 
CopyOnWriteArrayList<>();
 
+    protected long topologyVersion;

Review Comment:
   Also, I personally don't like having mutable protected fields. I would 
suggest to move this field into `ScaleCubeTopologyService`



##########
modules/network/src/main/java/org/apache/ignite/internal/network/recovery/RecoveryInitiatorHandshakeManager.java:
##########
@@ -380,6 +379,8 @@ private static boolean clusterIdMismatch(@Nullable UUID 
acceptorClusterId, @Null
     }
 
     private void handleStaleAcceptorId(HandshakeStartMessage msg) {
+        maybeFailOnStaleNodeDetection(failureProcessor, new 
StaleNodeHandlingParamsImpl(topologyService), msg);

Review Comment:
   Same here



##########
modules/network/src/main/java/org/apache/ignite/internal/network/recovery/RecoveryInitiatorHandshakeManager.java:
##########
@@ -508,6 +505,17 @@ private void handshake(RecoveryDescriptor descriptor) {
         });
     }
 
+    protected HandshakeStartResponseMessage 
createHandshakeStartResponseMessage(RecoveryDescriptor descriptor) {
+        StaleNodeHandlingParamsImpl params = new 
StaleNodeHandlingParamsImpl(topologyService);

Review Comment:
   Same question about this object



##########
modules/network/src/main/java/org/apache/ignite/internal/network/recovery/RecoveryInitiatorHandshakeManager.java:
##########
@@ -535,4 +543,5 @@ protected void finishHandshake() {
     void setRemoteNode(InternalClusterNode remoteNode) {
         this.remoteNode = remoteNode;
     }
+

Review Comment:
   ```suggestion
   ```



##########
modules/network/src/main/java/org/apache/ignite/internal/network/recovery/message/StaleNodeHandlingParams.java:
##########
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.network.recovery.message;
+
+/**
+ * Parameters required for handling stale state of the node.

Review Comment:
   ```suggestion
    * Parameters required for handling stale state of a node.
   ```



##########
modules/network/src/main/java/org/apache/ignite/internal/network/recovery/RecoveryAcceptorHandshakeManager.java:
##########
@@ -257,6 +283,8 @@ private boolean 
possiblyRejectHandshakeStartResponse(HandshakeStartResponseMessa
     }
 
     private void handleStaleInitiatorId(HandshakeStartResponseMessage msg) {
+        maybeFailOnStaleNodeDetection(failureProcessor, new 
StaleNodeHandlingParamsImpl(topologyService), msg);

Review Comment:
   Should we send the response and only then fail the node?



##########
modules/network/src/main/java/org/apache/ignite/internal/network/recovery/HandshakeManagerUtils.java:
##########
@@ -88,4 +92,19 @@ static Exception 
createExceptionFromRejectionMessage(HandshakeRejectedMessage ms
                 ? new RecipientLeftException(msg.message())
                 : new HandshakeException(msg.message());
     }
+
+    static void maybeFailOnStaleNodeDetection(
+            FailureProcessor failureProcessor,
+            StaleNodeHandlingParams local,
+            StaleNodeHandlingParams remote
+    ) {
+        long localTopologyVersion = local.topologyVersion();
+        long remoteTopologyVersion = remote.topologyVersion();
+
+        if (localTopologyVersion >= remoteTopologyVersion) {
+            return;
+        }
+
+        failureProcessor.process(new 
FailureContext(FailureType.CRITICAL_ERROR, null, "Node is segmented."));

Review Comment:
   I think we need to print more information, rather than `"Node is 
segmented."`. Let's write something `"Node is unable to join the cluster 
because its handshake has been rejected, possibly because this node suffered 
from a network segmentation, local topology version = blabla, remote topology 
version = blabla"`



##########
modules/network/src/main/java/org/apache/ignite/internal/network/recovery/HandshakeManagerUtils.java:
##########
@@ -88,4 +92,19 @@ static Exception 
createExceptionFromRejectionMessage(HandshakeRejectedMessage ms
                 ? new RecipientLeftException(msg.message())
                 : new HandshakeException(msg.message());
     }
+
+    static void maybeFailOnStaleNodeDetection(
+            FailureProcessor failureProcessor,
+            StaleNodeHandlingParams local,
+            StaleNodeHandlingParams remote
+    ) {
+        long localTopologyVersion = local.topologyVersion();
+        long remoteTopologyVersion = remote.topologyVersion();
+
+        if (localTopologyVersion >= remoteTopologyVersion) {
+            return;
+        }
+
+        failureProcessor.process(new 
FailureContext(FailureType.CRITICAL_ERROR, null, "Node is segmented."));

Review Comment:
   Not related to this PR, but rather a general question. Do we still use a 
no-op failure handler? What does the current behavior look like when a 
partition happens, does the offending node just get stuck unable to connect to 
anybody?



##########
modules/network/src/main/java/org/apache/ignite/internal/network/recovery/RecoveryAcceptorHandshakeManager.java:
##########
@@ -192,6 +206,18 @@ public void onConnectionOpen() {
         });
     }
 
+    protected HandshakeStartMessage createHandshakeStartMessage() {
+        StaleNodeHandlingParamsImpl params = new 
StaleNodeHandlingParamsImpl(topologyService);

Review Comment:
   Why do you need this object?



##########
modules/network/src/main/java/org/apache/ignite/internal/network/recovery/RecoveryAcceptorHandshakeManager.java:
##########
@@ -192,6 +206,18 @@ public void onConnectionOpen() {
         });
     }
 
+    protected HandshakeStartMessage createHandshakeStartMessage() {

Review Comment:
   ```suggestion
       private HandshakeStartMessage createHandshakeStartMessage() {
   ```



##########
modules/network/src/integrationTest/java/org/apache/ignite/internal/network/node/ItNodeStalenessAndRestartTest.java:
##########
@@ -59,13 +72,36 @@ void nodeStalenessStatusIsClearedOnRestart() throws 
Exception {
         );
     }
 
+    @Test
+    @ConfigOverride(name = "ignite.failureHandler.handler.type", value = 
"stop")
+    @MuteFailureManagerLogging
+    void staleNodeIsShutDown() throws Exception {
+        IgniteImpl ignite0 = unwrapIgniteImpl(cluster.node(0));
+
+        LogInspector logInspector = new LogInspector(
+                FailureManager.class.getName(),
+                evt -> evt.getLevel() == Level.ERROR
+                        && 
evt.getMessage().getFormattedMessage().contains(FAILURE_MESSAGE)
+                        && 
Thread.currentThread().getName().contains(cluster.nodeName(1))
+        );
+
+        logInspector.start();
+        try {
+            simulateNetworkPartition(ignite0);
+
+            await().timeout(10, SECONDS).until(logInspector::isMatched);

Review Comment:
   Is it possible to enable the node failing manager and just check that the 
target node has been stopped?



##########
modules/network/src/integrationTest/java/org/apache/ignite/internal/network/node/ItNodeStalenessAndRestartTest.java:
##########
@@ -21,17 +21,29 @@
 import static 
org.apache.ignite.internal.ConfigTemplates.FAST_FAILURE_DETECTION_NODE_BOOTSTRAP_CFG_TEMPLATE;
 import static org.apache.ignite.internal.TestWrappers.unwrapIgniteImpl;
 import static 
org.apache.ignite.internal.testframework.IgniteTestUtils.waitForCondition;
+import static org.awaitility.Awaitility.await;
 import static org.junit.jupiter.api.Assertions.assertTrue;
 
 import java.util.concurrent.CountDownLatch;
 import org.apache.ignite.internal.ClusterPerTestIntegrationTest;
+import org.apache.ignite.internal.ConfigOverride;
 import org.apache.ignite.internal.app.IgniteImpl;
+import org.apache.ignite.internal.failure.FailureManager;
 import org.apache.ignite.internal.network.InternalClusterNode;
 import org.apache.ignite.internal.network.TopologyEventHandler;
 import org.apache.ignite.internal.network.message.ScaleCubeMessage;
+import 
org.apache.ignite.internal.testframework.failure.FailureManagerExtension;
+import 
org.apache.ignite.internal.testframework.failure.MuteFailureManagerLogging;
+import org.apache.ignite.internal.testframework.log4j2.LogInspector;
+import org.apache.logging.log4j.Level;
 import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.extension.ExtendWith;
 
+@ExtendWith(FailureManagerExtension.class)
 class ItNodeStalenessAndRestartTest extends ClusterPerTestIntegrationTest {
+
+    private static final String FAILURE_MESSAGE = "Ignite node is in invalid 
state due to a critical failure.";

Review Comment:
   It would be nice to check for a particular exception message, not just a 
general one from the failure manager



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to