[ https://issues.apache.org/jira/browse/IGNITE-18685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Roman Puchkovskiy updated IGNITE-18685:
---------------------------------------
    Description: 
As per IGNITE-18630, a node excluded from Logical Topology (LT) must be excluded from Physical Topology (PT).

The following scenario is possible:
 # A node is a part of both PT and LT
 # Its network cable gets unplugged, but the node keeps being alive
 # After proper timeouts, the cluster removes the node from LT (and, hence, PT)
 # The network cable gets plugged again, so the node attempts to enter the PT with the same old ID (aka Launch ID)

In such a situation, the node must be refused entry, namely, a connection must be terminated on a handshake attempt. This has to be done both in {{RecoveryServerHandshakeManager}} and {{{}RecoveryClientHandshakeManager{}}}.

When a node is refused a connection attempt, the refusing node must first send an explaining message (like 'your ID is stale') and then close the physical connection.

The refused node must take measures to refresh its identity (like initiating a critical failure using a Failure Handler).

A subtle thing is how we persist the fact that some node ID is stale. For starters, we could make this information volatile (only keep it in memory), but later we could record this information using CMG.

{*}PS{*}. Please do not confuse this issue with IGNITE-18712 which is a new attempt to solve the same problem. Current issue is stale and is planned to be closed without a fix soon.

  was:
As per IGNITE-18630, a node excluded from Logical Topology (LT) must be excluded from Physical Topology (PT).

The following scenario is possible:
 # A node is a part of both PT and LT
 # Its network cable gets unplugged, but the node keeps being alive
 # After proper timeouts, the cluster removes the node from LT (and, hence, PT)
 # The network cable gets plugged again, so the node attempts to enter the PT with the same old ID (aka Launch ID)

In such a situation, the node must be refused entry, namely, a connection must be terminated on a handshake attempt.
This has to be done both in {{RecoveryServerHandshakeManager}} and {{{}RecoveryClientHandshakeManager{}}}.

When a node is refused a connection attempt, the refusing node must first send an explaining message (like 'your ID is stale') and then close the physical connection.

The refused node must take measures to refresh its identity (like initiating a critical failure using a Failure Handler).

A subtle thing is how we persist the fact that some node ID is stale. For starters, we could make this information volatile (only keep it in memory), but later we could record this information using CMG.

Please do not confuse this issue with IGNITE-18712 which is a new attempt to solve the same problem. Current issue is stale and is planned to be closed without a fix soon.


> Do not allow a node excluded from Logical Topology to enter Physical Topology
> again
> -----------------------------------------------------------------------------------
>
>                 Key: IGNITE-18685
>                 URL: https://issues.apache.org/jira/browse/IGNITE-18685
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Roman Puchkovskiy
>            Assignee: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>
> As per IGNITE-18630, a node excluded from Logical Topology (LT) must be
> excluded from Physical Topology (PT).
> The following scenario is possible:
> # A node is a part of both PT and LT
> # Its network cable gets unplugged, but the node keeps being alive
> # After proper timeouts, the cluster removes the node from LT (and, hence,
> PT)
> # The network cable gets plugged again, so the node attempts to enter the PT
> with the same old ID (aka Launch ID)
> In such a situation, the node must be refused entry, namely, a connection
> must be terminated on a handshake attempt. This has to be done both in
> {{RecoveryServerHandshakeManager}} and {{{}RecoveryClientHandshakeManager{}}}.
> When a node is refused a connection attempt, the refusing node must first
> send an explaining message (like 'your ID is stale') and then close the
> physical connection.
> The refused node must take measures to refresh its identity (like initiating
> a critical failure using a Failure Handler).
> A subtle thing is how we persist the fact that some node ID is stale. For
> starters, we could make this information volatile (only keep it in memory),
> but later we could record this information using CMG.
> {*}PS{*}. Please do not confuse this issue with IGNITE-18712 which is a new
> attempt to solve the same problem. Current issue is stale and is planned to
> be closed without a fix soon.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
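
The refusal flow described in the issue (remember stale launch IDs in memory, check the remote ID during the handshake, answer with an explaining message before closing) could be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names ({{StaleLaunchIdSketch}}, {{StaleIds}}, {{rejectionReason}}) and the use of a plain String for the launch ID are hypothetical and are not the actual Ignite API; the real check would live inside the handshake managers named above.

```java
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class StaleLaunchIdSketch {
    /**
     * Hypothetical volatile registry: launch IDs of nodes removed from the Logical Topology.
     * Kept only in memory, so it is lost on restart (the issue suggests CMG for persistence later).
     */
    static final class StaleIds {
        private final Set<String> stale = ConcurrentHashMap.newKeySet();

        void markAsStale(String launchId) {
            stale.add(launchId);
        }

        boolean isStale(String launchId) {
            return stale.contains(launchId);
        }
    }

    /**
     * Hypothetical handshake-time check. Returns the explaining message to send
     * before closing the connection, or empty if the handshake may proceed.
     */
    static Optional<String> rejectionReason(StaleIds staleIds, String remoteLaunchId) {
        if (staleIds.isStale(remoteLaunchId)) {
            return Optional.of("Your ID is stale: " + remoteLaunchId);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        StaleIds staleIds = new StaleIds();
        String launchId = "node-1-launch-id"; // assumed ID value, for illustration only

        // While the node is still in the topology, the handshake is allowed.
        System.out.println(rejectionReason(staleIds, launchId).isPresent());

        // The cluster removed the node from LT, so its launch ID becomes stale.
        staleIds.markAsStale(launchId);

        // A reconnect with the same launch ID is now refused with an explanation.
        System.out.println(rejectionReason(staleIds, launchId).orElse("accepted"));
    }
}
```

In this sketch the same check would be called from both the server-side and the client-side handshake path; on receiving the rejection message, the refused node would be expected to react (for example, by triggering its failure handling), which is not modelled here.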