[jira] [Updated] (IGNITE-18685) Do not allow a node excluded from Logical Topology to enter Physical Topology again

Roman Puchkovskiy (Jira) Mon, 06 Feb 2023 00:00:04 -0800


     [ 
https://issues.apache.org/jira/browse/IGNITE-18685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Roman Puchkovskiy updated IGNITE-18685:
---------------------------------------
    Description: 
As per IGNITE-18630, a node excluded from Logical Topology (LT) must be 
excluded from Physical Topology (PT).

The following scenario is possible:
 # A node is a part of both PT and LT
 # Its network cable gets unplugged, but the node stays alive
 # After the relevant timeouts expire, the cluster removes the node from LT (and, hence, PT)
 # The network cable gets plugged again, so the node attempts to enter the PT 
with the same old ID (aka Launch ID)

In such a situation, the node must be refused entry: the connection must 
be terminated during the handshake attempt. This has to be done in both 
{{RecoveryServerHandshakeManager}} and {{RecoveryClientHandshakeManager}}.

When a connection attempt is refused, the refusing node must first send 
an explanatory message (such as 'your ID is stale') and then close the physical 
connection.
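
The refusal order described above (message first, then close) could be sketched 
roughly as follows. This is only an illustration under stated assumptions: 
{{Connection}}, {{RecordingConnection}}, {{onHandshake}} and the message text are 
hypothetical names invented for this sketch, not the actual API of 
{{RecoveryServerHandshakeManager}} or {{RecoveryClientHandshakeManager}}.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; none of these types are real Ignite classes. */
final class StaleIdHandshakeSketch {
    /** Minimal stand-in for a physical connection (assumed, not an Ignite type). */
    interface Connection {
        void send(String message);

        void close();
    }

    /** Test double that records calls so the order of operations is visible. */
    static final class RecordingConnection implements Connection {
        final List<String> events = new ArrayList<>();

        @Override
        public void send(String message) {
            events.add("send:" + message);
        }

        @Override
        public void close() {
            events.add("close");
        }
    }

    /**
     * Handles a handshake attempt from a node with the given launch ID.
     * Returns true if the handshake may proceed, false if it was refused.
     */
    static boolean onHandshake(String launchId, Set<String> staleLaunchIds, Connection conn) {
        if (staleLaunchIds.contains(launchId)) {
            // First explain the refusal, so the peer can react (e.g. refresh its identity)...
            conn.send("REJECTED: your ID is stale: " + launchId);
            // ...and only then drop the physical connection.
            conn.close();
            return false;
        }
        return true;
    }
}
```

Sending the explanation before closing matters because the refused node has no 
other way to distinguish a stale-ID refusal from an ordinary network failure.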

The refused node must take measures to refresh its identity (for example, by 
initiating a critical failure using a Failure Handler).

A subtle point is how to persist the fact that a node ID is stale. To start 
with, this information could be volatile (kept only in memory); 
later, it could be recorded using CMG.
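
The volatile variant mentioned above might look like the following minimal sketch. 
The class and method names ({{VolatileStaleIds}}, {{markStale}}, {{isStale}}) are 
assumptions made for illustration and do not exist in Ignite; the set lives only 
in memory, so its contents are lost when the refusing node restarts.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory registry of stale launch IDs; not an Ignite class. */
final class VolatileStaleIds {
    /** Thread-safe set, since handshakes may arrive on several network threads. */
    private final Set<String> staleLaunchIds = ConcurrentHashMap.newKeySet();

    /** Records a launch ID as stale. Returns true if it was not recorded before. */
    boolean markStale(String launchId) {
        return staleLaunchIds.add(launchId);
    }

    /** Tells whether a launch ID has been recorded as stale on this node. */
    boolean isStale(String launchId) {
        return staleLaunchIds.contains(launchId);
    }
}
```

A node would call {{markStale}} when the cluster removes a peer from LT and consult 
{{isStale}} on each handshake attempt. Recording the same facts through CMG later 
would remove the restart gap.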

*PS*. Please do not confuse this issue with IGNITE-18712, which is a new 
attempt to solve the same problem. The current issue is stale and will soon be 
closed without a fix.

  was:
As per IGNITE-18630, a node excluded from Logical Topology (LT) must be 
excluded from Physical Topology (PT).

The following scenario is possible:
 # A node is a part of both PT and LT
 # Its network cable gets unplugged, but the node stays alive
 # After the relevant timeouts expire, the cluster removes the node from LT (and, hence, PT)
 # The network cable gets plugged again, so the node attempts to enter the PT 
with the same old ID (aka Launch ID)

In such a situation, the node must be refused entry: the connection must 
be terminated during the handshake attempt. This has to be done in both 
{{RecoveryServerHandshakeManager}} and {{RecoveryClientHandshakeManager}}.

When a connection attempt is refused, the refusing node must first send 
an explanatory message (such as 'your ID is stale') and then close the physical 
connection.

The refused node must take measures to refresh its identity (like initiating a 
critical failure using a Failure Handler).

A subtle point is how to persist the fact that a node ID is stale. To start 
with, this information could be volatile (kept only in memory); 
later, it could be recorded using CMG.

Please do not confuse this issue with IGNITE-18712, which is a new attempt to 
solve the same problem. The current issue is stale and will soon be closed 
without a fix.


> Do not allow a node excluded from Logical Topology to enter Physical Topology 
> again
> -----------------------------------------------------------------------------------
>
>                 Key: IGNITE-18685
>                 URL: https://issues.apache.org/jira/browse/IGNITE-18685
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Roman Puchkovskiy
>            Assignee: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>
> As per IGNITE-18630, a node excluded from Logical Topology (LT) must be 
> excluded from Physical Topology (PT).
> The following scenario is possible:
>  # A node is a part of both PT and LT
>  # Its network cable gets unplugged, but the node stays alive
>  # After the relevant timeouts expire, the cluster removes the node from LT 
> (and, hence, PT)
>  # The network cable gets plugged again, so the node attempts to enter the PT 
> with the same old ID (aka Launch ID)
> In such a situation, the node must be refused entry: the connection must 
> be terminated during the handshake attempt. This has to be done in both 
> {{RecoveryServerHandshakeManager}} and {{RecoveryClientHandshakeManager}}.
> When a connection attempt is refused, the refusing node must first 
> send an explanatory message (such as 'your ID is stale') and then close the 
> physical connection.
> The refused node must take measures to refresh its identity (for example, by 
> initiating a critical failure using a Failure Handler).
> A subtle point is how to persist the fact that a node ID is stale. To start 
> with, this information could be volatile (kept only in memory); 
> later, it could be recorded using CMG.
> *PS*. Please do not confuse this issue with IGNITE-18712, which is a new 
> attempt to solve the same problem. The current issue is stale and will soon 
> be closed without a fix.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
