[jira] [Updated] (IGNITE-28751) Refactor TCP Discovery SPI joining node validation

Mikhail Petrov (Jira) Mon, 08 Jun 2026 01:21:07 -0700


     [ 
https://issues.apache.org/jira/browse/IGNITE-28751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Mikhail Petrov updated IGNITE-28751:
------------------------------------
    Description: 
Motivation.

The following problems arise while implementing the rolling upgrade (RU) 
mechanism:
1. We need to validate joining nodes and ensure that the cluster never contains 
nodes of more than two different product versions. 
2. We need to ensure that all nodes in the cluster run the same version when 
the RU process is about to complete.
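
The two checks above can be sketched roughly as follows. This is only an 
illustration: RuVersionCheck, canJoin and canFinish are hypothetical names, not 
existing Ignite code, and product versions are modeled as plain strings.

```java
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper, not part of Ignite. Versions are plain strings here. */
final class RuVersionCheck {
    /** Maximum number of distinct product versions allowed while RU is in progress. */
    static final int MAX_VERSIONS_DURING_RU = 2;

    private RuVersionCheck() {
        // Static helper, no instances.
    }

    /** Returns true if admitting a node of joiningVer keeps the cluster within the limit. */
    static boolean canJoin(Collection<String> clusterVers, String joiningVer) {
        Set<String> distinct = new TreeSet<>(clusterVers);

        distinct.add(joiningVer);

        return distinct.size() <= MAX_VERSIONS_DURING_RU;
    }

    /** Returns true if RU may complete, that is, every node runs the same version. */
    static boolean canFinish(Collection<String> clusterVers) {
        return new TreeSet<>(clusterVers).size() == 1;
    }
}
```

Note that clusterVers must include nodes that are still joining, not only the 
nodes already visible in the topology; otherwise the limit can be exceeded.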

Currently, join validation logic can be implemented in the 
GridComponent#validateNode() method, which is called for all joining nodes on 
the coordinator.

However, the node join process in the TCP Discovery SPI consists of three 
phases:
1. node validation (see RingMessageWorker#processJoinRequestMessage)
2. node join process (data exchange between the joining node and the cluster) 
(see RingMessageWorker#processNodeAddedMessage / processNodeAddFinishedMessage)
3. node join completion (the node is added to the discovery cache and becomes 
visible to all Ignite components) (see DiscoverySpiListener#onDiscovery)

*First problem* 
While the joining node is in phase 2, the RU processor cannot see it via 
GridDiscoveryManager#remoteNodes or similar methods and therefore cannot 
properly check the current topology.

*It is proposed* to introduce a new method in GridDiscoveryManager that returns 
remote nodes by directly querying the underlying Discovery SPI. In fact, all 
the necessary mechanisms for this already exist; they are simply not used (see 
DiscoverySpi#getRemoteNodes).
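
A minimal sketch of the idea, under stated assumptions: SpiNodeSource and 
DiscoveryManagerSketch are hypothetical stand-ins (not Ignite classes), node 
IDs are plain strings, and the SPI view is assumed to already contain nodes 
that are in phase 2, as the description above implies.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/** Hypothetical stand-in for the SPI; it only mirrors the idea of DiscoverySpi#getRemoteNodes. */
interface SpiNodeSource {
    Collection<String> getRemoteNodes();
}

/** Hypothetical, simplified discovery manager. */
final class DiscoveryManagerSketch {
    private final SpiNodeSource spi;

    /** Models the discovery cache: only nodes whose join has completed (phase 3). */
    private final List<String> discoCache = new ArrayList<>();

    DiscoveryManagerSketch(SpiNodeSource spi) {
        this.spi = spi;
    }

    /** Called when a node join completes and the node becomes visible to components. */
    void onJoinCompleted(String nodeId) {
        discoCache.add(nodeId);
    }

    /** Existing-style view: backed by the discovery cache. */
    Collection<String> remoteNodes() {
        return List.copyOf(discoCache);
    }

    /** Proposed-style view: delegates to the SPI, so it also reflects nodes still joining. */
    Collection<String> remoteNodesFromSpi() {
        return List.copyOf(spi.getRemoteNodes());
    }
}
```

With such a method the RU processor could validate a joining node against the 
SPI view rather than against the discovery cache snapshot.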

*Second problem*
To properly validate joining nodes, the RU processor must track nodes that have 
passed the RU processor's validation but are still in phase 1 (that is, the 
node is still being validated by other Ignite components).

Currently, a joining node can be forced to leave the cluster after being 
validated by the RU processor. Ignite components are not notified of this (see 
RingMessageWorker#nodeCheckError). As a result, the RU processor:
1. validates the new node
2. caches it as a node about to join (so that it is taken into account when 
validating subsequent joining nodes)
3. cannot determine whether the joining node is still in the process of joining 
or has already been forced out of the cluster

*It is proposed* to raise the existing IgniteNodeValidationFailedEvent in all 
cases where a joining node is forced to leave the cluster due to a validation 
error.
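
How the RU processor might use such a notification can be sketched as follows. 
PendingJoinTracker and its callbacks are hypothetical and are not wired to any 
real Ignite event API; the sketch only shows that a validation-failed 
notification lets the processor drop a cached node instead of keeping it 
forever.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical tracker of nodes that passed RU validation but have not joined yet. */
final class PendingJoinTracker {
    private final Set<UUID> pending = ConcurrentHashMap.newKeySet();

    /** The RU processor validated the node: remember it as "about to join". */
    void onValidated(UUID nodeId) {
        pending.add(nodeId);
    }

    /** The join completed: the node is now part of the regular topology. */
    void onJoined(UUID nodeId) {
        pending.remove(nodeId);
    }

    /** A validation-failed notification arrived: the node was forced to leave. */
    void onValidationFailed(UUID nodeId) {
        pending.remove(nodeId);
    }

    /** Returns true while the node is validated but neither joined nor rejected. */
    boolean isPending(UUID nodeId) {
        return pending.contains(nodeId);
    }

    /** Number of nodes to take into account when validating the next joining node. */
    int pendingCount() {
        return pending.size();
    }
}
```

Without the failure callback, a node rejected by another component would stay 
in the pending set and distort the validation of subsequent joining nodes.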

  was:
Motivation.

The following problems arise during the implementation of the RU mechanism:
1. We need to validate joining nodes and ensure that the cluster does not 
contain nodes with more than two different product versions. 
2. We need to ensure that the cluster contains nodes with only one version when 
the RU process is about to complete.

Currently, join validation logic can be implemented in the 
GridComponent#validateNode() method, which is called for all joining nodes on 
the coordinator.

However, the node join process in the TCP Discovery SPI consists of three 
phases:
1. node validation (see RingMessageWorker#processJoinRequestMessage)
2. node join process (data exchange between the joining node and the cluster) 
(see RingMessageWorker#processNodeAddedMessage / processNodeAddFinishedMessage)
3. node join completion (the node is added to the discovery cache and becomes 
visible to all Ignite components) (see DiscoverySpiListener#onDiscovery)

*First problem* 
While the joining node is in phase 2, the RU processor does not observe it via 
the GridDiscoveryManager#remoteNodes or similar methods and cannot properly 
check the current topology.

It is *proposed* to introduce a new method in DiscoveryManager that returns 
remote nodes by directly querying the underlying Discovery SPI. In fact, we 
have all the necessary mechanisms for this—they are simply not used (see 
DiscoverySpi#getRemoteNodes).

*Second Problem*
To properly validate joining nodes, the RU processor must track nodes that have 
passed the RU processor's validation but are still in Phase 1 (the node is 
validated by another Ignite component).

Currently, a joining node can be forced to leave the cluster after being 
validated by the RU processor. Ignite components are not notified of this (see 
RingMessageWorker#nodeCheckError). As a result, the RU processor
1. validates the new node
2. caches it as a node about to join (so that it is taken into account when 
validating subsequent joining nodes)
3. cannot determine whether the joining node is still in the process of joining 
or has just been kicked out from the cluster

*It is proposed* to rise the existing IgniteNodeValidationFailedEvent in all 
cases where a joining node is forced to leave the cluster due to a validation 
error.

> Refactor TCP Discovery SPI joining node validation
> --------------------------------------------------
>
>                 Key: IGNITE-28751
>                 URL: https://issues.apache.org/jira/browse/IGNITE-28751
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Mikhail Petrov
>            Assignee: Mikhail Petrov
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Motivation.
> The following problems arise during the implementation of the RU mechanism:
> 1. We need to validate joining nodes and ensure that the cluster does not 
> contain nodes with more than two different product versions. 
> 2. We need to ensure that the cluster contains nodes with only one version 
> when the RU process is about to complete.
> Currently, join validation logic can be implemented in the 
> GridComponent#validateNode() method, which is called for all joining nodes on 
> the coordinator.
> However, the node join process in the TCP Discovery SPI consists of three 
> phases:
> 1. node validation (see RingMessageWorker#processJoinRequestMessage)
> 2. node join process (data exchange between the joining node and the cluster) 
> (see RingMessageWorker#processNodeAddedMessage / 
> processNodeAddFinishedMessage)
> 3. node join completion (the node is added to the discovery cache and becomes 
> visible to all Ignite components) (see DiscoverySpiListener#onDiscovery)
> *First problem* 
> While the joining node is in phase 2, the RU processor does not observe it 
> via the GridDiscoveryManager#remoteNodes or similar methods and cannot 
> properly check the current topology.
> *It is proposed* to introduce a new method in GridDiscoveryManager that 
> returns remote nodes by directly querying the underlying Discovery SPI. In 
> fact, all the necessary mechanisms for this already exist; they are simply 
> not used (see DiscoverySpi#getRemoteNodes).
> *Second problem*
> To properly validate joining nodes, the RU processor must track nodes that 
> have passed the RU processor's validation but are still in phase 1 (that is, 
> the node is still being validated by other Ignite components).
> Currently, a joining node can be forced to leave the cluster after being 
> validated by the RU processor. Ignite components are not notified of this 
> (see RingMessageWorker#nodeCheckError). As a result, the RU processor
> 1. validates the new node
> 2. caches it as a node about to join (so that it is taken into account when 
> validating subsequent joining nodes)
> 3. cannot determine whether the joining node is still in the process of 
> joining or has just been kicked out from the cluster
> *It is proposed* to raise the existing IgniteNodeValidationFailedEvent in all 
> cases where a joining node is forced to leave the cluster due to a validation 
> error.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
