Mikhail Petrov updated IGNITE-28751:
------------------------------------
Description:
Motivation.
The following problems arise during the implementation of the RU mechanism:
1. We need to validate joining nodes and ensure that the cluster never contains
nodes of more than two different product versions.
2. We need to ensure that the cluster contains nodes with only one version when
the RU process is about to complete.
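The two requirements above can be sketched as simple checks over the set of
product versions present in the cluster. This is only an illustration: the
class RuVersionRules, its method names, and the use of plain strings for
versions are assumptions and are not part of the Ignite API.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of the Ignite API. */
final class RuVersionRules {
    /** Maximum number of distinct product versions allowed while RU is in progress. */
    static final int MAX_VERSIONS = 2;

    private RuVersionRules() {
        // Static helper, no instances.
    }

    /** Returns true if admitting a node with joiningVer keeps the cluster at two versions or fewer. */
    static boolean canJoin(Collection<String> clusterVers, String joiningVer) {
        Set<String> distinct = new HashSet<>(clusterVers);

        distinct.add(joiningVer);

        return distinct.size() <= MAX_VERSIONS;
    }

    /** Returns true if RU may complete, i.e. all nodes report exactly one version. */
    static boolean canFinish(Collection<String> clusterVers) {
        return new HashSet<>(clusterVers).size() == 1;
    }
}
```

Both checks are only as good as the list of versions passed in, which is why
the topology view discussed below matters.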
Currently, join validation logic can be implemented in the
GridComponent#validateNode() method, which is called for all joining nodes on
the coordinator.
However, the node join process in the TCP Discovery SPI consists of three
phases:
1. node validation (see RingMessageWorker#processJoinRequestMessage)
2. node join process (data exchange between the joining node and the cluster)
(see RingMessageWorker#processNodeAddedMessage / processNodeAddFinishedMessage)
3. node join completion (the node is added to the discovery cache and becomes
visible to all Ignite components) (see DiscoverySpiListener#onDiscovery)
*First problem*
While the joining node is in phase 2, the RU processor cannot see it via
GridDiscoveryManager#remoteNodes or similar methods and therefore cannot
properly check the current topology.
*It is proposed* to introduce a new method in DiscoveryManager that returns
remote nodes by directly querying the underlying Discovery SPI. In fact, all
the necessary mechanisms for this already exist; they are simply not used (see
DiscoverySpi#getRemoteNodes).
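A toy model of the difference between the two views is sketched below. The
types SpiView and DiscoveryManagerSketch and the method remoteNodesFromSpi are
hypothetical, nodes are represented by plain strings, and the sketch assumes
(as the description above suggests) that the SPI already reports nodes whose
join has not yet completed.

```java
import java.util.Collection;
import java.util.List;

/** Simplified stand-in for the Discovery SPI; hypothetical, nodes are plain strings here. */
interface SpiView {
    /** Assumed to include nodes that are still in phase 2 of the join. */
    Collection<String> getRemoteNodes();
}

/** Hypothetical wrapper that exposes both views of the topology. */
final class DiscoveryManagerSketch {
    private final SpiView spi;

    /** Snapshot of nodes whose join has completed (phase 3). */
    private final Collection<String> discoCache;

    DiscoveryManagerSketch(SpiView spi, Collection<String> discoCache) {
        this.spi = spi;
        this.discoCache = discoCache;
    }

    /** Cache-based view: only nodes that are already visible to all components. */
    Collection<String> remoteNodes() {
        return List.copyOf(discoCache);
    }

    /** Proposed-style view: delegates to the SPI instead of the discovery cache. */
    Collection<String> remoteNodesFromSpi() {
        return List.copyOf(spi.getRemoteNodes());
    }
}
```

With a node that is still joining, remoteNodes() would miss it while
remoteNodesFromSpi() would report it, which is the gap the proposal aims to
close.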
*Second problem*
To properly validate joining nodes, the RU processor must track nodes that have
passed the RU processor's validation but are still in phase 1 (that is, the
node is still being validated by other Ignite components).
Currently, a joining node can be forced to leave the cluster after being
validated by the RU processor. Ignite components are not notified of this (see
RingMessageWorker#nodeCheckError). As a result, the RU processor:
1. validates the new node
2. caches it as a node about to join (so that it is taken into account when
validating subsequent joining nodes)
3. cannot determine whether the joining node is still in the process of joining
or has just been kicked out of the cluster
*It is proposed* to raise the existing IgniteNodeValidationFailedEvent in all
cases where a joining node is forced to leave the cluster due to a validation
error.
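The bookkeeping described for the second problem could look roughly like the
sketch below. PendingJoinTracker and all of its methods are hypothetical names
introduced for illustration; the callback onValidationFailed stands for
whatever handler would react to the validation-failed event, and versions are
plain strings.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical tracker for illustration only; not part of the Ignite API. */
final class PendingJoinTracker {
    /** Nodes that passed RU validation but have not finished joining: node ID to version. */
    private final Map<UUID, String> pending = new ConcurrentHashMap<>();

    /** Called after the RU processor has validated a joining node. */
    void onValidated(UUID nodeId, String ver) {
        pending.put(nodeId, ver);
    }

    /** Called when the node join completes and the node becomes visible in the topology. */
    void onJoined(UUID nodeId) {
        pending.remove(nodeId);
    }

    /** Called when a validation-failed event reports that the node was rejected. */
    void onValidationFailed(UUID nodeId) {
        pending.remove(nodeId);
    }

    /** Versions to consider when validating the next joining node: topology plus pending joins. */
    Set<String> effectiveVersions(Collection<String> topVers) {
        Set<String> res = new HashSet<>(topVers);

        res.addAll(pending.values());

        return res;
    }
}
```

Without the failure callback, an entry for a rejected node would stay in the
pending map indefinitely and could cause later joins to be rejected wrongly.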
> Refactor TCP Discovery SPI joining node validation
> --------------------------------------------------
>
> Key: IGNITE-28751
> URL: https://issues.apache.org/jira/browse/IGNITE-28751
> Project: Ignite
> Issue Type: Task
> Reporter: Mikhail Petrov
> Assignee: Mikhail Petrov
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h