[
https://issues.apache.org/jira/browse/IGNITE-18630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Roman Puchkovskiy updated IGNITE-18630:
---------------------------------------
Summary: Try to deliver a message until receiver drops out from logical
topology (was: Try to deliver a message until node drops out from logical
topology)
> Try to deliver a message until receiver drops out from logical topology
> -----------------------------------------------------------------------
>
> Key: IGNITE-18630
> URL: https://issues.apache.org/jira/browse/IGNITE-18630
> Project: Ignite
> Issue Type: Improvement
> Components: networking
> Reporter: Roman Puchkovskiy
> Assignee: Roman Puchkovskiy
> Priority: Major
> Labels: ignite-3
> Fix For: 3.0.0-beta2
>
>
> Currently, there are two topologies: physical (bound 1:1 to ScaleCube events)
> and logical. When a node appears in the physical topology (PT), validation
> starts; if it succeeds, the node is added to the logical topology (LT).
> Dropping from the PT immediately removes a node from the LT.
> We use the PT as the set of nodes to which the current node can send
> messages. This means that if ScaleCube loses sight of a node due to a
> transient glitch (caused by a GC pause, for example) and the node then
> becomes visible again, we still remove the node from the PT, making it
> impossible to deliver a message to it. Transient network glitches therefore
> harm the reliability of messaging.
> The suggestion is to switch to the following:
> # We decouple the ScaleCube topology from the PT, so we now have 3 topologies:
> the ScaleCube topology, tracked via ScaleCube events (nodes that our node
> considers alive from the point of view of the SWIM protocol); the physical
> topology (nodes that we consider reachable and to which we can send
> messages); and the logical topology (nodes that passed validation and joined
> the cluster)
> # A node enters PT when it appears in the ScaleCube topology (ST), but it
> leaves the PT when it leaves the LT
> # Logical topology 'leave' events will be triggered by ST leave events, but
> with a delay, so that if a node returns to the ST with the same ScaleCube ID,
> the LT leave event is not fired
> Summing up:
> # When a node appears in ST, it appears in PT
> # When it appears in PT, validation process starts (which might lead to
> adding the node to LT)
> # When a node leaves ST, a delayed removal from LT is scheduled. It is
> cancelled if the node appears in ST again
> # When a node leaves LT, it leaves PT (making it impossible to send a
> message to it)
> # When doing a graceful shutdown, a node should send a 'graceful LT leave'
> message so that other nodes drop it from their LT and PT immediately, without
> waiting for the timeout defined in item 3.
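
The five rules summed up above can be sketched as a small state machine. This is only an illustration under stated assumptions, not Ignite code: the class name TopologyTrackerSketch, its method names, and the use of plain String IDs in place of ScaleCube IDs are all hypothetical. It also assumes that a node whose delayed removal expires leaves the PT even if it never passed validation, a case the description does not cover.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch, not Ignite code. String IDs stand in for ScaleCube IDs. */
class TopologyTrackerSketch {
    private final Set<String> st = new HashSet<>(); // ScaleCube topology
    private final Set<String> pt = new HashSet<>(); // physical topology
    private final Set<String> lt = new HashSet<>(); // logical topology
    // One token per scheduled LT removal; dropping the token cancels the removal.
    private final Map<String, Object> pendingRemovals = new HashMap<>();
    private final long leaveDelayMillis;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread thread = new Thread(r, "lt-leave-timer");
                thread.setDaemon(true); // the timer must not keep the JVM alive
                return thread;
            });

    TopologyTrackerSketch(long leaveDelayMillis) {
        this.leaveDelayMillis = leaveDelayMillis;
    }

    /** Items 1 and 3: a node seen in ST enters PT; a pending removal is cancelled. */
    synchronized void onStAppeared(String id) {
        st.add(id);
        pendingRemovals.remove(id);
        pt.add(id); // item 2: validation would be started here
    }

    /** Item 2: successful validation adds a reachable node to LT. */
    synchronized void onValidated(String id) {
        if (pt.contains(id)) {
            lt.add(id);
        }
    }

    /** Item 3: leaving ST only schedules a delayed removal from LT. */
    synchronized void onStLeft(String id) {
        st.remove(id);
        Object token = new Object();
        pendingRemovals.put(id, token); // supersedes any older pending removal
        scheduler.schedule(() -> expire(id, token), leaveDelayMillis, TimeUnit.MILLISECONDS);
    }

    /** Item 5: a graceful leave skips the delay. */
    synchronized void onGracefulLeave(String id) {
        pendingRemovals.remove(id);
        leaveLogical(id);
    }

    /** Runs after the delay; acts only if this exact removal is still pending. */
    private synchronized void expire(String id, Object token) {
        if (pendingRemovals.remove(id, token)) {
            leaveLogical(id);
        }
    }

    /** Item 4: leaving LT also means leaving PT. Callers hold the lock. */
    private void leaveLogical(String id) {
        lt.remove(id);
        pt.remove(id);
    }

    synchronized boolean canSendTo(String id) {
        return pt.contains(id);
    }

    synchronized boolean inLogicalTopology(String id) {
        return lt.contains(id);
    }
}
```

With this sketch, calling onStLeft("n1") followed shortly by onStAppeared("n1") leaves the node in both PT and LT, which is the transient-glitch case the proposal aims to survive. Cancellation is done by discarding a per-removal token rather than cancelling the scheduled task, so a timer that fires late cannot remove a node that has already come back or has left again with a fresh delay.
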
> As LT events are distributed using Raft, if a node loses the ability to
> connect to the CMG leader, it will never drop other nodes from its PT, so it
> will keep trying to deliver messages indefinitely. This seems OK.
> One thing that should be considered is that {{TopologyService}} (for PT) and
> {{LogicalTopologyService}} are defined in different modules, which might
> cause difficulties when subscribing to each other's events.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)