[ 
https://issues.apache.org/jira/browse/IGNITE-18630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Puchkovskiy updated IGNITE-18630:
---------------------------------------
    Summary: Try to deliver a message until receiver drops out from logical 
topology  (was: Try to deliver a message until node drops out from logical 
topology)

> Try to deliver a message until receiver drops out from logical topology
> -----------------------------------------------------------------------
>
>                 Key: IGNITE-18630
>                 URL: https://issues.apache.org/jira/browse/IGNITE-18630
>             Project: Ignite
>          Issue Type: Improvement
>          Components: networking
>            Reporter: Roman Puchkovskiy
>            Assignee: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>
> Currently, there are two topologies: physical (bound to ScaleCube events 1:1) 
> and logical. Appearing in the physical topology (PT) starts validation which 
> (if successful) ends with addition to the logical topology (LT); dropping 
> from the PT immediately removes a node from the LT.
> We use PT as a set of nodes to which the current node can send messages. This 
> means that if ScaleCube loses a node from sight due to a transient glitch 
> (caused by a GC pause, for example), after which a node becomes visible 
> again, we still remove the node from the PT, making it impossible to deliver 
> a message to it; so transient network glitches harm the reliability of 
> messaging.
> The suggestion is to switch to the following:
>  # We decouple ScaleCube topology from the PT, so we now have 3 topologies: 
> ScaleCube topology (tracked via ScaleCube events) (these are nodes that are 
> thought to be alive by our node from the point of view of SWIM protocol), 
> physical topology (nodes which we consider as reachable and to which we can 
> send messages) and logical topology (nodes that passed validation and joined 
> the cluster)
>  # A node enters PT when it appears in the ScaleCube topology (ST), but it 
> leaves the PT when it leaves the LT
>  # Logical topology 'leave' events will be triggered by ST leave events, but 
> with a delay, so that if a node returns to the ST with the same ScaleCube ID, 
> the LT leave event is not fired
> Summing up:
>  # When a node appears in ST, it appears in PT
>  # When it appears in PT, validation process starts (which might lead to 
> adding the node to LT)
>  # When a node leaves ST, a delayed removal from LT is scheduled. It is 
> cancelled if the node appears in ST again
>  # When a node leaves LT, it leaves PT (making it impossible to send a 
> message to it)
>  # When doing a graceful shutdown, a node should send a 'graceful LT leave' 
> message so that other nodes drop it from their LT and PT immediately, without 
> the timeout defined in item 3.
> As LT events are distributed using RAFT, if a node loses the ability to 
> connect to the CMG leader, it will never drop other nodes from its PT, so it 
> will keep trying to deliver messages to them indefinitely. This seems ok.
> One thing that should be considered is that {{TopologyService}} (for PT) and 
> {{LogicalTopologyService}} are defined in different modules, which might 
> cause difficulties when subscribing to each other's events.
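
The five summary items in the description can be read as a small state machine. What follows is only a minimal sketch of that reading, not Ignite code: the class TopologyTrackerSketch, all of its method names, and the use of plain strings as ScaleCube IDs are hypothetical. It also assumes that a node which left the ST without ever joining the LT is dropped from the PT when the same timeout expires, which the description does not specify. Cancellation of a delayed removal is modelled with a per-node token, so a stale timer that fires after the node has returned does nothing.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the proposed ST/PT/LT rules; not the Ignite API. */
class TopologyTrackerSketch implements AutoCloseable {
    private final Set<String> scaleCubeTopology = new HashSet<>();
    private final Set<String> physicalTopology = new HashSet<>();
    private final Set<String> logicalTopology = new HashSet<>();
    /** Node ID -> token of the delayed removal that is currently valid. */
    private final Map<String, Long> pendingRemovals = new HashMap<>();
    private long nextToken;
    private final long removalDelayMillis;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "lt-removal-sketch");
                t.setDaemon(true); // do not keep the JVM alive
                return t;
            });

    TopologyTrackerSketch(long removalDelayMillis) {
        this.removalDelayMillis = removalDelayMillis;
    }

    /** Items 1 and 3: entering ST means entering PT and cancels a pending removal. */
    synchronized void onScaleCubeAppeared(String nodeId) {
        pendingRemovals.remove(nodeId);
        scaleCubeTopology.add(nodeId);
        physicalTopology.add(nodeId);
        // Item 2: validation would be started here.
    }

    /** Item 2: successful validation adds the node to LT. */
    synchronized void onValidationSucceeded(String nodeId) {
        if (physicalTopology.contains(nodeId)) {
            logicalTopology.add(nodeId);
        }
    }

    /** Item 3: leaving ST only schedules a delayed removal. Returns its token, or -1. */
    synchronized long onScaleCubeLeft(String nodeId) {
        if (!scaleCubeTopology.remove(nodeId)) {
            return -1;
        }
        long token = ++nextToken;
        pendingRemovals.put(nodeId, token);
        scheduler.schedule(() -> onRemovalTimeout(nodeId, token),
                removalDelayMillis, TimeUnit.MILLISECONDS);
        return token;
    }

    /** Fires when the delay expires; ignored if the removal was cancelled or replaced. */
    synchronized void onRemovalTimeout(String nodeId, long token) {
        Long current = pendingRemovals.get(nodeId);
        if (current != null && current == token) {
            pendingRemovals.remove(nodeId);
            leaveLogicalTopology(nodeId);
        }
    }

    /** Item 5: a graceful leave skips the delay. */
    synchronized void onGracefulLeave(String nodeId) {
        pendingRemovals.remove(nodeId);
        leaveLogicalTopology(nodeId);
    }

    /** Item 4: leaving LT means leaving PT, so messages can no longer be sent. */
    private void leaveLogicalTopology(String nodeId) {
        logicalTopology.remove(nodeId);
        physicalTopology.remove(nodeId);
    }

    synchronized boolean canSendTo(String nodeId) {
        return physicalTopology.contains(nodeId);
    }

    synchronized boolean inLogicalTopology(String nodeId) {
        return logicalTopology.contains(nodeId);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

In this sketch a node that blinks out of the ST and returns before the delay expires stays in both the PT and the LT, so messages can still be sent to it throughout the glitch. Only an expired timeout or an explicit graceful leave makes canSendTo return false.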



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
