[ https://issues.apache.org/jira/browse/IGNITE-26986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Chugunov updated IGNITE-26986:
-------------------------------------
    Description: 
The connection recovery mechanism developed in IGNITE-7163 improves topology 
resilience against brief network instability. However, it can bring the whole 
cluster down if a cross-DC network partition occurs in a multi-datacenter 
environment.

This happens because connection recovery forces a node to segment from the 
topology when it cannot restore a connection to its next node within a 
specified timeout. If a node sits at the edge of its datacenter and several of 
its next nodes are in the remote DC, every attempt by the edge node to find an 
alive next node fails because of the partition. If the connection recovery 
timeout is not long enough, the edge node considers itself segmented and stops.

The previous node of the newly failed one then becomes an edge node, and the 
process repeats.

In this case the connection recovery mechanism forces the whole cluster to 
shut down instead of improving stability.
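
The cascade described above can be illustrated with a small toy simulation. 
This is only a sketch under simplifying assumptions, not Ignite's discovery 
code: the recovery timeout is reduced to a hypothetical `max_probes` count 
(how many next nodes a node manages to try before the timeout expires), all 
nodes are evaluated in synchronous rounds, and the names `simulate`, 
`reachable` and `max_probes` are made up for the illustration.

```python
from typing import List, Set, Tuple

Node = Tuple[str, str]  # (node id, datacenter id)


def reachable(a: Node, b: Node, partitioned: bool) -> bool:
    """Nodes in different DCs cannot connect while the cross-DC link is down."""
    return not partitioned or a[1] == b[1]


def simulate(ring: List[Node], max_probes: int, partitioned: bool = True) -> Set[str]:
    """Return the ids of nodes still running once the ring stops changing.

    Assumption: `max_probes` stands in for the connection recovery timeout.
    """
    alive = list(ring)
    while True:
        stopped = []
        for i, node in enumerate(alive):
            # Alive successors in ring order, excluding the node itself.
            successors = alive[i + 1:] + alive[:i]
            probed = successors[:max_probes]
            # A node segments itself if none of the probed next nodes answers.
            if probed and not any(reachable(node, nxt, partitioned) for nxt in probed):
                stopped.append(node)
        if not stopped:
            return {node_id for node_id, _ in alive}
        alive = [n for n in alive if n not in stopped]


ring = [("A1", "DC1"), ("A2", "DC1"), ("A3", "DC1"),
        ("B1", "DC2"), ("B2", "DC2"), ("B3", "DC2")]

print(simulate(ring, max_probes=2))  # short timeout: every node ends up stopped
print(simulate(ring, max_probes=4))  # longer timeout: edge nodes find a local next node
```

In this model, with `max_probes=2` the edge nodes A3 and B3 stop first, then A2 
and B2 become edge nodes and stop, and finally A1 and B1. With `max_probes=4` 
each edge node gets past the remote nodes to a node in its own DC, and no node 
stops.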

Therefore the mechanism should be aware of multi-datacenter environments and 
adjust its behavior accordingly.

  was:
Connection recovery mechanism developed in IGNITE-7163 improves topology 
resilience against brief network instability. However it could cause the whole 
cluster going down if a cross-DC network partitioning happens in a 
multi-datacenter environment.

This happens because connection recovery forces nodes to segment from topology 
when they cannot restore connection to the next node in a specified timeout. 
And if a node sits at the edge of its datacenter, and several of its next nodes 
are in the remote DC, then all attempts of the edge node to find an alive next 
will fail because of the partitioning. And if connection recovery timeout isn't 
big enough, the edge node will consider itself as segmented and stop.

Then the previous node of a newly failed one becomes an edge node, and the 
process repeats.

In this case connection recovery mechanism will force the whole cluster to 
shutdown instead of improving stability.

Thereby it should be aware of multi-datacenter environments and tweak its 
behavior accordingly.


> Multi-datacenter awareness for connection recovery mechanism
> ------------------------------------------------------------
>
>                 Key: IGNITE-26986
>                 URL: https://issues.apache.org/jira/browse/IGNITE-26986
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Sergey Chugunov
>            Priority: Major
>              Labels: IEP-140
>             Fix For: 2.18
>
>
> The connection recovery mechanism developed in IGNITE-7163 improves topology 
> resilience against brief network instability. However, it can bring the 
> whole cluster down if a cross-DC network partition occurs in a 
> multi-datacenter environment.
> This happens because connection recovery forces a node to segment from the 
> topology when it cannot restore a connection to its next node within a 
> specified timeout. If a node sits at the edge of its datacenter and several 
> of its next nodes are in the remote DC, every attempt by the edge node to 
> find an alive next node fails because of the partition. If the connection 
> recovery timeout is not long enough, the edge node considers itself 
> segmented and stops.
> The previous node of the newly failed one then becomes an edge node, and the 
> process repeats.
> In this case the connection recovery mechanism forces the whole cluster to 
> shut down instead of improving stability.
> Therefore the mechanism should be aware of multi-datacenter environments and 
> adjust its behavior accordingly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
