[ 
https://issues.apache.org/jira/browse/IGNITE-14068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-14068:
--------------------------------------
    Description: 
If a node loses all +outgoing+ connections, it can decide that it is alone in 
the cluster and will not fail. This happens on small clusters, where the failed 
node can try, unsuccessfully, to connect to every other node before 
_connRecoveryTimeout_ expires.

Consider:
The cluster n1 -> n2 -> n3 -> n4 -> n1

* n4 loses all outgoing connections.
* n3 still pings n4 successfully.
* n4 attempts to connect to n1, n2 and n3. Each attempt fails because of the 
outgoing network failure.
* spi.connRecoveryTimeout is not reached, so n4 decides it is alone and 
continues working.
* n3 still sends messages to n4, so n4 does not lack incoming connections.
* The ring is actually broken because of n4, but n3 cannot detect that n4 has 
failed.
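
The steps above can be sketched as a toy model. This is not Ignite code: 
RingSketch, findNext, the Decision values, the per-attempt cost and the timeout 
value are all hypothetical names and numbers chosen for illustration. It only 
shows why a small ring lets the failed node run out of candidates before the 
recovery timeout expires, while a larger ring would hit the timeout first.

```java
import java.util.function.IntPredicate;

/** Toy model of the scenario above. Hypothetical; not the TcpDiscoverySpi implementation. */
final class RingSketch {
    /** Possible outcomes for the node that is looking for its next neighbour. */
    enum Decision { CONNECTED, ALONE_KEEPS_RUNNING, SEGMENTED }

    /**
     * Walks the ring from {@code self}, trying each other node once.
     *
     * @param self index of the node searching for a successor (n4 is index 3)
     * @param ringSize number of nodes in the ring
     * @param canConnect assumed predicate: true if an outgoing connection to that index succeeds
     * @param attemptCostMs assumed time spent on one failed connection attempt
     * @param connRecoveryTimeoutMs assumed recovery timeout
     */
    static Decision findNext(int self, int ringSize, IntPredicate canConnect,
                             long attemptCostMs, long connRecoveryTimeoutMs) {
        long elapsedMs = 0;

        for (int step = 1; step < ringSize; step++) {
            // Timeout reached while candidates remain: the node gives up and fails.
            if (elapsedMs >= connRecoveryTimeoutMs)
                return Decision.SEGMENTED;

            int candidate = (self + step) % ringSize;

            if (canConnect.test(candidate))
                return Decision.CONNECTED;

            elapsedMs += attemptCostMs;
        }

        // Every candidate was tried. If the timeout still has not expired,
        // the node concludes it is alone and keeps running (the reported bug).
        return elapsedMs >= connRecoveryTimeoutMs
            ? Decision.SEGMENTED
            : Decision.ALONE_KEEPS_RUNNING;
    }

    public static void main(String[] args) {
        IntPredicate outgoingBroken = node -> false; // n4 cannot reach anyone

        // Ring of 4 nodes, n4 has index 3; 500 ms per attempt and 10 s timeout are assumptions.
        System.out.println("4 nodes:   " + findNext(3, 4, outgoingBroken, 500, 10_000));
        System.out.println("100 nodes: " + findNext(3, 100, outgoingBroken, 500, 10_000));
    }
}
```

In this model the 4-node ring exhausts its three candidates after 1500 ms, well 
under the assumed timeout, so the node stays up. Incoming traffic from n3 plays 
no part in the decision, which matches the last two steps of the scenario.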



  was:
If node loses +outcoming+ connections, it can decide it is alone in the cluster 
and won't fail. Happens on small clusters where failed node is able to 
unsuccessfully try to connect to all other nodes before _connRecoveryTimeout_ 
expires.

Consider:
The cluster n1 -> n2 -> n3 -> n4 -> n1

* n4 looses all outgoing connections.
* n3 keeps successful ping to n4.
* n4 attempts to connect to n1, n2, n3. Fails with each due to outgoing network 
failure.
* spi.connrecoveryTimeout is not reached. n4 decides it is alone and continues 
working.
* n3 still sends messages to n4. n4 does not lack incoming connections.
* ring is actually broken because of n4. n3 cannot determine failure of n4. 




> Infinite node persistance in the ring while outcoming connections are lost
> --------------------------------------------------------------------------
>
>                 Key: IGNITE-14068
>                 URL: https://issues.apache.org/jira/browse/IGNITE-14068
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> If a node loses all +outgoing+ connections, it can decide that it is alone 
> in the cluster and will not fail. This happens on small clusters, where the 
> failed node can try, unsuccessfully, to connect to every other node before 
> _connRecoveryTimeout_ expires.
> Consider:
> The cluster n1 -> n2 -> n3 -> n4 -> n1
> * n4 loses all outgoing connections.
> * n3 still pings n4 successfully.
> * n4 attempts to connect to n1, n2 and n3. Each attempt fails because of the 
> outgoing network failure.
> * spi.connRecoveryTimeout is not reached, so n4 decides it is alone and 
> continues working.
> * n3 still sends messages to n4, so n4 does not lack incoming connections.
> * The ring is actually broken because of n4, but n3 cannot detect that n4 
> has failed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
