[ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--------------------------------------
    Description: 
Node-to-next-node connection checking has several related drawbacks.
We should fix the following:

1) First of all, make the connection check interval predictable and dependent
on failureDetectionTimeout or similar parameters. The current value is a
constant:
{code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}

2) Make the connection check interval rely on the common time of the last sent
message of any kind. The current ping is bound to its own timestamp:
{code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
This is odd because any discovery message effectively checks the connection,
and TcpDiscoveryConnectionCheckMessage is only a supplement for when the
message queue is empty for a long time.

3) Remove the additional, randomly appearing accelerated connection checking.
Once we do #1, it will become even more useless.
Although we have a connection check period (see #1), we can also send a ping
before the period expires. This premature node ping relies on the time of any
sent or even received message. Imagine: if node 2 receives no message from node
1 within some time, it decides to send an extra ping to node 3 without waiting
for the regular ping. This happens quite randomly. Such behavior causes
confusion and gives no benefit.

4) Do not worry the user with “Node disconnected” when everything is OK. Once
we do #1, this message will become even more useless.
If #3 happens, the node logs at INFO: “Local node seems to be
disconnected from topology …” whereas it is not actually disconnected at all.
The user can see this unexpected and worrying message if they set
failureDetectionTimeout < 500ms.
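
Items #1 and #2 can be sketched roughly as follows. This is a minimal
illustration, not the actual ServerImpl code: the class ConnectionCheckSketch,
its methods, and the divisor CHECKS_PER_TIMEOUT = 4 are assumptions made for
the example. It derives the check interval from failureDetectionTimeout and
keeps a single timestamp refreshed by any sent message, so a dedicated
connection check is due only after the queue stays idle for a full interval.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of items #1 and #2; not the real ServerImpl logic. */
final class ConnectionCheckSketch {
    /** Assumed ratio: how many checks fit into one failure detection window. */
    private static final int CHECKS_PER_TIMEOUT = 4;

    /** Check interval derived from the configured timeout (item #1). */
    private final long checkIntervalMs;

    /** Time of the last message of ANY kind sent to the next node (item #2). */
    private long lastSentNanos;

    ConnectionCheckSketch(long failureDetectionTimeoutMs, long nowNanos) {
        if (failureDetectionTimeoutMs <= 0)
            throw new IllegalArgumentException("Timeout must be positive.");

        // Never go below 1 ms, even for very small timeouts.
        checkIntervalMs = Math.max(1, failureDetectionTimeoutMs / CHECKS_PER_TIMEOUT);
        lastSentNanos = nowNanos;
    }

    long checkIntervalMs() {
        return checkIntervalMs;
    }

    /** Call after any discovery message is sent, not only after a ping. */
    void onMessageSent(long nowNanos) {
        lastSentNanos = nowNanos;
    }

    /** A dedicated check is due only if nothing was sent for a full interval. */
    boolean connectionCheckDue(long nowNanos) {
        return nowNanos - lastSentNanos >= TimeUnit.MILLISECONDS.toNanos(checkIntervalMs);
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        ConnectionCheckSketch sketch = new ConnectionCheckSketch(10_000, start);

        System.out.println("Check interval, ms: " + sketch.checkIntervalMs());
        System.out.println("Check due right after start: "
            + sketch.connectionCheckDue(System.nanoTime()));
    }
}
```

Passing the current time in explicitly keeps the sketch deterministic. With
one shared timestamp there is no separate, earlier ping path, which is what
items #3 and #4 ask to remove.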

  was:
Node-to-next-node connection checking has several drawbacks. We should fix:

1) Make the connection check interval dependent on failureDetectionTimeout or
similar parameters. The current value is a constant:
{code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500;{code}

2) Connection checking isn't based on the time of the last message sent.
TcpDiscoveryConnectionCheckMessage is bound to its own timestamp
(ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent). This is odd because
any discovery message actually checks the connection, and
TcpDiscoveryConnectionCheckMessage is only a supplement for when the message
queue is empty for a long time.

3) The period of node-to-node connection checking can sometimes be shortened
for a strange reason: if no sent or received message appears within
failureDetectionTimeout. Here, although we have a minimal connection check
period (ServerImpl.CON_CHECK_INTERVAL), we can also send
TcpDiscoveryConnectionCheckMessage before this period expires. Moreover, this
premature node ping also relies on the time of the last received message.
Imagine: if node 2 receives no message from node 1 within some time, it decides
to send an extra ping to node 3 without waiting for the regular ping interval.
Such behavior causes confusion and gives no additional guarantees.

4) If #3 happens, the node logs at INFO: “Local node seems to be
disconnected from topology …” whereas it is not actually disconnected. The user
can see this message if they set failureDetectionTimeout < 500ms. I wouldn’t
like to see INFO in a log saying a node might be disconnected. This sounds like
network trouble, not like everything is OK.

Suggestions:

 2) Make the connection check interval rely on the common time of the last
sent message, not on a dedicated timestamp.
 3) Remove the additional, random, accelerated connection checking.
 4) Do not worry the user with “Node disconnected” when everything is OK.


> Make node connection checking rely on the configuration. Simplify node ping 
> routine.
> ------------------------------------------------------------------------------------
>
>                 Key: IGNITE-13012
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13012
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>
> Node-to-next-node connection checking has several related drawbacks. We
> should fix the following:
> 1) First of all, make the connection check interval predictable and
> dependent on failureDetectionTimeout or similar parameters. The current value
> is a constant:
> {code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
> 2) Make the connection check interval rely on the common time of the last
> sent message of any kind. The current ping is bound to its own timestamp:
> {code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
> This is odd because any discovery message effectively checks the connection,
> and TcpDiscoveryConnectionCheckMessage is only a supplement for when the
> message queue is empty for a long time.
> 3) Remove the additional, randomly appearing accelerated connection checking.
> Once we do #1, it will become even more useless.
> Although we have a connection check period (see #1), we can also send a
> ping before the period expires. This premature node ping relies on the time
> of any sent or even received message. Imagine: if node 2 receives no message
> from node 1 within some time, it decides to send an extra ping to node 3
> without waiting for the regular ping. This happens quite randomly. Such
> behavior causes confusion and gives no benefit.
> 4) Do not worry the user with “Node disconnected” when everything is OK. Once
> we do #1, this message will become even more useless.
> If #3 happens, the node logs at INFO: “Local node seems to be
> disconnected from topology …” whereas it is not actually disconnected at all.
> The user can see this unexpected and worrying message if they set
> failureDetectionTimeout < 500ms.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
