[ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--------------------------------------
    Labels: iep-45  (was: )

> Make node connection checking rely on the configuration. Simplify node ping 
> routine.
> ------------------------------------------------------------------------------------
>
>                 Key: IGNITE-13012
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13012
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>
> Current node-to-node connection checking has several drawbacks:
> 1)    The minimal connection checking interval is not bound to the failure 
> detection parameters: 
> static int ServerImpl.CON_CHECK_INTERVAL = 500;
> 2)    Connection checking is implemented as periodic sending of a dedicated 
> message (TcpDiscoveryConnectionCheckMessage). It is bound to its own 
> timestamp (ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent), not to 
> the common time of the last sent message. This is odd because any discovery 
> message actually checks the connection, and 
> TcpDiscoveryConnectionCheckMessage is only needed when the message queue has 
> been empty for a long time.
> 3)    The period of node-to-node connection checking is sometimes shortened 
> for no clear reason: when no message is sent or received within 
> failureDetectionTimeout. In this case, although there is a minimal 
> connection checking period (ServerImpl.CON_CHECK_INTERVAL), 
> TcpDiscoveryConnectionCheckMessage can be sent before this period has 
> elapsed. Moreover, this premature node ping also depends on the time of the 
> last received message. Imagine: if node 2 receives no message from node 1 
> for some time, it decides to send an extra ping to node 3 without waiting 
> for the regular ping interval. Such behavior is confusing and gives no 
> additional guarantees.
> 4)    If #3 happens, the node logs at INFO level: “Local node seems to be 
> disconnected from topology …” although it is not actually disconnected. A 
> user can see this message after setting failureDetectionTimeout < 500ms. An 
> INFO message saying that a node might be disconnected is undesirable: it 
> suggests network trouble when everything is in fact OK. 
> Suggestions:
> 1)    Base the connection check interval on failureDetectionTimeout or 
> similar parameters.
> 2)    Make the connection check interval rely on the common time of the last 
> sent message, not on a dedicated timestamp.
> 3)    Remove the additional, irregular, accelerated connection checking.
> 4)    Do not worry the user with “Node disconnected” when everything is OK.
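
Suggestions 1 and 2 can be illustrated with a minimal Java sketch. This is
not Ignite's ServerImpl code: the class ConnectionCheckSketch, its methods,
and the divisor CHECKS_PER_TIMEOUT are hypothetical names and values chosen
only for illustration. The sketch derives the check interval from
failureDetectionTimeout and measures idleness from the time of the last sent
message of any kind.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration only; not taken from Ignite's ServerImpl. */
final class ConnectionCheckSketch {
    /** Assumption: number of checks that should fit into one failure detection window. */
    private static final int CHECKS_PER_TIMEOUT = 4;

    /** Interval derived from failureDetectionTimeout instead of a fixed 500 ms constant. */
    private final long checkIntervalNanos;

    /** Time the last message of ANY type was sent (no dedicated per-message-type timestamp). */
    private long lastSentNanos;

    ConnectionCheckSketch(long failureDetectionTimeoutMs, long nowNanos) {
        if (failureDetectionTimeoutMs <= 0)
            throw new IllegalArgumentException("failureDetectionTimeoutMs must be positive");

        // Never let the interval collapse to zero for very small timeouts.
        long intervalMs = Math.max(1, failureDetectionTimeoutMs / CHECKS_PER_TIMEOUT);

        this.checkIntervalNanos = TimeUnit.MILLISECONDS.toNanos(intervalMs);
        this.lastSentNanos = nowNanos;
    }

    /** Call after sending any discovery message, since each one exercises the connection. */
    void onMessageSent(long nowNanos) {
        lastSentNanos = nowNanos;
    }

    /** True only when nothing has been sent for a full check interval. */
    boolean connectionCheckDue(long nowNanos) {
        return nowNanos - lastSentNanos >= checkIntervalNanos;
    }

    long checkIntervalMs() {
        return TimeUnit.NANOSECONDS.toMillis(checkIntervalNanos);
    }
}
```

In this sketch a regular discovery message postpones the next connection
check, so a dedicated check message would only be needed when the queue
stays empty for a whole interval. Received messages are deliberately not
consulted, which corresponds to suggestion 3.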



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
