[ https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vladimir Steshin updated IGNITE-13012:
--------------------------------------
    Labels: iep-45  (was: )

> Make node connection checking rely on the configuration. Simplify node ping
> routine.
> ------------------------------------------------------------------------------------
>
>                 Key: IGNITE-13012
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13012
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>
> Current node-to-node connection checking has several drawbacks:
> 1) The minimal connection checking interval is not bound to the failure
> detection parameters:
> static int ServerImpl.CON_CHECK_INTERVAL = 500;
> 2) Connection checking is implemented as periodic sending of a dedicated
> message (TcpDiscoveryConnectionCheckMessage). It is bound to its own
> timestamp (ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent), not to
> the common time of the last sent message. This is odd, because any discovery
> message actually checks the connection, and
> TcpDiscoveryConnectionCheckMessage is just an addition for when the message
> queue has been empty for a long time.
> 3) The period of node-to-node connection checking is sometimes shortened
> for no clear reason: when no message has been sent or received within
> failureDetectionTimeout. In that case, although there is a minimal period of
> connection checking (ServerImpl.CON_CHECK_INTERVAL), a
> TcpDiscoveryConnectionCheckMessage can be sent before this period has
> elapsed. Moreover, this premature node ping also relies on the time of the
> last received message. Imagine: if node 2 receives no message from node 1
> within some time, it decides to do an extra ping of node 3 without waiting
> for the regular ping interval. Such behavior is confusing and gives no
> additional guarantees.
> 4) If #3 happens, the node writes to the log at INFO level: “Local node
> seems to be disconnected from topology …” whereas it is not actually
> disconnected. The user can see this message if failureDetectionTimeout is
> set below 500 ms. I would not like to see an INFO message in the log saying
> that a node might be disconnected. It sounds as if there is trouble in the
> network, not as if everything is OK.
>
> Suggestions:
> 1) Make the connection check interval depend on failureDetectionTimeout or
> similar parameters.
> 2) Make the connection check interval rely on the common time of the last
> sent message, not on a dedicated timestamp.
> 3) Remove the additional, random, quickened connection checking.
> 4) Do not worry the user with “Node disconnected” when everything is OK.
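
Suggestions 1) to 3) above can be illustrated with a minimal Java sketch. This is not the Ignite implementation: the class ConnectionCheckPolicy, its methods, and the choice of a quarter of failureDetectionTimeout as the interval are all assumptions made for illustration. The sketch derives the check interval from the configured timeout and decides whether a connection check message is due from the time of the last sent message of any kind, with no extra shortened path.

```java
// Hypothetical sketch, not Ignite code: names and the divisor are assumptions.
final class ConnectionCheckPolicy {
    // Assumed fraction of failureDetectionTimeout used as the check interval.
    private static final long INTERVAL_DIVISOR = 4;

    private final long checkIntervalMs;

    // Time of the last sent message of ANY type, not only connection checks.
    private long lastMessageSentMs;

    ConnectionCheckPolicy(long failureDetectionTimeoutMs, long nowMs) {
        if (failureDetectionTimeoutMs <= 0)
            throw new IllegalArgumentException("failureDetectionTimeout must be positive");

        // Suggestion 1: the interval follows the configuration, no fixed 500 ms.
        this.checkIntervalMs = Math.max(1, failureDetectionTimeoutMs / INTERVAL_DIVISOR);
        this.lastMessageSentMs = nowMs;
    }

    long checkIntervalMs() {
        return checkIntervalMs;
    }

    // Suggestion 2: every sent discovery message resets the common timestamp.
    void onMessageSent(long nowMs) {
        lastMessageSentMs = nowMs;
    }

    // Suggestion 3: a single rule, a check is due only after a full interval
    // with nothing sent. Received messages do not shorten the period.
    boolean connectionCheckDue(long nowMs) {
        return nowMs - lastMessageSentMs >= checkIntervalMs;
    }
}
```

With failureDetectionTimeout = 400 ms this sketch gives a 100 ms interval, and sending any regular discovery message postpones the next dedicated check by a full interval.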