[ https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vladimir Steshin updated IGNITE-13012: -------------------------------------- Release Note: Fixed processing of failure detection timeout in TcpDiscoverySpi. If a node fails to send a message or ping, now it drops current connection strictly within this timeout and begins establishing new connection much faster. (was: Fixed processing of failure detection timeout in TcpDiscoverySpi. If a node fails to send a message or ping, now it drops current connection strictly within this timeout and begins to establish new connection much faster.) > Fix failure detection timeout. Simplify node ping routine. > ---------------------------------------------------------- > > Key: IGNITE-13012 > URL: https://issues.apache.org/jira/browse/IGNITE-13012 > Project: Ignite > Issue Type: Improvement > Affects Versions: 2.8.1 > Reporter: Vladimir Steshin > Assignee: Vladimir Steshin > Priority: Major > Labels: iep-45 > Fix For: 2.9 > > Attachments: IGNITE-13012-patch.patch > > Time Spent: 3h 40m > Remaining Estimate: 0h > > Connection failure may not be detected within > IgniteConfiguration.failureDetectionTimeout. Actual worst delay is: > ServerImpl.CON_CHECK_INTERVAL + IgniteConfiguration.failureDetectionTimeout. > Node ping routine is duplicated. > We should fix: > 1. Failure detection timeout should take in account last sent message. > Current ping is bound to own time: > {code:java}ServerImpl. RingMessageWorker.lastTimeConnCheckMsgSent{code} > This is weird because any discovery message check connection. > 2. Make connection check interval depend on failure detection timeout (FTD). > Current value is a constant: > {code:java}static int ServerImpls.CON_CHECK_INTERVAL = 500{code} > 3. Remove additional, quickened connection checking. Once we do fix 1, this > will become even more useless. > Despite TCP discovery has a period of connection checking, it may send ping > before this period exhausts. This premature ping relies also on the time of > any received message for some reason. > 4. Do not worry user with “Node seems disconnected” when everything is OK. > Once we do fix 1 and 3, this will become even more useless. > Node may log on INFO: “Local node seems to be disconnected from topology …” > whereas it is not actually disconnected at all. -- This message was sent by Atlassian Jira (v8.3.4#803005)