[ https://issues.apache.org/jira/browse/IGNITE-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vladimir Steshin resolved IGNITE-13111.
---------------------------------------
    Resolution: Won't Fix

I find IGNITE-13016 to be a better solution. We cannot rely on the ping interval because two nodes are involved in backward connection checking. They work with the same but shifted ping intervals. If node N asks N+2 to check N+1, N+2 waits for the rest of its failureDetectionTimeout. But the ping and failureDetectionTimeout on N are shifted in comparison with N+2. N can fail before N+2 has finished waiting for a ping from N+1.

> Simplify backward checking of node connection.
> ----------------------------------------------
>
>                 Key: IGNITE-13111
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13111
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>         Attachments: FailureDetectionResearch.patch, FailureDetectionResearch.txt, FailureDetectionResearch_fixed.txt, WostCaseStepByStep.txt
>
>
> We should fix several drawbacks in the backward checking of a failed node. They prolong node failure detection up to:
> ServerImpl.CON_CHECK_INTERVAL + 2 * IgniteConfiguration.failureDetectionTimeout + 300ms.
> See:
> * '_NodeFailureResearch.patch_'. It creates the test 'FailureDetectionResearch', which emulates long answers on a failed node and measures failure detection delays.
> * '_FailureDetectionResearch.txt_' - results of the test.
> * '_FailureDetectionResearch_fixed.txt_' - results of the test after this fix.
> * '_WostCaseStepByStep.txt_' - description of how the worst case happens.
> *Suggestion:*
> 1) We can simplify backward connection checking as we implement IGNITE-13012. Once we get a robust, predictable connection ping, we don't need to check the previous node, because we can see whether it sent a ping to the current node within the failure detection timeout. If not, the previous node can be considered lost.
> Instead of:
> {code:java}
> // Node cannot connect to its next (for local node it's previous).
> // Need to check connectivity to it.
> long rcvdTime = lastRingMsgReceivedTime;
> long now = U.currentTimeMillis();
>
> // We got message from previous in less than double connection check interval.
> boolean ok = rcvdTime + effectiveExchangeTimeout() >= now;
> TcpDiscoveryNode previous = null;
>
> if (ok) {
>     // Check case when previous node suddenly died. This will speed up
>     // node failing.
>     // ... checking connection to previous node ...
> }
> {code}
> we could wait for a ping from the previous node. Scenario:
> * n1 (Node1) failed to connect to n2.
> * n1 asks n3 to establish a connection instead of n2.
> * n3 waits for a ping from n2 for the rest of the failure detection timeout.
> * If n3 received a ping from n2, it connects with n1. Or answers n1 that n2 is considered alive.
> 2) Then, it seems we can remove:
> {code:java}
> ServerImpl.SocketReader.isConnectionRefused(SocketAddress addr);
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
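
For illustration only, here is a minimal sketch of the arithmetic behind the quoted worst-case formula and of the timing check suggested in the issue (the previous node counts as alive if its last ping arrived within the failure detection timeout). The class and method names are hypothetical and are not part of Ignite's ServerImpl. The values 500 ms and 10000 ms are assumed example settings for the connection check interval and failureDetectionTimeout. The sketch uses timestamps from a single node's clock, so it does not model the shifted ping intervals that led to the Won't Fix resolution.

```java
/** Hypothetical sketch; none of these names exist in Ignite's ServerImpl. */
final class BackwardCheckSketch {
    /** Fixed extra delay taken from the worst-case formula in the issue, in ms. */
    static final long EXTRA_DELAY_MS = 300;

    /** Worst case from the issue: CON_CHECK_INTERVAL + 2 * failureDetectionTimeout + 300 ms. */
    static long worstCaseDetectionDelayMs(long conCheckIntervalMs, long failureDetectionTimeoutMs) {
        return conCheckIntervalMs + 2 * failureDetectionTimeoutMs + EXTRA_DELAY_MS;
    }

    /** How long a node may still wait for a ping from its previous node before the timeout expires. */
    static long remainingWaitMs(long lastPingFromPrevMs, long nowMs, long failureDetectionTimeoutMs) {
        long elapsed = nowMs - lastPingFromPrevMs;
        return Math.max(0, failureDetectionTimeoutMs - elapsed);
    }

    /** The previous node counts as alive if its last ping arrived within the failure detection timeout. */
    static boolean previousConsideredAlive(long lastPingFromPrevMs, long nowMs, long failureDetectionTimeoutMs) {
        return nowMs - lastPingFromPrevMs <= failureDetectionTimeoutMs;
    }

    public static void main(String[] args) {
        long failureDetectionTimeoutMs = 10_000; // assumed example value
        long conCheckIntervalMs = 500;           // assumed example value

        // 500 + 2 * 10000 + 300
        System.out.println(worstCaseDetectionDelayMs(conCheckIntervalMs, failureDetectionTimeoutMs));

        // Last ping at t=1000 ms, now t=4000 ms: 3000 ms elapsed, 7000 ms left to wait.
        System.out.println(remainingWaitMs(1_000, 4_000, failureDetectionTimeoutMs));

        // Last ping at t=1000 ms, now t=12000 ms: 11000 ms elapsed, timeout exceeded.
        System.out.println(previousConsideredAlive(1_000, 12_000, failureDetectionTimeoutMs));
    }
}
```

With these assumed values the worst-case delay comes to 20800 ms, roughly twice the configured failure detection timeout.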