[ 
https://issues.apache.org/jira/browse/IGNITE-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138300#comment-17138300
 ] 

Vladimir Steshin edited comment on IGNITE-13111 at 6/17/20, 10:06 AM:
----------------------------------------------------------------------

I find IGNITE-13016 to be the better solution. We cannot rely on the ping 
interval because two nodes are involved in backward connection checking. They 
work with the same ping interval, but their intervals are shifted relative to 
each other. If node N asks N+2 to check N+1, N+2 waits for the rest of its 
failureDetectionTimeout. But the ping and failureDetectionTimeout on N are 
shifted in comparison with N+2, so N can fail before N+2 has finished waiting 
for a ping from N+1.
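
The timing argument above can be illustrated with a small sketch. This is a 
simplified model, not Ignite code: the class and method names are hypothetical, 
and it assumes that each node counts its failureDetectionTimeout from its own 
starting point (N from the moment it started trying N+1, N+2 from the last ping 
it received from N+1).

```java
public class ShiftedTimeoutSketch {
    /** Time at which the checking node (N+2) stops waiting for a ping from N+1. */
    static long checkerWaitEnd(long lastPingFromPrev, long failureDetectionTimeout) {
        return lastPingFromPrev + failureDetectionTimeout;
    }

    /** Time at which the asking node (N) exhausts its own timeout. */
    static long askerDeadline(long askerTimeoutStart, long failureDetectionTimeout) {
        return askerTimeoutStart + failureDetectionTimeout;
    }

    /** True if N runs out of time before N+2 has finished waiting. */
    static boolean askerExpiresFirst(long askerTimeoutStart, long lastPingFromPrev,
        long failureDetectionTimeout) {
        return askerDeadline(askerTimeoutStart, failureDetectionTimeout)
            < checkerWaitEnd(lastPingFromPrev, failureDetectionTimeout);
    }

    public static void main(String[] args) {
        long timeout = 10_000;  // hypothetical failureDetectionTimeout, ms
        long askerStart = 0;    // N starts its timeout window here
        long lastPing = 400;    // N+2 last heard from N+1 400 ms later

        // N's deadline is 10000, N+2 waits until 10400: N expires first.
        System.out.println(askerExpiresFirst(askerStart, lastPing, timeout));
    }
}
```

With these assumed numbers the two windows are shifted by 400 ms, so N reaches 
its deadline while N+2 is still waiting.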


was (Author: vladsz83):
I find IGNITE-13016 or IGNITE-13014 to be a better solution. We cannot rely on 
the ping interval because two nodes are involved in backward connection 
checking. They work with the same ping interval, but their intervals are 
shifted relative to each other. If node N asks N+2 to check N+1, N+2 waits for 
the rest of its failureDetectionTimeout. But the ping and 
failureDetectionTimeout on N are shifted in comparison with N+2, so N can fail 
before N+2 has finished waiting for a ping from N+1.

> Simplify backward checking of node connection.
> ----------------------------------------------
>
>                 Key: IGNITE-13111
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13111
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>         Attachments: FailureDetectionResearch.patch, 
> FailureDetectionResearch.txt, FailureDetectionResearch_fixed.txt, 
> WostCaseStepByStep.txt
>
>
> We should fix several drawbacks in the backward checking of a failed node. 
> They prolong node failure detection up to: 
> ServerImpl.CON_CHECK_INTERVAL + 2 * 
> IgniteConfiguration.failureDetectionTimeout + 300ms. 
> See:
> * '_NodeFailureResearch.patch_'. It creates the test 
> 'FailureDetectionResearch', which emulates long answers on a failed node and 
> measures failure detection delays.
> * '_FailureDetectionResearch.txt_' - results of the test.
> * '_FailureDetectionResearch_fixed.txt_' - results of the test after this fix.
> * '_WostCaseStepByStep.txt_' - description of how the worst case happens.
> *Suggestion:*
> 1) We can simplify backward connection checking as we implement IGNITE-13012. 
> Once we get a robust, predictable connection ping, we don't need to check the 
> previous node because we can see whether it sent a ping to the current node 
> within the failure detection timeout. If not, the previous node can be 
> considered lost.
> Instead of:
> {code:java}
> // Node cannot connect to its next (for local node it's previous).
> // Need to check connectivity to it.
> long rcvdTime = lastRingMsgReceivedTime;
> long now = U.currentTimeMillis();
>
> // We got message from previous in less than double connection check interval.
> boolean ok = rcvdTime + effectiveExchangeTimeout() >= now;
> TcpDiscoveryNode previous = null;
>
> if (ok) {
>     // Check case when previous node suddenly died. This will speed up
>     // node failing.
>     // ... checking connection to previous node ...
> }
> {code}
> we could wait for a ping from the previous node. Scenario:
> * n1 (Node1) failed to connect to n2.
> * n1 asks n3 to establish a connection instead of n2.
> * n3 waits for a ping from n2 for the rest of the failure detection timeout.
> * If n3 receives a ping from n2 within that time, it answers n1 that n2 is 
> considered alive. Otherwise, it connects with n1.
> 2) Then, it seems we can remove:
> {code:java}
> ServerImpl.SocketReader.isConnectionRefused(SocketAddress addr);
> {code}
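
The scenario in suggestion 1 can be sketched roughly as follows. This is only 
an illustration under stated assumptions, not the actual ServerImpl logic: the 
class and method names are hypothetical, timestamps are plain milliseconds, and 
n2 is assumed to count as alive only if n3 records a newer ping from it by the 
end of the wait.

```java
public class BackwardCheckSketch {
    /** Milliseconds n3 still has to wait for a ping from n2; never negative. */
    static long remainingWait(long now, long lastPingFromPrev, long failureDetectionTimeout) {
        return Math.max(0L, lastPingFromPrev + failureDetectionTimeout - now);
    }

    /** After the wait: n2 counts as alive only if a newer ping arrived meanwhile. */
    static boolean previousAlive(long lastPingBeforeWait, long lastPingAfterWait) {
        return lastPingAfterWait > lastPingBeforeWait;
    }

    public static void main(String[] args) {
        long timeout = 10_000;   // hypothetical failureDetectionTimeout, ms
        long lastPing = 2_000;   // last ping n3 received from n2
        long requestAt = 9_000;  // n1's request reaches n3

        // n3 waits only for the rest of the timeout counted from the last ping.
        System.out.println("n3 waits for " + remainingWait(requestAt, lastPing, timeout) + " ms");

        // No newer ping: n2 is considered lost and n3 would accept n1.
        System.out.println("no new ping, n2 alive: " + previousAlive(lastPing, lastPing));

        // A ping arrived during the wait: n3 would answer that n2 is alive.
        System.out.println("ping at 11500, n2 alive: " + previousAlive(lastPing, 11_500));
    }
}
```

In this model n3 does not open a connection to n2 at all. It only looks at the 
time of the last ping it received, which is the simplification the suggestion 
describes.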



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
