[ https://issues.apache.org/jira/browse/IGNITE-13980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vladimir Steshin updated IGNITE-13980:
--------------------------------------
    Description:
Suggestion: remove the duplicated ‘ping’ and simplify the code.

To ensure that a node has not failed, TcpDiscoverySpi has a robust ping (TcpDiscoveryConnectionCheckMessage) and the backward connection check. There is also a status check message (TcpDiscoveryStatusCheckMessage), which looks outdated. It was introduced with the first versions of discovery, when cluster stability and message delivery were still under development.

Currently, TcpDiscoveryStatusCheckMessage is raised only occasionally at cluster start. It is not raised later because of the ping: the ping updates the time of the last received message, which is why the status check is not raised.

It is possible for a node to lose all incoming connections but keep its connection to the next node. In this case the node gets removed from the ring by its follower, but it cannot recognize the failure because it still successfully sends messages to the next node. Instead of the complex processing of TcpDiscoveryStatusCheckMessage, it seems enough to answer a message with 'OK, but you are not in the ring'. Every other node sees the failure of the malfunctioning node and can report it in the message response.

The ticket has been additionally verified with the integration discovery test: https://github.com/apache/ignite/pull/8716

We can keep TcpDiscoveryStatusCheckMessage for backward compatibility with older versions of Ignite. The subtask (IGNITE-14053) suggests removing TcpDiscoveryStatusCheckMessage completely.

  was:
Suggestion: remove the duplicated ‘ping’ and simplify the code.

To ensure that a node has not failed, TcpDiscoverySpi has a robust ping (TcpDiscoveryConnectionCheckMessage) and the backward connection check. There is also a status check message (TcpDiscoveryStatusCheckMessage), which looks outdated. It was introduced with the first versions of discovery, when cluster stability and message delivery were still under development.
Currently, TcpDiscoveryStatusCheckMessage is raised only occasionally at cluster start. It is not raised later because of the ping: the ping updates the time of the last received message, which is why the status check is not raised.

It is possible for a node to lose all incoming connections but keep its connection to the next node. In this case the node gets removed from the ring by its follower, but it cannot recognize the failure because it still successfully sends messages to the next node. Instead of the complex processing of TcpDiscoveryStatusCheckMessage, it seems enough to answer a message with 'OK, but you are not in the ring'. Every other node sees the failure of the malfunctioning node and can report it in the message response.

We can keep TcpDiscoveryStatusCheckMessage for backward compatibility with older versions of Ignite. The subtask (IGNITE-14053) suggests removing TcpDiscoveryStatusCheckMessage completely.


> Remove duplicated ping: processing and raising StatusCheckMessage.
> ------------------------------------------------------------------
>
>                 Key: IGNITE-13980
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13980
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Minor
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Suggestion: remove the duplicated ‘ping’ and simplify the code.
> To ensure that a node has not failed, TcpDiscoverySpi has a robust ping
> (TcpDiscoveryConnectionCheckMessage) and the backward connection check.
> There is also a status check message (TcpDiscoveryStatusCheckMessage),
> which looks outdated. It was introduced with the first versions of
> discovery, when cluster stability and message delivery were still under
> development.
> Currently, TcpDiscoveryStatusCheckMessage is raised only occasionally at
> cluster start. It is not raised later because of the ping: the ping
> updates the time of the last received message, which is why the status
> check is not raised.
> It is possible for a node to lose all incoming connections but keep its
> connection to the next node. In this case the node gets removed from the
> ring by its follower, but it cannot recognize the failure because it still
> successfully sends messages to the next node. Instead of the complex
> processing of TcpDiscoveryStatusCheckMessage, it seems enough to answer a
> message with 'OK, but you are not in the ring'. Every other node sees the
> failure of the malfunctioning node and can report it in the message
> response.
> The ticket has been additionally verified with the integration discovery
> test: https://github.com/apache/ignite/pull/8716
> We can keep TcpDiscoveryStatusCheckMessage for backward compatibility with
> older versions of Ignite. The subtask (IGNITE-14053) suggests removing
> TcpDiscoveryStatusCheckMessage completely.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
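
For illustration only, the 'OK, but you are not in the ring' reply proposed in the description could be sketched roughly as follows. This is a toy model, not Ignite code: the class RingMembershipSketch, the Reply enum, and the Receiver and Sender types are all hypothetical names invented for this sketch, and the real TcpDiscoverySpi message handling is considerably more involved.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.UUID;

/** Toy model of the proposed reply; every name here is hypothetical. */
public class RingMembershipSketch {
    /** Hypothetical reply codes, not Ignite's actual response types. */
    enum Reply { OK, OK_BUT_NOT_IN_RING }

    /** The next node in the ring, holding its own view of the topology. */
    static final class Receiver {
        private final Set<UUID> ring = new LinkedHashSet<>();

        void addNode(UUID id) { ring.add(id); }

        /** Called when this node learns that another node has failed. */
        void failNode(UUID id) { ring.remove(id); }

        /** The message is accepted either way; the reply says whether the sender is still a member. */
        Reply onMessage(UUID senderId) {
            return ring.contains(senderId) ? Reply.OK : Reply.OK_BUT_NOT_IN_RING;
        }
    }

    /** A node that lost its incoming connections but can still send to the next node. */
    static final class Sender {
        final UUID id = UUID.randomUUID();
        private boolean removedFromRing;

        /** The sender learns about its own removal from the reply, without a separate status check. */
        void onReply(Reply reply) {
            if (reply == Reply.OK_BUT_NOT_IN_RING)
                removedFromRing = true;
        }

        boolean isRemovedFromRing() { return removedFromRing; }
    }

    public static void main(String[] args) {
        Receiver next = new Receiver();
        Sender node = new Sender();

        next.addNode(node.id);
        node.onReply(next.onMessage(node.id));
        System.out.println(node.isRemovedFromRing());

        // The rest of the cluster drops the node; its outgoing connection still works.
        next.failNode(node.id);
        node.onReply(next.onMessage(node.id));
        System.out.println(node.isRemovedFromRing());
    }
}
```

The point of the sketch is only that the information travels in the ordinary message response, so the malfunctioning node needs no dedicated status check message to find out it was dropped.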