A bit of background: when TcpDiscoverySpi is used, the coordinator sends a TcpDiscoveryMetricsUpdateMessage once per metricsUpdateFrequency, which is 2 seconds by default. It serves as a ping that ensures the ring is connected and all nodes are alive. These messages have high priority, i.e., they are put at the head of the RingMessageWorker's queue instead of its tail.
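The head-of-queue behaviour described above can be sketched with a plain Deque. This is a hypothetical simplified model, not Ignite's actual RingMessageWorker code; the class and method names are made up for illustration:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of a discovery worker's message queue.
public class WorkerQueue {
    private final Deque<String> queue = new ArrayDeque<>();

    // High-priority messages (e.g. metrics updates) go to the head;
    // everything else goes to the tail.
    public void add(String msg, boolean highPriority) {
        if (highPriority)
            queue.addFirst(msg);
        else
            queue.addLast(msg);
    }

    // Messages are always drained from the head.
    public String poll() {
        return queue.pollFirst();
    }

    public static void main(String[] args) {
        WorkerQueue q = new WorkerQueue();

        q.add("PartitionMapExchange", false);
        q.add("MetricsUpdate-1", true);
        q.add("MetricsUpdate-2", true);

        // Each new high-priority message jumps ahead of everything else,
        // so the regular message is only reached once the queue of
        // high-priority ones is fully drained.
        System.out.println(q.poll()); // MetricsUpdate-2
        System.out.println(q.poll()); // MetricsUpdate-1
        System.out.println(q.poll()); // PartitionMapExchange
    }
}
```

If new high-priority messages keep arriving faster than the worker drains the head, `PartitionMapExchange` in this sketch is never polled, which is exactly the starvation scenario described below.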
Now consider a situation where a single link between two nodes in the ring works slowly. It may receive and deliver all messages, but with a certain delay. This is possible when the network is unstable or when one of the nodes experiences many short GC pauses. It leads to a growing message queue on the node that sits just before the slow link. The worst part is that if high-priority messages are generated faster than they are processed, other messages never get a chance to be processed. So no nodes are kicked out of the cluster, but no useful progress is made either. Partition map exchange may hang for this reason, and the cause won't be obvious from the logs.

JIRA ticket: https://issues.apache.org/jira/browse/IGNITE-10808

I made a draft of a fix for this problem: https://github.com/apache/ignite/pull/5771

The PR also contains a test that reproduces this situation. The patch makes sure that only a single TcpDiscoveryMetricsUpdateMessage of each kind is stored in the queue, i.e., if a newer message arrives at the node, the old one is discarded. It also makes sure that regular messages get processed: if the last processed message had high priority, the next high-priority message is put at the tail of the queue instead of the head. The fix addresses the described problem, but slightly increases the utilization of the discovery threads.

What do you think of this fix? Do you have better ideas on how to improve the heartbeat mechanism to avoid such situations?

Denis
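P.S. For readers skimming the thread, here is a minimal sketch of the two ideas in the patch: deduplicating metrics messages and alternating the insertion point. This is illustrative pseudologic with made-up names, not the actual code from the PR:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the patched queue behaviour:
// 1) at most one metrics-update message is kept in the queue;
// 2) if the last polled message was high priority, the next high-priority
//    message goes to the tail, so regular messages make progress too.
public class PatchedQueue {
    private final Deque<String> queue = new ArrayDeque<>();
    private boolean lastPolledHighPrio;

    public void addMetricsUpdate(String msg) {
        // Discard the stale metrics message, if any
        // (assumes metrics messages are recognizable by a prefix).
        queue.removeIf(m -> m.startsWith("MetricsUpdate"));

        if (lastPolledHighPrio)
            queue.addLast(msg);  // give regular messages a turn
        else
            queue.addFirst(msg);
    }

    public void addRegular(String msg) {
        queue.addLast(msg);
    }

    public String poll() {
        String msg = queue.pollFirst();
        lastPolledHighPrio = msg != null && msg.startsWith("MetricsUpdate");
        return msg;
    }
}
```

With this behaviour, a burst of metrics updates collapses into a single queued message, and a regular message is polled at least every other step, so it can no longer starve indefinitely.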