A bit of background:
When TcpDiscoverySpi is used, a TcpDiscoveryMetricsUpdateMessage is sent
by the coordinator once every metricsUpdateFrequency, which is 2 seconds
by default. It serves as a ping message that ensures the ring is
connected and all nodes are alive. These messages have a high priority,
i.e., they are put at the head of the RingMessageWorker's queue instead
of its tail.
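
For illustration, here is a minimal sketch of that queueing policy (the
names are made up; the actual queue lives in TcpDiscoverySpi's
RingMessageWorker):

import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch of the priority policy described above; illustrative
// names, not Ignite code.
class RingQueueSketch {
    private final Deque<Object> queue = new ArrayDeque<>();

    void addMessage(Object msg, boolean highPriority) {
        if (highPriority)
            queue.addFirst(msg); // metrics updates jump to the head
        else
            queue.addLast(msg);  // everything else waits at the tail
    }

    Object pollMessage() {
        return queue.pollFirst(); // the worker drains from the head
    }
}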

Now consider a situation where a single link between two nodes in the
ring works slowly. It may receive and deliver all messages, but with a
certain delay. This can happen when the network is unstable or when one
of the nodes experiences a lot of short GC pauses.
It leads to a growing message queue on the node that stands before the
slow link. The worst part is that if high-priority messages are
generated faster than they are processed, other messages won't even get
a chance to be processed. Thus, no nodes will be kicked out of the
cluster, but no useful progress will happen. Partition map exchange may
hang for this reason, and the cause won't be obvious from the logs.
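
To make the starvation concrete, here is a toy demo (plain Java, not
Ignite code): the slow link drains one message per tick, while a new
heartbeat jumps to the head on every tick, so the regular message at the
tail is never reached:

import java.util.ArrayDeque;
import java.util.Deque;

// Toy starvation demo: a regular message stuck behind heartbeats.
public class StarvationDemo {
    public static void main(String[] args) {
        Deque<String> queue = new ArrayDeque<>();

        queue.addLast("partition-map-exchange"); // regular message at the tail

        for (int tick = 0; tick < 10; tick++) {
            queue.addFirst("metrics-update-" + tick); // heartbeat jumps the queue

            // The slow link only manages to process one message per tick,
            // and it is always the freshly added heartbeat.
            System.out.println("tick " + tick + ": processed " + queue.pollFirst());
        }

        System.out.println("still queued: " + queue); // [partition-map-exchange]
    }
}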

JIRA ticket: https://issues.apache.org/jira/browse/IGNITE-10808
I made a draft of a fix for this problem:
https://github.com/apache/ignite/pull/5771
The PR also contains a test that reproduces this situation.

The patch ensures that only a single TcpDiscoveryMetricsUpdateMessage of
each kind is stored in the queue, i.e., if a newer message arrives at
the node, the old one is discarded. It also ensures that regular
messages keep being processed: if the last processed message had a high
priority, the next high-priority message is put at the tail of the queue
instead of the head.
The fix addresses the described problem, but it slightly increases the
utilization of the discovery threads.
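
In code terms, the patched policy looks roughly like the sketch below
(isMetricsUpdate() and sameKindAs() are assumed helper names; the actual
change to RingMessageWorker is in the PR above):

import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the two patched behaviours: deduplication of metrics
// updates and alternating head/tail placement of high-priority
// messages. Hypothetical names, not the real Ignite code.
class PatchedQueueSketch {
    private final Deque<DiscoveryMsg> queue = new ArrayDeque<>();

    /** Whether the most recently polled message had a high priority. */
    private boolean lastPolledHighPriority;

    void addMessage(DiscoveryMsg msg) {
        if (msg.isMetricsUpdate())
            // Keep only the newest metrics update of each kind.
            queue.removeIf(old -> old.isMetricsUpdate() && old.sameKindAs(msg));

        if (msg.isHighPriority() && !lastPolledHighPriority)
            queue.addFirst(msg);
        else
            queue.addLast(msg); // give a regular message a turn
    }

    DiscoveryMsg poll() {
        DiscoveryMsg msg = queue.pollFirst();

        if (msg != null)
            lastPolledHighPriority = msg.isHighPriority();

        return msg;
    }

    interface DiscoveryMsg {
        boolean isHighPriority();
        boolean isMetricsUpdate();
        boolean sameKindAs(DiscoveryMsg other);
    }
}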

What do you think of this fix? Do you have better ideas on how to
improve the heartbeat mechanism to avoid such situations?

Denis
