It sounds like you either have hot partition(s) or a hardware issue on
that node. I'm mentioning hardware because I once had a server with a
faulty CPU fan: the CPU overheated and throttled its frequency. The
result was a single server with much higher load than the rest of the
nodes in the cluster, and the symptoms looked very similar to hot
partitions.
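One rough way to rule the throttling case in or out on a Linux node is
to compare each core's current frequency with its maximum. The sketch
below assumes the kernel exposes the standard cpufreq sysfs files
(scaling_cur_freq and cpuinfo_max_freq). The function names and the 0.6
threshold are my own illustrative choices, not anything from Cassandra.
Note that idle cores also clock down under most governors, so a low
ratio only means something while the node is under load.

```python
from pathlib import Path


def read_khz(path):
    """Return the integer kHz value stored in a cpufreq sysfs file, or None."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def throttled_cpus(sysfs_root="/sys/devices/system/cpu", threshold=0.6):
    """Return {cpu_name: current/max ratio} for CPUs running below threshold.

    threshold is an arbitrary illustrative cut-off, not a recommended value.
    """
    slow = {}
    for cpufreq in sorted(Path(sysfs_root).glob("cpu[0-9]*/cpufreq")):
        cur = read_khz(cpufreq / "scaling_cur_freq")
        top = read_khz(cpufreq / "cpuinfo_max_freq")
        if cur is None or not top:
            continue  # file missing or unreadable on this system
        ratio = cur / top
        if ratio < threshold:
            slow[cpufreq.parent.name] = round(ratio, 2)
    return slow


if __name__ == "__main__":
    print(throttled_cpus() or "no CPU below threshold")
```

Running this on the busy node and on a healthy peer at the same time
gives a quick comparison; a busy node whose cores sit well below their
maximum frequency is worth a closer look at the hardware.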
To answer your questions:
1. Slow queries can be a cause of GC pressure, a result of GC pressure,
or both. In my experience, GC pressure leads to slow queries more often
than the other way around.
2. This is very suspicious. I can think of a few causes, such as hot
partitions + token aware load balancing, a client that only connects to
a single node in the cluster, heavy streaming activity within some token
ranges, bad retry policies, etc., and it's pretty hard to know what
exactly has happened without digging deeper into it. If this happens
frequently, you should properly investigate and fix it. This certainly
can lead to higher GC pressure on the affected node.
3. Were there just more cache hits while the overall number of queries
did not change much, or even went down? That would be an indicator of
bad retry policies. More cache hits alone won't cause much GC pressure,
but the underlying issue that led to the higher cache hits may.
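To narrow down the causes in point 2, it helps to confirm first that
the symptom really is confined to one node. A minimal sketch, assuming
you can export one number per node (outbound traffic, read latency, GC
count, and so on) from whatever monitoring you already have; the node
names, the sample values and the factor of 3 are made up for
illustration:

```python
from statistics import median


def outlier_nodes(metric_by_node, factor=3.0):
    """Return {node: value/median} for nodes above factor times the median.

    factor is an arbitrary illustrative cut-off.
    """
    base = median(metric_by_node.values())
    if base <= 0:
        return {}  # no meaningful baseline to compare against
    return {
        node: round(value / base, 1)
        for node, value in metric_by_node.items()
        if value > factor * base
    }


# Hypothetical outbound traffic per node, in MB/s.
outbound = {"node1": 40, "node2": 42, "node3": 410, "node4": 39, "node5": 41}
print(outlier_nodes(outbound))
```

If a single node stands out on traffic but not on client request count,
that points more towards streaming or retries than towards a hot
partition.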
On 08/03/2022 21:44, Inquistive allen wrote:
Hello team,
On a given day, a node in a 27-node cluster showed a higher frequency
of garbage collection, mostly young GC.
I have found below issues:
1. Higher number of slow queries being observed on that particular
node for that particular day compared to other days
2. Higher outgoing traffic observed from the node, 10 times the
average outbound traffic on that particular day
3. Higher number of cache requests hitting the key cache and chunk
cache than on other days on that particular node
The cluster has large partition warning as well.
My query is: which of the above is a likely cause of the higher
frequency of GC leading to high load average on the system?