Hello there,
We have a cluster of roughly 100 nodes, and I am seeing dropped messages on
random nodes in the cluster, which cause error spikes and P99 latency
spikes.
I tried to figure out the cause, but I do not see any obvious bottleneck in
the cluster: the C* nodes still have plenty of idle CPU and spare disk I/O.
I do see some suspicious gossip events around that time, though, and I am
not sure whether they are related:
2017-01-21_16:43:56.71033 WARN 16:43:56 [GossipTasks:1]: Not marking nodes down due to local pause of 13079498815 > 5000000000
2017-01-21_16:43:56.85532 INFO 16:43:56 [ScheduledTasks:1]: MUTATION messages were dropped in last 5000 ms: 65 for internal timeout and 10895 for cross node timeout
2017-01-21_16:43:56.85533 INFO 16:43:56 [ScheduledTasks:1]: READ messages were dropped in last 5000 ms: 33 for internal timeout and 7867 for cross node timeout
2017-01-21_16:43:56.85534 INFO 16:43:56 [ScheduledTasks:1]: Pool Name        Active   Pending   Completed   Blocked   All Time Blocked
2017-01-21_16:43:56.85534 INFO 16:43:56 [ScheduledTasks:1]: MutationStage       128     47794  1015525068         0                  0
2017-01-21_16:43:56.85535
2017-01-21_16:43:56.85535 INFO 16:43:56 [ScheduledTasks:1]: ReadStage            64     20202   450508940         0                  0
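To make the numbers easier to read, here is a rough Python sketch that pulls
the pause length and the dropped-message counts out of the lines above. It
assumes the two values in the "local pause" line are nanoseconds (that is my
reading of it, so about 13 seconds against a 5 second threshold); the
summarize() helper and the regexes are only my own illustration, not anything
that ships with Cassandra.

```python
import re

# Sample lines copied from the log excerpt above, re-joined after email wrapping.
LOG = """\
2017-01-21_16:43:56.71033 WARN 16:43:56 [GossipTasks:1]: Not marking nodes down due to local pause of 13079498815 > 5000000000
2017-01-21_16:43:56.85532 INFO 16:43:56 [ScheduledTasks:1]: MUTATION messages were dropped in last 5000 ms: 65 for internal timeout and 10895 for cross node timeout
2017-01-21_16:43:56.85533 INFO 16:43:56 [ScheduledTasks:1]: READ messages were dropped in last 5000 ms: 33 for internal timeout and 7867 for cross node timeout
"""

# Hypothetical patterns written against the exact wording of the lines above.
PAUSE_RE = re.compile(r"local pause of (\d+) > (\d+)")
DROP_RE = re.compile(
    r"(\w+) messages were dropped in last (\d+) ms: "
    r"(\d+) for internal timeout and (\d+) for cross node timeout"
)

NANOS_PER_SECOND = 1e9  # assumption: both pause values are in nanoseconds


def summarize(log_text):
    """Return (pauses, drops) extracted from the log text.

    pauses: list of (pause_seconds, threshold_seconds) tuples
    drops:  dict mapping message type to internal / cross-node drop counts
    """
    pauses = [
        (int(pause) / NANOS_PER_SECOND, int(limit) / NANOS_PER_SECOND)
        for pause, limit in PAUSE_RE.findall(log_text)
    ]
    drops = {
        verb: {"internal": int(internal), "cross_node": int(cross)}
        for verb, _window_ms, internal, cross in DROP_RE.findall(log_text)
    }
    return pauses, drops


pauses, drops = summarize(LOG)
for pause_s, limit_s in pauses:
    print(f"local pause {pause_s:.2f}s (threshold {limit_s:.2f}s)")
for verb, counts in drops.items():
    print(verb, counts)
```

In both the MUTATION and READ lines, the cross node timeout count is far
larger than the internal timeout count.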
Any suggestions?
Thanks!
--
Dikang