Hi y'all, I'm writing to a cluster at a fairly high rate and seeing some odd behavior, seemingly on one node at a time. The node starts consuming more and more memory (the instance has 48 GB of memory; we're on G1GC). tpstats shows that MemtableReclaimMemory Pending starts growing first, and then MutationStage builds up as well. By that point most of the memory is consumed, GC pauses get longer, the node slows down, and everything stays slow unless I kill the node. The number of active MemtableReclaimMemory threads also seems to stay at 1. Interestingly, neither CPU nor disk utilization is pegged while this is going on; we're on JBOD and there's plenty of headroom there. (Note that a fair number of compactions are running as well, but that's expected on these nodes, and this particular one is catching up from a high volume of writes.)
Anyone have any theories on why this would be happening?

$ nodetool tpstats
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                   192    715481      311327142         0                 0
ReadStage                         7         0        9142871         0                 0
RequestResponseStage              1         0      690823199         0                 0
ReadRepairStage                   0         0        2145627         0                 0
CounterMutationStage              0         0              0         0                 0
HintedHandoff                     0         0            144         0                 0
MiscStage                         0         0              0         0                 0
CompactionExecutor               12        24          41022         0                 0
MemtableReclaimMemory             1       102           4263         0                 0
PendingRangeCalculator            0         0             10         0                 0
GossipStage                       0         0         148329         0                 0
MigrationStage                    0         0              0         0                 0
MemtablePostFlush                 0         0           5233         0                 0
ValidationExecutor                0         0              0         0                 0
Sampler                           0         0              0         0                 0
MemtableFlushWriter               0         0           4270         0                 0
InternalResponseStage             0         0       16322698         0                 0
AntiEntropyStage                  0         0              0         0                 0
CacheCleanupExecutor              0         0              0         0                 0
Native-Transport-Requests        25         0      547935519         0           2586907

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
MUTATION                287057
COUNTER_MUTATION             0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                149
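In case it's useful for comparing snapshots over time, here's a minimal sketch of how I'm pulling the Active/Pending columns out of the tpstats pool table. It assumes the five-integer-column layout shown above; `parse_tpstats` and `sample` are just illustrative names, not anything from nodetool itself:

```python
import re

def parse_tpstats(text):
    """Parse the pool table from `nodetool tpstats` output.

    Returns a dict mapping pool name -> (active, pending). Assumes each
    pool line is a name followed by exactly five integer columns
    (Active, Pending, Completed, Blocked, All time blocked).
    """
    pools = {}
    for line in text.splitlines():
        m = re.match(r"^(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$", line)
        if m:
            pools[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return pools

# Example using two rows from the output above.
sample = """\
MutationStage                   192    715481      311327142         0                 0
MemtableReclaimMemory             1       102           4263         0                 0
"""
print(parse_tpstats(sample)["MemtableReclaimMemory"])  # -> (1, 102)
```

Running that every few seconds and diffing the Pending values per pool makes it easy to see that MemtableReclaimMemory climbs before MutationStage does.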