Hi, I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and am seeing the following behavior. The cluster does function for a while, but then some stages begin to back up, and the affected node never recovers and never drains the pending tasks, even under no load. This happens to both MutationStage and GossipStage.
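For reference, read repair is disabled everywhere: both read_repair_chance and dclocal_read_repair_chance are 0.0 on every table. A check like the one below confirms it ('myks' and 'mytable' are placeholder names, not our real ones):

    -- Placeholder keyspace name; every row comes back 0.0 / 0.0.
    SELECT table_name, read_repair_chance, dclocal_read_repair_chance
      FROM system_schema.tables
     WHERE keyspace_name = 'myks';

    -- The settings were applied per table along these lines:
    ALTER TABLE myks.mytable
      WITH read_repair_chance = 0.0
       AND dclocal_read_repair_chance = 0.0;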
I do see the following exception in the logs:

ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440 CassandraDaemon.java:228 - Exception in thread Thread[ReadRepairStage:2328,5,main]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 1 responses.
        at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_91]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_91]
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]

But it's hard to correlate it precisely with when things go bad. It is also very strange to me, since (as noted above) read repair is disabled on ALL of my tables, so it's confusing why ReadRepairStage would err at all.

Anyone have thoughts on this? It's pretty puzzling, and it causes nodes to lock up. Once it happens, Cassandra can't even shut down; I have to kill -9. If I can't find a resolution, I'm going to need to downgrade and restore from backup...

The only issue I found that looked similar is https://issues.apache.org/jira/browse/CASSANDRA-12689, but that appears to have been fixed in 3.10.

$ nodetool tpstats
Pool Name                        Active   Pending   Completed   Blocked   All time blocked
ReadStage                             0         0      582103         0                  0
MiscStage                             0         0           0         0                  0
CompactionExecutor                   11        11        2868         0                  0
MutationStage                        32   4593678    55057393         0                  0
GossipStage                           1      2818      371487         0                  0
RequestResponseStage                  0         0     4345522         0                  0
ReadRepairStage                       0         0      151473         0                  0
CounterMutationStage                  0         0           0         0                  0
MemtableFlushWriter                   1        81          76         0                  0
MemtablePostFlush                     1       382         139         0                  0
ValidationExecutor                    0         0           0         0                  0
ViewMutationStage                     0         0           0         0                  0
CacheCleanupExecutor                  0         0           0         0                  0
PerDiskMemtableFlushWriter_10         0         0          69         0                  0
PerDiskMemtableFlushWriter_11         0         0          69         0                  0
MemtableReclaimMemory                 0         0          81         0                  0
PendingRangeCalculator                0         0          32         0                  0
SecondaryIndexManagement              0         0           0         0                  0
HintsDispatcher                       0         0         596         0                  0
PerDiskMemtableFlushWriter_1          0         0          69         0                  0
Native-Transport-Requests            11         0     4547746         0                 67
PerDiskMemtableFlushWriter_2          0         0          69         0                  0
MigrationStage                        1      1545         586         0                  0
PerDiskMemtableFlushWriter_0          0         0          80         0                  0
Sampler                               0         0           0         0                  0
PerDiskMemtableFlushWriter_5          0         0          69         0                  0
InternalResponseStage                 0         0       45432         0                  0
PerDiskMemtableFlushWriter_6          0         0          69         0                  0
PerDiskMemtableFlushWriter_3          0         0          69         0                  0
PerDiskMemtableFlushWriter_4          0         0          69         0                  0
PerDiskMemtableFlushWriter_9          0         0          69         0                  0
AntiEntropyStage                      0         0           0         0                  0
PerDiskMemtableFlushWriter_7          0         0          69         0                  0
PerDiskMemtableFlushWriter_8          0         0          69         0                  0

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
HINT                         0
MUTATION                     0
COUNTER_MUTATION             0
BATCH_STORE                  0
BATCH_REMOVE                 0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0

-dan