Hi,

We have an 11/11 node cluster (two data centers) running Cassandra 2.1.15. We are observing that 3 nodes in each data center become unresponsive for short periods of time. Only these 6 nodes (i.e. 3 per data center) are affected, and on them we see many "Gossip stage has N pending tasks" warnings along with periodic commit log syncer warnings. This results in a huge number of dropped mutations, and these nodes also show high CPU usage compared to the other nodes.
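For anyone reproducing this, the symptoms above can be confirmed with nodetool on an affected node. This is a hedged sketch, not our exact procedure: it assumes nodetool is on PATH and JMX is reachable with defaults, and it degrades gracefully if run elsewhere:

```shell
# Sketch: confirm gossip backlog and mutation drops on a live node.
# Assumes nodetool (shipped with Cassandra 2.1) is on PATH.
if command -v nodetool >/dev/null 2>&1; then
  tp_out=$(nodetool tpstats)
else
  tp_out="nodetool not available on this host"
fi
# Sustained non-zero Pending on GossipStage matches the Gossiper warnings;
# the Dropped section at the bottom of tpstats shows MUTATION drops.
echo "$tp_out" | grep -E 'GossipStage|HintedHandoff|MUTATION|Dropped' || echo "$tp_out"
```

Comparing the same output on a healthy 6.7 node side by side makes the difference obvious.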
Log entries from system.log:

WARN [GossipTasks:1] 2017-01-18 10:15:51,328 Gossiper.java:748 - Gossip stage has 2 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:15:52,428 Gossiper.java:748 - Gossip stage has 3 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:15:53,529 Gossiper.java:748 - Gossip stage has 3 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:15:54,629 Gossiper.java:748 - Gossip stage has 5 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:15:55,730 Gossiper.java:748 - Gossip stage has 7 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:15:56,830 Gossiper.java:748 - Gossip stage has 9 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:15:57,930 Gossiper.java:748 - Gossip stage has 12 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:15:59,031 Gossiper.java:748 - Gossip stage has 15 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:16:00,131 Gossiper.java:748 - Gossip stage has 16 pending tasks; skipping status check (no nodes will be marked down)
WARN [PERIODIC-COMMIT-LOG-SYNCER] 2017-01-18 06:35:44,186 AbstractCommitLogService.java:105 - Out of 33 commit log syncs over the past 309s with average duration of 9506.03ms, 13 have exceeded the configured commit interval by an average of 8676.77ms

*Observation:* The 6 nodes having issues run Red Hat 6.8, while all other nodes in the cluster run 6.7. The 6.7 nodes show no gossip issues, mutation drops, CPU spikes, or IO problems. The iostat figures on the affected nodes are also worse than on the healthy nodes, and we see threads blocked in HintedHandoff.
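The PERIODIC-COMMIT-LOG-SYNCER warning says fsyncs of the commit log are averaging ~9.5s, which points at the disk behind commitlog_directory rather than at Cassandra itself. A hedged way to check this (the device name sda is an assumption; substitute whatever backs your commit log mount, and note iostat comes from the sysstat package):

```shell
# Sketch: measure latency on the device assumed to back commitlog_directory.
# CL_DEV=sda is a placeholder; replace with your actual commit log disk.
CL_DEV=${CL_DEV:-sda}
if command -v iostat >/dev/null 2>&1; then
  # High await / %util here would explain the slow periodic syncs,
  # which in turn back up writes and cause the MUTATION drops.
  io_out=$(iostat -x 2 3 | grep -E "Device|^$CL_DEV" )
else
  io_out="iostat (sysstat package) not installed on this host"
fi
echo "$io_out"
```

If the commit log shares a spindle with the data directories, moving it to a dedicated device is the usual first remedy.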
*-bash-4.1$ sar 2 7*

10:29:51 AM  CPU  %user  %nice  %system  %iowait  %steal  %idle
10:29:53 AM  all  36.83   4.86     5.51     4.65    0.00  48.15
10:29:55 AM  all  41.09   4.98     6.68     2.28    0.00  44.97
10:29:57 AM  all  37.53   4.71     5.52     2.04    0.00  50.21
10:29:59 AM  all  35.82   4.76     4.97     2.14    0.00  52.31
10:30:01 AM  all  33.90   3.53     3.82     2.62    0.00  56.13
10:30:03 AM  all  31.78   2.64     3.53     4.65    0.00  57.39
10:30:05 AM  all  33.27   3.90     3.95     2.76    0.00  56.12
Average:     all  35.74   4.20     4.85     3.02    0.00  52.19

*Threads are also blocked in HintedHandoff:*

"HintedHandoff:1" daemon prio=1 os_prio=4 tid=0x440 nid=0x44 [ JVM locked by VM (w/poll advisory bit) acquiring VM lock 'org.apache.cassandra.db.HintedHandOffManager', polling bits: safep rstak gpgc_clean_new ]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.cassandra.db.HintedHandOffManager.compact(HintedHandOffManager.java:269)
        at org.apache.cassandra.db.HintedHandOffManager$5.run(HintedHandOffManager.java:563)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

We are working with the system team to check whether there is an IO issue and whether this is associated with the futex_wait kernel problem. Any other pointers to fix this issue would be great.

Thanks,
Sermandurai.
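P.S. Since we suspect futex_wait, here is the rough check we plan to run alongside the system team. This is only a sketch: JAVA_PID is a placeholder for the Cassandra process id, reading /proc/<pid>/task/*/stack typically requires root, and the exact affected RHEL 6.x kernel builds should be verified against the Red Hat erratum rather than taken from this snippet:

```shell
# Sketch: record the kernel build so it can be compared against the
# known RHEL 6.x futex_wait hang erratum, and look for threads parked
# in futex in the target JVM's kernel stacks.
kver=$(uname -r 2>/dev/null || echo unknown)
echo "kernel: $kver"
# JAVA_PID is a placeholder; default to this shell's PID so the sketch
# runs harmlessly outside a Cassandra host.
JAVA_PID=${JAVA_PID:-$$}
grep -l futex /proc/"$JAVA_PID"/task/*/stack 2>/dev/null | head -5 || true
```

If many threads sit in futex_wait while the node is "unresponsive", that strengthens the kernel-bug theory; if not, the IO numbers above remain the prime suspect.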