Nodes becoming unresponsive

2020-02-05 Thread Surbhi Gupta
Hi, We have noticed that in a Cassandra cluster, one of the nodes has 100% CPU utilization. Using top, we can see that the Cassandra process is showing futex_wait. We are on CentOS release 6.10 (Final). As per the document below, the futex bug was in CentOS 6.6. https://support.datastax.com/hc/en-us/articles

Re: Nodes becoming unresponsive

2020-02-05 Thread Erick Ramirez
I wrote that article 5 years ago but I didn't think it would still be relevant today. 😁 Have you tried to do a thread dump to see which are the most dominant threads? That's the most effective way of troubleshooting high CPU situations. Cheers!
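
A minimal sketch of how the "most dominant threads" could be identified from a thread dump. Sending SIGQUIT (`kill -3 <pid>`) makes a HotSpot JVM print the dump to its standard output rather than to a separate file, so the text has to be collected from wherever the Cassandra process's stdout is redirected (that location is installation specific). The sketch assumes Java 8 style dump headers carrying an `nid=0x...` field, and it assumes the busy thread IDs were read separately from `top -H -p <pid>`. The function names and the sample dump are hypothetical, for illustration only.

```python
import re

# Java 8 style thread-dump header line, for example:
# "MutationStage-3" #123 daemon prio=5 os_prio=0 tid=0x... nid=0x4e21 runnable [0x...]
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)".*?\bnid=(?P<nid>0x[0-9a-fA-F]+)')


def threads_by_nid(dump_text):
    """Map native thread id (as an int) to thread name for every header in the dump."""
    result = {}
    for line in dump_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match:
            # nid is hexadecimal in the dump; top -H shows the same id in decimal.
            result[int(match.group("nid"), 16)] = match.group("name")
    return result


def name_hot_threads(dump_text, hot_tids):
    """Return (tid, thread name) pairs for the decimal thread ids seen in top -H."""
    names = threads_by_nid(dump_text)
    return [(tid, names.get(tid, "<not in dump>")) for tid in hot_tids]


if __name__ == "__main__":
    # Hypothetical two-thread dump, only to show the expected input shape.
    sample_dump = (
        '"MutationStage-3" #123 daemon prio=5 os_prio=0 '
        'tid=0x00007f0000000000 nid=0x4e21 runnable [0x00007f0000001000]\n'
        '   java.lang.Thread.State: RUNNABLE\n'
        '"GossipStage:1" #45 daemon prio=5 os_prio=0 '
        'tid=0x00007f0000002000 nid=0x4e22 waiting on condition [0x00007f0000003000]\n'
    )
    for tid, name in name_hot_threads(sample_dump, [20001]):
        print(tid, name)
```

Matching on the native ID works with a plain JRE, since neither `kill -3` nor `top -H` needs any JDK tooling.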

Re: Nodes becoming unresponsive

2020-02-05 Thread Jeff Jirsa
The bug is in the kernel - it'd be worth looking at your specific kernel via `uname -a` just to confirm you're not somehow running an old kernel. If you're sure you're on a good kernel, then yea, thread inspection is your next step. https://github.com/aragozin/jvm-tools/blob/master/sjk-core/docs/TT
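
The kernel check suggested above could be scripted roughly as follows. This is a sketch only: it parses a release string of the kind printed by `uname -r` and compares it with a minimum version. The `MINIMUM_FIXED` value is an assumption used as a placeholder and should be confirmed against the distribution's own advisory for the futex fix; the helper names are hypothetical.

```python
import platform
import re


def parse_kernel_release(release):
    """Turn '2.6.32-504.16.2.el6.x86_64' into the tuple (2, 6, 32, 504, 16, 2)."""
    parts = []
    for token in re.split(r"[.-]", release):
        if not token.isdigit():
            break  # stop at the first non-numeric part, such as 'el6'
        parts.append(int(token))
    return tuple(parts)


def is_at_least(release, minimum):
    """True if the numeric part of `release` is not older than `minimum`."""
    return parse_kernel_release(release) >= parse_kernel_release(minimum)


if __name__ == "__main__":
    # ASSUMPTION: placeholder threshold; verify it against the vendor advisory.
    MINIMUM_FIXED = "2.6.32-504.16.2"
    running = platform.release()  # same string that `uname -r` reports
    print(running, "at or above threshold:", is_at_least(running, MINIMUM_FIXED))
```

The comparison only looks at the leading numeric fields, so it says nothing about vendor backports; it is a quick sanity check, not proof that a given kernel carries the fix.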

Re: Nodes becoming unresponsive

2020-02-05 Thread Erick Ramirez
Surbhi, just a *friendly* reminder that it's customary to reply back to the mailing list instead of emailing me directly so that everyone else in the list can participate. ☺

> I tried taking thread dump using kill -3 but it just came back and
> no file generated.
> How do you take the thread dum

Re: Nodes becoming unresponsive

2020-02-05 Thread Surbhi Gupta
Sure Erick... I tried strace as well...

Re: Nodes becoming unresponsive

2020-02-06 Thread Elliott Sims
Async-profiler (https://github.com/jvm-profiling-tools/async-profiler) flamegraphs can also be a really good tool to figure out the exact callgraph that's leading to the futex_wait, both in and out of the JVM.

Re: Nodes becoming unresponsive

2020-02-06 Thread Surbhi Gupta
I have limited options for using JDK-based tools because our environment runs only the JRE. I tried to debug more and could see in the top output that the command is MutationStage. Does this give us any clue?

top - 16:30:47 up 94 days, 5:33, 1 user, load average: 134.83, 142.48, 144.75 Ta

Re: Nodes becoming unresponsive

2020-02-06 Thread Erick Ramirez
> I tried to debug more and could see using top that Command is
> MutationStage in top output , Any clue we get from this ?

That just means there's lots of writes hitting your cluster. Without the thread dump, it would be difficult to know if the threads are blocked by futex_wait or whatever

Nodes becoming unresponsive intermediately (Gossip stage pending)

2017-01-18 Thread Sermandurai Konar
Hi, We have an 11/11 node cluster running Cassandra 2.1.15. We are observing that 3 nodes from each data center become unresponsive for short periods of time. This behavior is happening only in 6 nodes (i.e. 3 from each data center) and we are seeing a lot of Gossip stage has pending tas