Actually - there is -XX:+SafepointTimeout
which will print out offending threads (assuming you reach a 10 second pause)… That is probably your best bet. > On Oct 24, 2014, at 2:38 PM, graham sanderson <gra...@vast.com> wrote: > > This certainly sounds like a JVM bug. > > We are running C* 2.0.9 on pretty high end machines with pretty large heaps, > and don’t seem to have seen this (note we are on 7u67, so that might be an > interesting data point, though since the old thread predated that probably > not) > > 1) From the app/java side, I’d obviously see if you can identify anything > which always coincides with this - repair, compaction etc > 2) From the VM side (given that this as Benedict mentioned) some threads are > taking a long time to rendezvous at the safe point, and it is probably not > application threads, I’d look what GC threads, compiler threads etc might be > doing. As mentioned it shouldn’t be anything to do with operations which run > at a safe point anyway (e.g. scavenge) > a) So look at what CMS is doing at the time and see if you can correlate > b) Check Oracle for related bugs - didn’t obviously see any, but there > have been some complaints related to compilation and safe points > c) Add any compilation tracing you can > d) Kind of important here - see if you can figure out via dtrace, > system tap, gdb or whatever, what the threads are doing when this happens. > Sadly it doesn’t look like you can figure out when this is happening (until > afterwards) unless you have access to a debug JVM build (and can turn on > -XX:+TraceSafepoint and look for a safe point start without a corresponding > update within a time period) - if you don’t have access to that, I guess you > could try and get a dump every 2-3 seconds (you should catch a 9 second pause > eventually!) > >> On Oct 24, 2014, at 12:35 PM, Dan van Kley <dvank...@salesforce.com >> <mailto:dvank...@salesforce.com>> wrote: >> >> I'm also curious to know if this was ever resolved or if there's any other >> recommended steps to take to continue to track it down. I'm seeing the same >> issue in our production cluster, which is running Cassandra 2.0.10 and JVM >> 1.7u71, using the CMS collector. Just as described above, the issue is long >> "Total time for which application threads were stopped" pauses that are not >> a direct result of GC pauses (ParNew, initial mark or remark). When I >> enabled the safepoint logging I saw the same result, long "sync" pause times >> with short spin and block times, usually with the "RevokeBias" description. >> We're seeing pause times sometimes in excess of 10 seconds, so it's a pretty >> debilitating issue. Our machines are not swapping (or even close to it) or >> having other load issues when these pauses occur. Any ideas would be very >> appreciated. Thanks! >
smime.p7s
Description: S/MIME cryptographic signature