Actually - there is 

-XX:+SafepointTimeout

which will print out the offending threads (assuming you reach a 10 second pause, which is the default -XX:SafepointTimeoutDelay of 10000 ms)…

That is probably your best bet.
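
Since that option only reports once a pause is long enough, it may help to check first how often the log shows stops near that length. Below is a minimal Python sketch, assuming the GC log contains the "Total time for which application threads were stopped" lines quoted below (as printed with -XX:+PrintGCApplicationStoppedTime). The function name and threshold are illustrative only and not part of any JVM or Cassandra tooling.

```python
import re

# Matches the stopped-time line from the GC log. Anything printed before the
# message (timestamps) or after "seconds" (extra fields) is ignored.
STOPPED = re.compile(
    r"Total time for which application threads were stopped: "
    r"(\d+(?:\.\d+)?) seconds"
)


def long_stops(lines, threshold_s=1.0):
    """Yield (line_number, seconds) for every stop longer than threshold_s."""
    for lineno, line in enumerate(lines, start=1):
        match = STOPPED.search(line)
        if match is None:
            continue  # not a stopped-time line
        seconds = float(match.group(1))
        if seconds > threshold_s:
            yield lineno, seconds
```

Passing an open log file as `lines` works because files iterate line by line. The line numbers it reports can then be matched against the safepoint statistics printed around the same place in the log.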

> On Oct 24, 2014, at 2:38 PM, graham sanderson <gra...@vast.com> wrote:
> 
> This certainly sounds like a JVM bug.
> 
> We are running C* 2.0.9 on pretty high end machines with pretty large heaps, 
> and don’t seem to have seen this (note we are on 7u67, so that might be an 
> interesting data point, though since the old thread predated that probably 
> not)
> 
> 1) From the app/java side, I’d obviously see if you can identify anything 
> which always coincides with this - repair, compaction etc
> 2) From the VM side: given that, as Benedict mentioned, some threads are 
> taking a long time to rendezvous at the safe point, and it is probably not 
> application threads, I’d look at what GC threads, compiler threads etc might be 
> doing. As mentioned, it shouldn’t be anything to do with operations which run 
> at a safe point anyway (e.g. scavenge)
>       a) So look at what CMS is doing at the time and see if you can correlate
>       b) Check Oracle for related bugs - didn’t obviously see any, but there 
> have been some complaints related to compilation and safe points
>       c) Add any compilation tracing you can
>       d) Kind of important here - see if you can figure out via dtrace, 
> system tap, gdb or whatever, what the threads are doing when this happens. 
> Sadly it doesn’t look like you can figure out when this is happening (until 
> afterwards) unless you have access to a debug JVM build (and can turn on 
> -XX:+TraceSafepoint and look for a safe point start without a corresponding 
> update within a time period) - if you don’t have access to that, I guess you 
> could try and get a dump every 2-3 seconds (you should catch a 9 second pause 
> eventually!)
> 
>> On Oct 24, 2014, at 12:35 PM, Dan van Kley <dvank...@salesforce.com> wrote:
>> 
>> I'm also curious to know if this was ever resolved or if there's any other 
>> recommended steps to take to continue to track it down. I'm seeing the same 
>> issue in our production cluster, which is running Cassandra 2.0.10 and JVM 
>> 1.7u71, using the CMS collector. Just as described above, the issue is long 
>> "Total time for which application threads were stopped" pauses that are not 
>> a direct result of GC pauses (ParNew, initial mark or remark). When I 
>> enabled the safepoint logging I saw the same result, long "sync" pause times 
>> with short spin and block times, usually with the "RevokeBias" description. 
>> We're seeing pause times sometimes in excess of 10 seconds, so it's a pretty 
>> debilitating issue. Our machines are not swapping (or even close to it) or 
>> having other load issues when these pauses occur. Any ideas would be very 
>> appreciated. Thanks!
> 
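
The fallback suggested in the quoted message, taking a dump every 2-3 seconds until one lands inside a long pause, could be scripted roughly as follows. This is only a sketch: `sample_stacks` is a hypothetical helper, and the dump command is whatever OS-level tool you trust (gdb, for example), passed in as an argument list. The `run` and `sleep` parameters exist only so the loop can be exercised without a real process.

```python
import subprocess
import time


def sample_stacks(command, interval_s=2.0, samples=5,
                  run=subprocess.run, sleep=time.sleep):
    """Run `command` `samples` times, pausing `interval_s` between runs.

    Returns a list of (start_timestamp, combined_output) pairs so each dump
    can be lined up against the timestamps in the GC log afterwards.
    """
    results = []
    for i in range(samples):
        started = time.time()
        proc = run(command, stdout=subprocess.PIPE,
                   stderr=subprocess.STDOUT, text=True)
        results.append((started, proc.stdout))
        if i + 1 < samples:
            sleep(interval_s)  # the interval is measured from the end of a dump
    return results
```

For example, `sample_stacks(["gdb", "-p", "12345", "-batch", "-ex", "thread apply all bt"])` would collect native backtraces from a process with the placeholder pid 12345. Attaching a debugger stops the process briefly, so this is probably something to try on one node, not across the cluster.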
