Re: Intermittent long application pauses on nodes

graham sanderson Fri, 24 Oct 2014 12:54:30 -0700

And -XX:SafepointTimeoutDelay=xxx

to set how long it waits before it dumps that output (the value is in 
milliseconds and defaults to 10000, I believe)…

Note that it doesn’t actually time out by default; it just prints the 
problematic threads after that delay and keeps on waiting
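
Since the timeout output only shows up once a pause reaches the delay, it may 
be worth checking first how often your pauses get that long. A rough sketch, 
assuming your GC log contains the "Total time for which application threads 
were stopped: N seconds" lines mentioned further down the thread. The function 
name and the sample lines are made up for illustration; the 10 second default 
threshold just mirrors the default delay.

```python
import re

# Matches the HotSpot log line quoted in this thread. Anything printed after
# "seconds" (some JVM versions append more detail) is ignored.
STOPPED_RE = re.compile(
    r"Total time for which application threads were stopped: "
    r"(\d+(?:\.\d+)?) seconds"
)


def long_stops(lines, threshold_s=10.0):
    """Return (line_number, seconds) for each stop at or above threshold_s."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        match = STOPPED_RE.search(line)
        if match:
            seconds = float(match.group(1))
            if seconds >= threshold_s:
                hits.append((lineno, seconds))
    return hits


if __name__ == "__main__":
    # Illustrative sample lines, not taken from a real log.
    sample = [
        "Total time for which application threads were stopped: 0.0412310 seconds",
        "Total time for which application threads were stopped: 10.2503120 seconds",
        "Total time for which application threads were stopped: 9.1000000 seconds",
    ]
    for lineno, seconds in long_stops(sample, threshold_s=9.0):
        print(f"line {lineno}: stopped for {seconds:.3f}s")
```

If nothing ever crosses the threshold, lowering -XX:SafepointTimeoutDelay to 
just under your typical long pause is probably the next thing to try.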

> On Oct 24, 2014, at 2:44 PM, graham sanderson <gra...@vast.com> wrote:
> 
> Actually - there is 
> 
> -XX:+SafepointTimeout
> 
> which will print out offending threads (assuming you reach a 10 second pause)…
> 
> That is probably your best bet.
> 
>> On Oct 24, 2014, at 2:38 PM, graham sanderson <gra...@vast.com> wrote:
>> 
>> This certainly sounds like a JVM bug.
>> 
>> We are running C* 2.0.9 on pretty high end machines with pretty large heaps, 
>> and don’t seem to have seen this (note we are on 7u67, so that might be an 
>> interesting data point, though since the old thread predated that probably 
>> not)
>> 
>> 1) From the app/java side, I’d obviously see if you can identify anything 
>> which always coincides with this - repair, compaction etc
>> 2) From the VM side (given that, as Benedict mentioned, some threads are 
>> taking a long time to rendezvous at the safe point, and it is probably not 
>> application threads), I’d look at what GC threads, compiler threads etc 
>> might be doing. As mentioned, it shouldn’t have anything to do with 
>> operations which run at a safe point anyway (e.g. scavenge)
>>      a) So look at what CMS is doing at the time and see if you can correlate
>>      b) Check Oracle for related bugs - didn’t obviously see any, but there 
>> have been some complaints related to compilation and safe points
>>      c) Add any compilation tracing you can
>>      d) Kind of important here - see if you can figure out via DTrace, 
>> SystemTap, gdb or whatever, what the threads are doing when this happens. 
>> Sadly it doesn’t look like you can figure out when this is happening (until 
>> afterwards) unless you have access to a debug JVM build (and can turn on 
>> -XX:+TraceSafepoint and look for a safe point start without a corresponding 
>> update within a time period) - if you don’t have access to that, I guess you 
>> could try and get a dump every 2-3 seconds (you should catch a 9 second 
>> pause eventually!)
>> 
>>> On Oct 24, 2014, at 12:35 PM, Dan van Kley <dvank...@salesforce.com> wrote:
>>> 
>>> I'm also curious to know if this was ever resolved or if there are any other 
>>> recommended steps to take to continue to track it down. I'm seeing the same 
>>> issue in our production cluster, which is running Cassandra 2.0.10 and JVM 
>>> 1.7u71, using the CMS collector. Just as described above, the issue is long 
>>> "Total time for which application threads were stopped" pauses that are not 
>>> a direct result of GC pauses (ParNew, initial mark or remark). When I 
>>> enabled the safepoint logging I saw the same result, long "sync" pause 
>>> times with short spin and block times, usually with the "RevokeBias" 
>>> description. We're seeing pause times sometimes in excess of 10 seconds, so 
>>> it's a pretty debilitating issue. Our machines are not swapping (or even 
>>> close to it) or having other load issues when these pauses occur. Any ideas 
>>> would be very appreciated. Thanks!
>> 
>
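
On the "get a dump every 2-3 seconds" idea quoted above, something along these 
lines might do as a starting point. It is only a sketch: capture_dumps and its 
parameters are names I made up, and the dump command is left to the caller 
because the right tool depends on the host. It simply runs whatever command 
you give it on a fixed interval and keeps each output in its own file, so you 
can go back afterwards and pick the dumps that fall inside a long pause.

```python
import subprocess
import time
from pathlib import Path


def capture_dumps(cmd, count, interval_s, out_dir, timeout_s=30.0):
    """Run cmd `count` times, interval_s apart, saving each output to out_dir.

    cmd is an argument list for whatever dump tool you choose (hypothetical
    here). Returns the list of files written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(count):
        stamp = time.strftime("%Y%m%d-%H%M%S")
        path = out_dir / f"dump-{i:04d}-{stamp}.txt"
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s
            )
            body = result.stdout + result.stderr
        except subprocess.TimeoutExpired:
            # A dump tool that hangs is itself a useful data point, so record it.
            body = f"command timed out after {timeout_s}s\n"
        path.write_text(body)
        written.append(path)
        if i + 1 < count:
            time.sleep(interval_s)
    return written
```

One caveat, as far as I know: a plain jstack needs a safepoint itself, so it 
would most likely just block until the pause is over. A forced dump (jstack -F 
<pid> on a 7u JDK) or a gdb batch command are the candidates I would try as 
cmd, but treat both as assumptions to verify on your own setup.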
