Excellent, thanks for the tips, Graham. I'll give SafepointTimeout a try and see if that gives us anything to act on.
On Fri, Oct 24, 2014 at 3:52 PM, graham sanderson <gra...@vast.com> wrote: > And -XX:SafepointTimeoutDelay=xxx > > to set how long before it dumps output (defaults to 10000 I believe)… > > Note it doesn’t actually timeout by default, it just prints the > problematic threads after that time and keeps on waiting > > On Oct 24, 2014, at 2:44 PM, graham sanderson <gra...@vast.com> wrote: > > Actually - there is > > -XX:+SafepointTimeout > > which will print out offending threads (assuming you reach a 10 second > pause)… > > That is probably your best bet. > > On Oct 24, 2014, at 2:38 PM, graham sanderson <gra...@vast.com> wrote: > > This certainly *sounds* like a JVM bug. > > We are running C* 2.0.9 on pretty high end machines with pretty large > heaps, and don’t seem to have seen this (note we are on 7u67, so that might > be an interesting data point, though since the old thread predated that > probably not) > > 1) From the app/java side, I’d obviously see if you can identify anything > which always coincides with this - repair, compaction etc > 2) From the VM side (given that this as Benedict mentioned) some threads > are taking a long time to rendezvous at the safe point, and it is probably > not application threads, I’d look what GC threads, compiler threads etc > might be doing. As mentioned it shouldn’t be anything to do with operations > which run at a safe point anyway (e.g. scavenge) > a) So look at what CMS is doing at the time and see if you can correlate > b) Check Oracle for related bugs - didn’t obviously see any, but there > have been some complaints related to compilation and safe points > c) Add any compilation tracing you can > d) Kind of important here - see if you can figure out via dtrace, system > tap, gdb or whatever, what the threads are doing when this happens. Sadly > it doesn’t look like you can figure out when this is happening (until > afterwards) unless you have access to a debug JVM build (and can turn on > -XX:+TraceSafepoint and look for a safe point start without a corresponding > update within a time period) - if you don’t have access to that, I guess > you could try and get a dump every 2-3 seconds (you should catch a 9 second > pause eventually!) > > On Oct 24, 2014, at 12:35 PM, Dan van Kley <dvank...@salesforce.com> > wrote: > > I'm also curious to know if this was ever resolved or if there's any other > recommended steps to take to continue to track it down. I'm seeing the same > issue in our production cluster, which is running Cassandra 2.0.10 and JVM > 1.7u71, using the CMS collector. Just as described above, the issue is long > "Total time for which application threads were stopped" pauses that are not > a direct result of GC pauses (ParNew, initial mark or remark). When I > enabled the safepoint logging I saw the same result, long "sync" pause > times with short spin and block times, usually with the "RevokeBias" > description. We're seeing pause times sometimes in excess of 10 seconds, so > it's a pretty debilitating issue. Our machines are not swapping (or even > close to it) or having other load issues when these pauses occur. Any ideas > would be very appreciated. Thanks! > > > > >