And -XX:SafepointTimeoutDelay=xxx to set how long before it dumps output (defaults to 10000 I believe)…
Note it doesn’t actually timeout by default, it just prints the problematic threads after that time and keeps on waiting > On Oct 24, 2014, at 2:44 PM, graham sanderson <gra...@vast.com> wrote: > > Actually - there is > > -XX:+SafepointTimeout > > which will print out offending threads (assuming you reach a 10 second pause)… > > That is probably your best bet. > >> On Oct 24, 2014, at 2:38 PM, graham sanderson <gra...@vast.com >> <mailto:gra...@vast.com>> wrote: >> >> This certainly sounds like a JVM bug. >> >> We are running C* 2.0.9 on pretty high end machines with pretty large heaps, >> and don’t seem to have seen this (note we are on 7u67, so that might be an >> interesting data point, though since the old thread predated that probably >> not) >> >> 1) From the app/java side, I’d obviously see if you can identify anything >> which always coincides with this - repair, compaction etc >> 2) From the VM side (given that this as Benedict mentioned) some threads are >> taking a long time to rendezvous at the safe point, and it is probably not >> application threads, I’d look what GC threads, compiler threads etc might be >> doing. As mentioned it shouldn’t be anything to do with operations which run >> at a safe point anyway (e.g. scavenge) >> a) So look at what CMS is doing at the time and see if you can correlate >> b) Check Oracle for related bugs - didn’t obviously see any, but there >> have been some complaints related to compilation and safe points >> c) Add any compilation tracing you can >> d) Kind of important here - see if you can figure out via dtrace, >> system tap, gdb or whatever, what the threads are doing when this happens. >> Sadly it doesn’t look like you can figure out when this is happening (until >> afterwards) unless you have access to a debug JVM build (and can turn on >> -XX:+TraceSafepoint and look for a safe point start without a corresponding >> update within a time period) - if you don’t have access to that, I guess you >> could try and get a dump every 2-3 seconds (you should catch a 9 second >> pause eventually!) >> >>> On Oct 24, 2014, at 12:35 PM, Dan van Kley <dvank...@salesforce.com >>> <mailto:dvank...@salesforce.com>> wrote: >>> >>> I'm also curious to know if this was ever resolved or if there's any other >>> recommended steps to take to continue to track it down. I'm seeing the same >>> issue in our production cluster, which is running Cassandra 2.0.10 and JVM >>> 1.7u71, using the CMS collector. Just as described above, the issue is long >>> "Total time for which application threads were stopped" pauses that are not >>> a direct result of GC pauses (ParNew, initial mark or remark). When I >>> enabled the safepoint logging I saw the same result, long "sync" pause >>> times with short spin and block times, usually with the "RevokeBias" >>> description. We're seeing pause times sometimes in excess of 10 seconds, so >>> it's a pretty debilitating issue. Our machines are not swapping (or even >>> close to it) or having other load issues when these pauses occur. Any ideas >>> would be very appreciated. Thanks! >> >
smime.p7s
Description: S/MIME cryptographic signature