My searching my list archives shows this thread evaporated. Was a root cause ever found? Very curious.
On Mon, Feb 3, 2014 at 11:52 AM, Benedict Elliott Smith < belliottsm...@datastax.com> wrote: > Hi Frank, > > The "9391" under RevokeBias is the number of milliseconds spent > synchronising on the safepoint prior to the VM operation, i.e. the time it > took to ensure all application threads were stopped. So this is the > culprit. Notice that the time spent spinning/blocking for the threads we > are supposed to be waiting on is very low; it looks to me that this is time > spent waiting for CMS threads to yield, though it is very hard to say with > absolute certainty. It doesn't look like the issue is actually the > RevokeBias itself, anyway. > > I think we should take this off list. It definitely warrants a ticket, > though I expect this will be difficult to pin down, so you will have to be > willing to experiment a bit with us, but we would be very grateful for the > help. If you can pin down and share a specific workload that triggers this > we may be able to do it without you though! > > It's possible that this is a JVM issue, but if so there may be some > remedial action we can take anyway. There are some more flags we should > add, but we can discuss that once you open a ticket. If you could include > the strange JMX error as well, that might be helpful. > > Thanks, > > Benedict > > > On 3 February 2014 15:34, Frank Ng <fnt...@gmail.com> wrote: > >> I was able to send SafePointStatistics to another log file via the >> additional JVM flags and recently noticed a pause of 9.3936600 seconds. >> Here are the log entries: >> >> GC Log file: >> ------------------- >> 2014-01-31T07:49:14.755-0500: 137460.842: Total time for which >> application threads were stopped: 0.1095540 seconds >> 2014-01-31T07:51:01.870-0500: 137567.957: Total time for which >> application threads were stopped: 9.3936600 seconds >> 2014-01-31T07:51:02.537-0500: 137568.623: Total time for which >> application threads were stopped: 0.1207440 seconds >> >> JVM Stdout Log File: >> ------------------------------- >> vmop [threads: total initially_running >> wait_to_block] [time: spin block sync cleanup vmop] page_trap_count >> 137460.734: GenCollectForAllocation [ 421 >> 0 0 ] [ 0 0 23 0 84 ] 0 >> vmop [threads: total initially_running >> wait_to_block] [time: spin block sync cleanup vmop] page_trap_count >> 137558.562: RevokeBias [ 462 >> 2 9 ] [ 13 0 9391 1 0 ] 0 >> <writer thread='47436187662656'/> >> <dependency_failed type='leaf_type' >> ctxk='javax/management/ObjectName$Property' >> witness='javax/management/ObjectName$PatternProperty' stamp='137568.503'/> >> <writer thread='47436033530176'/> >> vmop [threads: total initially_running >> wait_to_block] [time: spin block sync cleanup vmop] page_trap_count >> 137568.500: Deoptimize [ 481 >> 1 5 ] [ 0 0 118 0 1 ] 0 >> vmop [threads: total initially_running >> wait_to_block] [time: spin block sync cleanup vmop] page_trap_count >> 137569.625: no vm operation [ 483 >> 0 1 ] [ 0 0 18 0 0 ] 0 >> vmop [threads: total initially_running >> wait_to_block] [time: spin block sync cleanup vmop] page_trap_count >> 137571.641: no vm operation [ 483 >> 0 1 ] [ 0 0 42 1 0 ] 0 >> vmop [threads: total initially_running >> wait_to_block] [time: spin block sync cleanup vmop] page_trap_count >> 137575.703: no vm operation [ 483 >> 0 1 ] [ 0 0 25 1 0 ] 0 >> >> If SafepointStatistics are printed before the Application Stop times, >> then it seems that the RevokeBias was the cause of the pause. >> If SafepointStatistics are printed after the Application Stop times, then >> it seems that the Deoptimize was the cause of the pause. >> In addition, I see a strange dependency failed error relating to JMX in >> the JVM stdout log file. >> >> thanks.... >> >> >> On Wed, Jan 29, 2014 at 4:44 PM, Benedict Elliott Smith < >> belliottsm...@datastax.com> wrote: >> >>> Add some more flags: -XX:+UnlockDiagnosticVMOptions -XX:LogFile=${path} >>> -XX:+LogVMOutput >>> >>> I never figured out what kills stdout for C*. It's a library we depend >>> on, didn't try too hard to figure out which one. >>> >>> >>> On 29 January 2014 21:07, Frank Ng <fnt...@gmail.com> wrote: >>> >>>> Benedict, >>>> Thanks for the advice. I've tried turning on >>>> PrintSafepointStatistics. However, that info is only sent to the STDOUT >>>> console. The cassandra startup script closes the STDOUT when it finishes, >>>> so nothing is shown for safepoint statistics once it's done starting up. >>>> Do you know how to startup cassandra and send all stdout to a log file and >>>> tell cassandra not to close stdout? >>>> >>>> Also, we have swap turned off as recommended. >>>> >>>> thanks >>>> >>>> >>>> On Wed, Jan 29, 2014 at 3:39 PM, Benedict Elliott Smith < >>>> belliottsm...@datastax.com> wrote: >>>> >>>>> >>>>> Frank, >>>>> >>>>> >>>>> >>>>> The same advice for investigating holds: add the VM flags >>>>> -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 (you >>>>> could put something above 1 there, to reduce the amount of logging, since >>>>> a pause of 52s will be pretty obvious even if aggregated with lots of >>>>> other safe points; the count is the number of safepoints to aggregate >>>>> into one log message) >>>>> >>>>> >>>>> >>>>> 52s is a very extreme pause, and I would be surprised if revoke bias >>>>> could cause this. I wonder if the VM is swapping out. >>>>> >>>>> >>>>> >>>>> On 29 January 2014 19:02, Frank Ng <fnt...@gmail.com> wrote: >>>>> >>>>>> Thanks for the update. Our logs indicated that there were 0 pending >>>>>> for CompactionManager at that time. Also, there were no nodetool repairs >>>>>> running at that time. The log statements above state that the >>>>>> application >>>>>> had to stop to reach a safepoint. Yet, it doesn't say what is requesting >>>>>> the safepoint. >>>>>> >>>>>> >>>>>> On Wed, Jan 29, 2014 at 1:20 PM, Shao-Chuan Wang < >>>>>> shaochuan.w...@bloomreach.com> wrote: >>>>>> >>>>>>> We had similar latency spikes when pending compactions can't keep it >>>>>>> up or repair/streaming taking too much cycles. >>>>>>> >>>>>>> >>>>>>> On Wed, Jan 29, 2014 at 10:07 AM, Frank Ng <fnt...@gmail.com> wrote: >>>>>>> >>>>>>>> All, >>>>>>>> >>>>>>>> We've been having intermittent long application pauses (version >>>>>>>> 1.2.8) and not sure if it's a cassandra bug. During these pauses, >>>>>>>> there >>>>>>>> are dropped messages in the cassandra log file along with the node >>>>>>>> seeing >>>>>>>> other nodes as down. We've turned on gc logging and the following is >>>>>>>> an >>>>>>>> example of a long "stopped" or pause event in the gc.log file. >>>>>>>> >>>>>>>> 2014-01-28T23:11:12.183-0500: 1337654.424: Total time for which >>>>>>>> application threads were stopped: 0.091450 seconds >>>>>>>> 2014-01-28T23:14:11.161-0500: 1337833.401: Total time for which >>>>>>>> application threads were stopped: 51.8190260 seconds >>>>>>>> 2014-01-28T23:14:19.870-0500: 1337842.111: Total time for which >>>>>>>> application threads were stopped: 0.005470 seconds >>>>>>>> >>>>>>>> As seen above, there was a 0.091450 secs pause, then a 51.8190260 >>>>>>>> secs pause. There were no GC log events between those 2 log >>>>>>>> statements. >>>>>>>> Since there's no GC logs in between, something else must be causing the >>>>>>>> long stop time to reach a safepoint. >>>>>>>> >>>>>>>> Could there be a Cassandra thread that is taking a long time to >>>>>>>> reach a safepoint and what is it trying to do? Along with the node >>>>>>>> seeing >>>>>>>> other nodes as down in the cassandra log file, the StatusLogger shows >>>>>>>> 1599 >>>>>>>> Pending in ReadStage and 9 Pending in MutationStage. >>>>>>>> >>>>>>>> There is mention of cassandra batch revoke bias locks as a possible >>>>>>>> cause (not GC) via: >>>>>>>> http://www.mail-archive.com/user@cassandra.apache.org/msg34401.html >>>>>>>> >>>>>>>> We have JNA, no swap, and the cluster runs fine besides there >>>>>>>> intermittent long pause that can cause a node to appear down to other >>>>>>>> nodes. Any ideas as the cause of the long pause above? It seems not >>>>>>>> related to GC. >>>>>>>> >>>>>>>> thanks. >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >