Hi Frank,

We got a (quite) long GC pause today on 2.0.5:

INFO [ScheduledTasks:1] 2014-02-20 13:51:14,528 GCInspector.java (line 116) GC for ParNew: 1627 ms for 1 collections, 425562984 used; max is 4253024256
INFO [ScheduledTasks:1] 2014-02-20 13:51:14,542 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 3703 ms for 2 collections, 434394920 used; max is 4253024256

Unfortunately it's a production cluster, so I have no additional GC logging enabled. This may be an indication that upgrading is not the (complete) solution.

Regards,
Joel

2014-02-17 13:41 GMT+01:00 Benedict Elliott Smith <belliottsm...@datastax.com>:

> Hi Ondrej,
>
> It's possible you were hit by the problems in this thread before, but it looks potentially like you may have other issues. Of course it may be that on G1 you have one issue and CMS another, but 27s is extreme even for G1, so it seems unlikely. If you're hitting these pause times in CMS and you get some more output from the safepoint tracing, please do contribute, as I would love to get to the bottom of that. However, is it possible you're experiencing paging activity? Have you made certain the VM memory is locked (and preferably that paging is entirely disabled)? The bloom filters and other memory won't be locked, although that shouldn't cause pauses during GC.
>
> Note that mmapped file accesses and other native work shouldn't in any way inhibit GC activity or other safepoint pause times, unless there's a bug in the VM. These threads will simply enter a safepoint as they return to the VM execution context, and are considered safe for the duration they are outside.
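As an aside, the safepoint tracing Benedict asks about above can be captured with standard HotSpot flags, e.g. in cassandra-env.sh (a sketch for Oracle/OpenJDK 7; the log path is illustrative):

    # GC detail plus total stopped time, which includes time-to-safepoint
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
    # Per-safepoint statistics: vm operation, thread counts, spin/block/sync/vmop times
    JVM_OPTS="$JVM_OPTS -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"

And his memory-locking point can be checked at the OS level (again only a sketch; exact steps vary by distro):

    # Let the cassandra user mlock the heap (Cassandra calls mlockall via JNA when it can)
    echo "cassandra - memlock unlimited" >> /etc/security/limits.d/cassandra.conf
    # Disable paging entirely, per the suggestion above
    swapoff -a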
> On 17 February 2014 12:30, Ondřej Černoš <cern...@gmail.com> wrote:
>
>> Hi,
>>
>> we tried to switch to G1 because we observed this behaviour on CMS too (a 27-second pause in G1 is quite a strong argument not to use it). Pauses with CMS were not easily traceable - the JVM stopped even without a stop-the-world pause scheduled (defragmentation, remarking). We thought time-to-safepoint waiting might have been involved (we saw threads waiting for safepoint resolution) - especially because access to mmapped files is not preemptible, afaik - but it doesn't explain tens of seconds of waiting; even slow IO should read our sstables into memory in much less time. We switched to G1 out of desperation - and to try different code paths - not because we thought it was a great idea. So I think we were hit by the problem discussed in this thread, just the G1 report wasn't very clear, sorry.
>>
>> regards,
>> ondrej
>>
>> On Mon, Feb 17, 2014 at 11:45 AM, Benedict Elliott Smith <belliottsm...@datastax.com> wrote:
>>
>>> Ondrej,
>>>
>>> It seems like your issue is much less difficult to diagnose: your collection times are long. At least, the pause you printed the time for is all attributable to the G1 pause.
>>>
>>> Note that G1 has not generally performed well with Cassandra in our testing. There are a number of changes going in soon that may change that, but for the time being it is advisable to stick with CMS. With tuning you can no doubt bring your pauses down considerably.
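As a reference point for that tuning, the stock CMS settings shipped in cassandra-env.sh of the 1.2/2.0 era look roughly like the lines below; the usual first knobs are the young-generation size (HEAP_NEWSIZE / -Xmn) and the tenuring threshold. A sketch only - appropriate values depend on heap size and workload:

    JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"
    JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
    JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1"
    JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly"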
>>> On 17 February 2014 10:17, Ondřej Černoš <cern...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> we are seeing the same kind of long pauses in Cassandra. We tried to switch from CMS to G1 without positive result. The stress test is read heavy: 2 datacenters, 6 nodes, 400 reqs/sec on one datacenter. We see spikes in latency at the 99.99 percentile and higher, caused by threads being stopped in the JVM.
>>>>
>>>> The GC in G1 looks like this:
>>>>
>>>> {Heap before GC invocations=4073 (full 1):
>>>> garbage-first heap total 8388608K, used 3602914K [0x00000005f5c00000, 0x00000007f5c00000, 0x00000007f5c00000)
>>>> region size 4096K, 142 young (581632K), 11 survivors (45056K)
>>>> compacting perm gen total 28672K, used 27428K [0x00000007f5c00000, 0x00000007f7800000, 0x0000000800000000)
>>>> the space 28672K, 95% used [0x00000007f5c00000, 0x00000007f76c9108, 0x00000007f76c9200, 0x00000007f7800000)
>>>> No shared spaces configured.
>>>> 2014-02-17T04:44:16.385+0100: 222346.218: [GC pause (G1 Evacuation Pause) (young)
>>>> Desired survivor size 37748736 bytes, new threshold 15 (max 15)
>>>> - age 1: 17213632 bytes, 17213632 total
>>>> - age 2: 19391208 bytes, 36604840 total
>>>> , 0.1664300 secs]
>>>> [Parallel Time: 163.9 ms, GC Workers: 2]
>>>> [GC Worker Start (ms): Min: 222346218.3, Avg: 222346218.3, Max: 222346218.3, Diff: 0.0]
>>>> [Ext Root Scanning (ms): Min: 6.0, Avg: 6.9, Max: 7.7, Diff: 1.7, Sum: 13.7]
>>>> [Update RS (ms): Min: 20.4, Avg: 21.3, Max: 22.1, Diff: 1.7, Sum: 42.6]
>>>> [Processed Buffers: Min: 49, Avg: 60.0, Max: 71, Diff: 22, Sum: 120]
>>>> [Scan RS (ms): Min: 23.2, Avg: 23.2, Max: 23.3, Diff: 0.1, Sum: 46.5]
>>>> [Object Copy (ms): Min: 112.3, Avg: 112.3, Max: 112.4, Diff: 0.1, Sum: 224.6]
>>>> [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum: 0.1]
>>>> [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
>>>> [GC Worker Total (ms): Min: 163.8, Avg: 163.8, Max: 163.8, Diff: 0.0, Sum: 327.6]
>>>> [GC Worker End (ms): Min: 222346382.1, Avg: 222346382.1, Max: 222346382.1, Diff: 0.0]
>>>> [Code Root Fixup: 0.0 ms]
>>>> [Clear CT: 0.4 ms]
>>>> [Other: 2.1 ms]
>>>> [Choose CSet: 0.0 ms]
>>>> [Ref Proc: 1.1 ms]
>>>> [Ref Enq: 0.0 ms]
>>>> [Free CSet: 0.4 ms]
>>>> [Eden: 524.0M(524.0M)->0.0B(476.0M) Survivors: 44.0M->68.0M Heap: 3518.5M(8192.0M)->3018.5M(8192.0M)]
>>>> Heap after GC invocations=4074 (full 1):
>>>> garbage-first heap total 8388608K, used 3090914K [0x00000005f5c00000, 0x00000007f5c00000, 0x00000007f5c00000)
>>>> region size 4096K, 17 young (69632K), 17 survivors (69632K)
>>>> compacting perm gen total 28672K, used 27428K [0x00000007f5c00000, 0x00000007f7800000, 0x0000000800000000)
>>>> the space 28672K, 95% used [0x00000007f5c00000, 0x00000007f76c9108, 0x00000007f76c9200, 0x00000007f7800000)
>>>> No shared spaces configured.
>>>> }
>>>> [Times: user=0.35 sys=0.00, real=27.58 secs]
>>>> 222346.219: G1IncCollectionPause [ 111 0 0 ] [ 0 0 0 0 27586 ] 0
>>>>
>>>> And the total time for which application threads were stopped is 27.58 seconds.
>>>>
>>>> CMS behaves in a similar manner. We thought it was GC waiting for mmapped files being read from disk (the thread cannot reach a safepoint during this operation), but that doesn't explain the huge time.
>>>>
>>>> We'll try jhiccup to see if it provides any additional information. The test was done in a mixed aws/openstack environment, openjdk 1.7.0_45, cassandra 1.2.11. Upgrading to 2.0.x is not an option for us.
>>>>
>>>> regards,
>>>>
>>>> ondrej cernos
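For completeness, jHiccup (mentioned above) runs as a Java agent and records process-wide pause histograms independently of the GC log, which helps separate "the collector was slow" from "the JVM or OS stalled us". A minimal sketch, assuming an illustrative jar path and log location:

    # -d: ms to wait before recording, -i: reporting interval in ms, -l: output log name
    java -javaagent:/opt/jhiccup/jHiccup.jar="-d 0 -i 1000 -l /var/log/cassandra/hiccup" ...

If jHiccup records a 27s hiccup while the GC log attributes almost nothing to collection work (compare user=0.35s against real=27.58s in the log above), the stall lies outside the collector: time-to-safepoint, paging, or the hypervisor.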
>>>> On Fri, Feb 14, 2014 at 8:53 PM, Frank Ng <fnt...@gmail.com> wrote:
>>>>
>>>>> Sorry, I have not had a chance to file a JIRA ticket. We have not been able to resolve the issue. But since Joel mentioned that upgrading to Cassandra 2.0.X solved it for them, we may need to upgrade. We are currently on Java 1.7 and Cassandra 1.2.8.
>>>>>
>>>>> On Thu, Feb 13, 2014 at 12:40 PM, Keith Wright <kwri...@nanigans.com> wrote:
>>>>>
>>>>>> You're running 2.0.* in production? May I ask what C* version and OS? Any hardware details would be appreciated as well. Thx!
>>>>>>
>>>>>> From: Joel Samuelsson <samuelsson.j...@gmail.com>
>>>>>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>>>> Date: Thursday, February 13, 2014 at 11:39 AM
>>>>>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>>>> Subject: Re: Intermittent long application pauses on nodes
>>>>>>
>>>>>> We have had similar issues, and upgrading C* to 2.0.x and Java to 1.7 seems to have helped.
>>>>>>
>>>>>> 2014-02-13 Keith Wright <kwri...@nanigans.com>:
>>>>>>
>>>>>>> Frank, did you ever file a ticket for this issue or find the root cause? I believe we are seeing the same issues when attempting to bootstrap.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> From: Robert Coli <rc...@eventbrite.com>
>>>>>>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>>>>> Date: Monday, February 3, 2014 at 6:10 PM
>>>>>>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>>>>> Subject: Re: Intermittent long application pauses on nodes
>>>>>>>
>>>>>>> On Mon, Feb 3, 2014 at 8:52 AM, Benedict Elliott Smith <belliottsm...@datastax.com> wrote:
>>>>>>>
>>>>>>>> It's possible that this is a JVM issue, but if so there may be some remedial action we can take anyway. There are some more flags we should add, but we can discuss that once you open a ticket. If you could include the strange JMX error as well, that might be helpful.
>>>>>>>
>>>>>>> It would be appreciated if you could inform this thread of the JIRA ticket number, for the benefit of the community and google searchers. :)
>>>>>>>
>>>>>>> =Rob