Re: tomcat full GC every 2 minutes for first 6 hours after war reload

2012-06-04 Thread Rainer Jung

On 29.05.2012 17:03, Kevin wrote:

Hi

We have a cluster of tomcat servers being used on a very high volume
website. We've noticed that for the first 5-6 hours after an application
re-load, Full GC will run every 2 minutes, pausing the application for
anywhere between 5 and 20 seconds. After 5-6 hours the full GC will no
longer run until tomcat is reloaded.  This behavior is repeatable.

We discounted our EhCache settings as a potential issue because tomcat can
be killed without this issue showing up.  It's not until we actually
reload the WAR with changes (clearing the work directory) and restart
tomcat that this issue shows up.  Traffic level isn't a factor; we'll go
through our peak hours with no problem.


The servers are all dual quad core with 32GB of RAM running Centos 5 and
tomcat 6.0.18.  We've attempted every GC setting change suggested on the
net and by developers and system administrators.  Our java opts are being
played with on a daily basis; however, here are the settings that
correspond to the sample GC log below:


-server -Xmx27g -Xms27g  -XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC
-XX:+PrintTenuringDistribution  -Dsun.rmi.dgc.client.gcInterval=90
-Dsun.rmi.dgc.server.gcInterval=90 -XX:NewSize=8g -XX:SurvivorRatio=16
-verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails



343.376: [Full GC 343.378: [CMS: 1158107K->1312570K(19922944K), 3.4129290
secs] 2884580K->1312570K(27845568K), [CMS Perm : 83964K->47203K(83968K)],
3.4168600 secs] [Times: user=3.87 sys=0.02, real=3.41 secs]



20526.469: [Full GC 20526.475: [CMS: 9175755K->9210800K(19922944K),
33.1161300 secs] 13632232K->9210800K(27845568K), [CMS Perm :
83967K->53332K(83968K)], 33.1254170 secs] [Times: user=33.12 sys=0.02,
real=33.12 secs]


Looks to me as if the perm gen fills up and triggers the Full GC. Then perm
gen is back down to only about 60% used, but rapidly fills up again. So
it seems you are using a lot of classes that are only used for a short
period of time.
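
For example, reading the CMS Perm figures in the two Full GC lines above:
83964K->47203K(83968K) means perm gen was essentially at its 83968K capacity
before the collection and at 47203K (roughly 56%) afterwards; in the later
line, 83967K->53332K(83968K), it is full again and drops back to about 64%.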


Add

-XX:+TraceClassUnloading

which adds the names of the classes that are unloaded to the GC output.
Then check whether the names of the classes give you an idea why you
load and unload that many classes for such a long time (and why it finally
gets stable).
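
A minimal sketch of where that could go, assuming you set your JVM options
via CATALINA_OPTS in bin/setenv.sh (adjust to wherever your java opts
actually live):

  # log class unloading alongside the GC output;
  # -XX:+TraceClassLoading is an optional extra to see what gets (re)loaded
  CATALINA_OPTS="$CATALINA_OPTS -XX:+TraceClassUnloading -XX:+TraceClassLoading"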


Regards,

Rainer





Re: tomcat full GC every 2 minutes for first 6 hours after war reload

2012-05-30 Thread Kevin


On 5/30/12 2:38 AM, "Pid"  wrote:

>On 29/05/2012 20:50, Kevin wrote:
>> 
>> 
>> On 5/29/12 11:09 AM, "Christopher Schultz" wrote:
>> 
>> Kevin,
>> 
>> (FYI these are always a PITA to diagnose...)
>> 
>> On 5/29/12 11:03 AM, Kevin wrote:
> We have a cluster of tomcat servers being used on a very high
> volume website. We've noticed that for the first 5-6 hours after an
> application re-load, Full GC will run every 2 minutes, pausing
> the application for anywhere between 5 and 20 seconds. After 5-6
> hours the full GC will no longer run until tomcat is reloaded.
> This behavior is repeatable.
>
> We discounted our EhCache settings as a potential issue because
> tomcat can be killed without this issue showing up.  It's not until
> we actually reload the WAR with changes (clearing the work
> directory) and restart tomcat that this issue shows up.  Traffic
> level isn't a factor; we'll go through our peak hours with no
> problem.
>> 
>> All good things to know.
>> 
> The servers are all dual quad core with 32GB of RAM running Centos
> 5 and tomcat 6.0.18.  We've attempted every GC setting change
> suggested on the net and by developers and system
> administrators.
>> 
>> Did you try looking at what the cause might be, or just blindly apply
>> suggestions to see if something would stick?
>> 
>>> Initially we looked closely at it, trying to solve the problem; as time
>>> wore on we tried anything.  This has been going on for months now.
>> 
>> 
>> 
> Our java opts are being played with on a daily basis; however, here
> are the settings that correspond to the sample GC log below:
>
>
> -server -Xmx27g -Xms27g  -XX:+DisableExplicitGC
> -XX:+UseConcMarkSweepGC -XX:+PrintTenuringDistribution
> -Dsun.rmi.dgc.client.gcInterval=90
> -Dsun.rmi.dgc.server.gcInterval=90 -XX:NewSize=8g
> -XX:SurvivorRatio=16 -verbose:gc -XX:+PrintGCTimeStamps
> -XX:+PrintGCDetails
>> 
>> A new size of 8GiB seems high to me, but I've never run with a total
>> heap size of anything nearly as large as 27GiB, so maybe that's a
>> reasonable size. My opinion here is probably irrelevant.
>> 
> 215.501: [GC 215.504: [ParNew Desired survivor size 238583808
> bytes, new threshold 15 (max 15) - age   1:   50457968 bytes,
> 50457968 total : 7456799K->111048K(7922624K), 0.0617110 secs]
> 8614906K->1269155K(27845568K), 0.0661400 secs] [Times: user=0.68
> sys=0.00, real=0.07 secs]
>> 
>> Odd that user+sys > real. Hmm.
>> 
>> [...]
>> 
> 316.456: [GC 316.459: [ParNew Desired survivor size 238583808
> bytes, new threshold 15 (max 15) - age   1:   41430416 bytes,
> 41430416 total - age   3:   22728376 bytes,   64158792 total - age
> 5:   19599960 bytes,   83758752 total - age   6:   21847616 bytes,
> 105606368 total - age   7:   27667592 bytes,  133273960 total - age
> 8:  10904 bytes,  133284864 total - age   9:   31824256 bytes,
> 165109120 total : 7650333K->215213K(7922624K), 0.1332630 secs]
> 8808440K->1373320K(27845568K), 0.1380590 secs] [Times: user=1.45
> sys=0.01, real=0.14 secs]
>> 
>> 1.45 seconds for a full-GC isn't anywhere near the 5s-20s claimed
>> above. Have you been able to actually instrument a 5s-20s GC-pause?
>> Or, are you saying that your webapp appears to stall for that long and
>> those observances coincide with full-GC runs?
>> 
> 343.376: [Full GC 343.378: [CMS: 1158107K->1312570K(19922944K),
> 3.4129290 secs] 2884580K->1312570K(27845568K), [CMS Perm :
> 83964K->47203K(83968K)], 3.4168600 secs] [Times: user=3.87
> sys=0.02, real=3.41 secs]
>> 
>> Also relatively short.
>> 
>>> They fluctuate.
>> 
>> 
> **Last Full GC**
>
> 20517.892: [GC 20517.898: [ParNew Desired survivor size 238583808
> bytes, new threshold 15 (max 15) - age   1:   33948208 bytes,
> 33948208 total - age   2:  88280 bytes,   34036488 total - age
> 3:   19872472 bytes,   53908960 total - age   4:   16072608 bytes,
> 69981568 total - age   5:   15718712 bytes,   85700280 total - age
> 6:   15771016 bytes,  101471296 total - age   7:   16895976 bytes,
> 118367272 total - age   8:   24233728 bytes,  142601000 total :
> 7618727K->200950K(7922624K), 0.1728420 secs]
> 16794482K->9376705K(27845568K), 0.1822350 secs] [Times: user=2.21
> sys=0.01, real=0.18 secs]
>> 
>> Note that the above is not a "Full GC". This one /is/:
>> 
> 20526.469: [Full GC 20526.475: [CMS:
> 9175755K->9210800K(19922944K), 33.1161300 secs]
> 13632232K->9210800K(27845568K), [CMS Perm :
> 83967K->53332K(83968K)], 33.1254170 secs] [Times: user=33.12
> sys=0.02, real=33.12 secs]
>> 
>> Okay, that one's a doozy.
>> 
>> It's weird that it looks like after the full GC, the heap has *lost*
>> some space ("CMS: 9175755K->9210800K"). Can you watch the logs and
>> take a heap dump (which may take a LONG time with such a big heap,
>> so be prepared for another long pause) after one of these full GCs?

Re: tomcat full GC every 2 minutes for first 6 hours after war reload

2012-05-30 Thread Pid
On 29/05/2012 16:03, Kevin wrote:
> Hi
> 
> We have a cluster of tomcat servers being used on a very high volume
> website. We've noticed that for the first 5-6 hours after an application
> re-load, Full GC will run every 2 minutes, pausing the application for
> anywhere between 5 and 20 seconds. After 5-6 hours the full GC will no
> longer run until tomcat is reloaded.  This behavior is repeatable.
>
> We discounted our EhCache settings as a potential issue because tomcat can
> be killed without this issue showing up.  It's not until we actually
> reload the WAR with changes (clearing the work directory) and restart
> tomcat that this issue shows up.  Traffic level isn't a factor; we'll go
> through our peak hours with no problem.
> 
> 
> The servers are all dual quad core with 32GB of RAM running Centos 5 and
> tomcat 6.0.18.  We've attempted every GC setting change suggested on the
> net and by developers and system administrators.  Our
> java opts are being played with on a daily basis; however, here are the
> settings that correspond to the sample GC log below:

Exactly which version of Java are you using?


> -server -Xmx27g -Xms27g  -XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC
> -XX:+PrintTenuringDistribution  -Dsun.rmi.dgc.client.gcInterval=90
> -Dsun.rmi.dgc.server.gcInterval=90 -XX:NewSize=8g -XX:SurvivorRatio=16
> -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails

You are doing a hot deployment with a 27G heap?

How big is permgen?

Are you completely sure the application doesn't have any memory leaks?

Have you tried running your application on a more recent version of Tomcat?
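
For reference, a rough way to watch perm gen occupancy and to look for
classloader leaks on a running instance, assuming a Sun JDK and the standard
JDK tools (<pid> is a placeholder for the Tomcat process id):

  # P column = perm gen utilisation in percent, sampled every 5 seconds
  jstat -gcutil <pid> 5000

  # per-classloader statistics; several dead copies of the webapp's
  # classloader left over after a reload usually point at a leak
  jmap -permstat <pid>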


p


> 215.501: [GC 215.504: [ParNew
> Desired survivor size 238583808 bytes, new threshold 15 (max 15)
> - age   1:   50457968 bytes,   50457968 total
> : 7456799K->111048K(7922624K), 0.0617110 secs]
> 8614906K->1269155K(27845568K), 0.0661400 secs] [Times: user=0.68 sys=0.00,
> real=0.07 secs]
> 
> 215.577: [GC 215.579: [ParNew
> Desired survivor size 238583808 bytes, new threshold 15 (max 15)
> - age   1:  66288 bytes,  66288 total
> - age   2:   50219144 bytes,   50285432 total
> : 114868K->66525K(7922624K), 0.0381810 secs]
> 1272975K->1224632K(27845568K), 0.0413630 secs] [Times: user=0.46 sys=0.00,
> real=0.04 secs]
> 
> 236.177: [GC 236.180: [ParNew
> Desired survivor size 238583808 bytes, new threshold 15 (max 15)
> - age   1:   45071064 bytes,   45071064 total
> - age   2:  26112 bytes,   45097176 total
> - age   3:   34785960 bytes,   79883136 total
> : 7523165K->110355K(7922624K), 0.0921350 secs]
> 8681272K->1268462K(27845568K), 0.0969290 secs] [Times: user=0.95 sys=0.01,
> real=0.10 secs]
> 
> ...
> 
> 316.456: [GC 316.459: [ParNew
> Desired survivor size 238583808 bytes, new threshold 15 (max 15)
> - age   1:   41430416 bytes,   41430416 total
> - age   3:   22728376 bytes,   64158792 total
> - age   5:   19599960 bytes,   83758752 total
> - age   6:   21847616 bytes,  105606368 total
> - age   7:   27667592 bytes,  133273960 total
> - age   8:  10904 bytes,  133284864 total
> - age   9:   31824256 bytes,  165109120 total
> : 7650333K->215213K(7922624K), 0.1332630 secs]
> 8808440K->1373320K(27845568K), 0.1380590 secs] [Times: user=1.45 sys=0.01,
> real=0.14 secs]
> 
> 338.851: [GC 338.854: [ParNew
> Desired survivor size 238583808 bytes, new threshold 15 (max 15)
> - age   1:   40678840 bytes,   40678840 total
> - age   2:   27075936 bytes,   67754776 total
> - age   4:   20399720 bytes,   88154496 total
> - age   6:   19271008 bytes,  107425504 total
> - age   7:   21655032 bytes,  129080536 total
> - age   8:   27118800 bytes,  156199336 total
> - age   9:  10904 bytes,  156210240 total
> - age  10:   31747808 bytes,  187958048 total
> : 7671853K->285541K(7922624K), 0.1456470 secs]
> 8829960K->1443648K(27845568K), 0.1503540 secs] [Times: user=1.62 sys=0.01,
> real=0.15 secs]
> 343.376: [Full GC 343.378: [CMS: 1158107K->1312570K(19922944K), 3.4129290
> secs] 2884580K->1312570K(27845568K), [CMS Perm : 83964K->47203K(83968K)],
> 3.4168600 secs] [Times: user=3.87 sys=0.02, real=3.41 secs]
> 
> **Last Full GC**
> 
> 20517.892: [GC 20517.898: [ParNew
> Desired survivor size 238583808 bytes, new threshold 15 (max 15)
> - age   1:   33948208 bytes,   33948208 total
> - age   2:  88280 bytes,   34036488 total
> - age   3:   19872472 bytes,   53908960 total
> - age   4:   16072608 bytes,   69981568 total
> - age   5:   15718712 bytes,   85700280 total
> - age   6:   15771016 bytes,  101471296 total
> - age   7:   16895976 bytes,  118367272 total
> - age   8:   24233728 bytes,  142601000 total
> : 7618727K->200950K(7922624K), 0.1728420 secs]
> 16794482K->9376705K(27845568K), 0.1822350 secs] [Times: user=2.21
> sys=0.01, real=0.18 secs]
> 
> 20526.469: [Full GC 20526.475: [CMS: 9175755K->9210800K(19922944K),
> 33.1161300 secs] 13632232K->9210800K(27845568K), [CMS Perm :
> 83967K->53332K(83968K)], 33.1254170 secs] [Times: user=33.12 sys=0.02,
> real=33.12 secs]
> 
> 
> 
> **

Re: tomcat full GC every 2 minutes for first 6 hours after war reload

2012-05-30 Thread Pid
On 29/05/2012 20:50, Kevin wrote:
> 
> 
> On 5/29/12 11:09 AM, "Christopher Schultz" wrote:
> 
> Kevin,
> 
> (FYI these are always a PITA to diagnose...)
> 
> On 5/29/12 11:03 AM, Kevin wrote:
 We have a cluster of tomcat servers being used on a very high
 volume website. We've noticed that for the first 5-6 hours after an
 application re-load, Full GC will run every 2 minutes, pausing
 the application for anywhere between 5 and 20 seconds. After 5-6
 hours the full GC will no longer run until tomcat is reloaded.
 This behavior is repeatable.

 We discounted our EhCache settings as a potential issue because
 tomcat can be killed without this issue showing up.  It's not until
 we actually reload the WAR with changes (clearing the work
 directory) and restart tomcat that this issue shows up.  Traffic
 level isn't a factor; we'll go through our peak hours with no
 problem.
> 
> All good things to know.
> 
 The servers are all dual quad core with 32GB of RAM running Centos
 5 and tomcat 6.0.18.  We've attempted every GC setting change
 suggested on the net and by developers and system
 administrators.
> 
> Did you try looking at what the cause might be, or just blindly apply
> suggestions to see if something would stick?
> 
>> Initially we looked closely at it, trying to solve the problem; as time wore
>> on we tried anything.  This has been going on for months now.
> 
> 
> 
 Our java opts are being played with on a daily basis; however, here
 are the settings that correspond to the sample GC log below:


 -server -Xmx27g -Xms27g  -XX:+DisableExplicitGC
 -XX:+UseConcMarkSweepGC -XX:+PrintTenuringDistribution
 -Dsun.rmi.dgc.client.gcInterval=90
 -Dsun.rmi.dgc.server.gcInterval=90 -XX:NewSize=8g
 -XX:SurvivorRatio=16 -verbose:gc -XX:+PrintGCTimeStamps
 -XX:+PrintGCDetails
> 
> A new size of 8GiB seems high to me, but I've never run with a total
> heap size of anything nearly as large as 27GiB, so maybe that's a
> reasonable size. My opinion here is probably irrelevant.
> 
 215.501: [GC 215.504: [ParNew Desired survivor size 238583808
 bytes, new threshold 15 (max 15) - age   1:   50457968 bytes,
 50457968 total : 7456799K->111048K(7922624K), 0.0617110 secs]
 8614906K->1269155K(27845568K), 0.0661400 secs] [Times: user=0.68
 sys=0.00, real=0.07 secs]
> 
> Odd that user+sys > real. Hmm.
> 
> [...]
> 
 316.456: [GC 316.459: [ParNew Desired survivor size 238583808
 bytes, new threshold 15 (max 15) - age   1:   41430416 bytes,
 41430416 total - age   3:   22728376 bytes,   64158792 total - age
 5:   19599960 bytes,   83758752 total - age   6:   21847616 bytes,
 105606368 total - age   7:   27667592 bytes,  133273960 total - age
 8:  10904 bytes,  133284864 total - age   9:   31824256 bytes,
 165109120 total : 7650333K->215213K(7922624K), 0.1332630 secs]
 8808440K->1373320K(27845568K), 0.1380590 secs] [Times: user=1.45
 sys=0.01, real=0.14 secs]
> 
> 1.45 seconds for a full-GC isn't anywhere near the 5s-20s claimed
> above. Have you been able to actually instrument a 5s-20s GC-pause?
> Or, are you saying that your webapp appears to stall for that long and
> those observances coincide with full-GC runs?
> 
 343.376: [Full GC 343.378: [CMS: 1158107K->1312570K(19922944K),
 3.4129290 secs] 2884580K->1312570K(27845568K), [CMS Perm :
 83964K->47203K(83968K)], 3.4168600 secs] [Times: user=3.87
 sys=0.02, real=3.41 secs]
> 
> Also relatively short.
> 
>> They fluctuate. 
> 
> 
 **Last Full GC**

 20517.892: [GC 20517.898: [ParNew Desired survivor size 238583808
 bytes, new threshold 15 (max 15) - age   1:   33948208 bytes,
 33948208 total - age   2:  88280 bytes,   34036488 total - age
 3:   19872472 bytes,   53908960 total - age   4:   16072608 bytes,
 69981568 total - age   5:   15718712 bytes,   85700280 total - age
 6:   15771016 bytes,  101471296 total - age   7:   16895976 bytes,
 118367272 total - age   8:   24233728 bytes,  142601000 total :
 7618727K->200950K(7922624K), 0.1728420 secs]
 16794482K->9376705K(27845568K), 0.1822350 secs] [Times: user=2.21
 sys=0.01, real=0.18 secs]
> 
> Note that the above is not a "Full GC". This one /is/:
> 
 20526.469: [Full GC 20526.475: [CMS:
 9175755K->9210800K(19922944K), 33.1161300 secs]
 13632232K->9210800K(27845568K), [CMS Perm :
 83967K->53332K(83968K)], 33.1254170 secs] [Times: user=33.12
 sys=0.02, real=33.12 secs]
> 
> Okay, that one's a doozy.
> 
> It's weird that it looks like after the full GC, the heap has *lost*
> some space ("CMS: 9175755K->9210800K"). Can you watch the logs and
> take a heap dump (which may take a LONG time with such a big heap, so
> be prepared for another long pause) after one of these full GCs?
> 
> It would be better to get a heap dump both before AND after the full
> GC, but that can ruin your performance if you enable those automated
> GC options.

Re: tomcat full GC every 2 minutes for first 6 hours after war reload

2012-05-29 Thread Christopher Schultz

Kevin,

On 5/29/12 3:50 PM, Kevin wrote:
> On 5/29/12 11:09 AM, "Christopher Schultz" wrote:
>> It would be better to get a heap dump both before AND after the
>> full GC, but that can ruin your performance if you enable those
>> automated GC options.
> 
> We can run this after our next code release in two days.

Okay.

>> Are you populating huge caches on startup or anything like that? 
>> Generational GC performance is related to the number of
>> surviving objects that have to be moved from one generation to
>> the next, so if you have a huge population of objects surviving
>> each collection one after the other (which is what I might expect
>> if you had an enormous cache being filled all at once) then your
>> performance will suck for a while until those objects are tenured
>> enough to avoid periodic collections.
> 
> This was our initial assumption; however, since we're using an
> in-memory EhCache we would expect this issue to recur when tomcat
> is restarted. It doesn't.  It'll only show up if we stop tomcat,
> reload the WAR and clean the work directory.

Have you tried re-starting your webapp without restarting Tomcat? Does
EhCache do any kind of magic optimization on the disk, where the
removal of the 'work' directory would have some direct consequence?
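
(For example, assuming the manager webapp is deployed and /myapp stands in
for your context path, requesting

  http://localhost:8080/manager/reload?path=/myapp

should reload just that webapp on Tomcat 6 without restarting the JVM; the
exact host, port and credentials depend on your setup.)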

>> Are you able to inspect the various generations before and after
>> these pauses? I'd be interested to see if you have a huge
>> population of objects being tenured at once and what types of
>> objects they are (they might be something obvious like instances
>> of our.big.fat.CacheEntry or something like that).
> 
> We'll give it a try

With any (relative) luck, you'll have another event where everything
sucks for 2 hours, you'll get a huge full GC with heap dumps on either
side of it, and then you can inspect them. Again, remember that taking
full heap dumps around full GCs is going to murder the performance of your
full GCs (yes, it can get worse). Unfortunately, that also means that
disabling those options means another JVM restart which means another
2 hours of agony. The good news (I guess) is that you can do all these
restarts during off-hours (do you have any off-hours?) and not impact
your users too much.
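
For the record, the automated options in question are presumably something
along these lines (a sketch only; with a 27g heap each dump file will be
huge, so point HeapDumpPath at a disk with plenty of free space):

  -XX:+HeapDumpBeforeFullGC -XX:+HeapDumpAfterFullGC
  -XX:HeapDumpPath=/path/with/space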

- -chris




Re: tomcat full GC every 2 minutes for first 6 hours after war reload

2012-05-29 Thread Kevin


On 5/29/12 11:09 AM, "Christopher Schultz" wrote:

>
>Kevin,
>
>(FYI these are always a PITA to diagnose...)
>
>On 5/29/12 11:03 AM, Kevin wrote:
>> We have a cluster of tomcat servers being used on a very high
>> volume website. We've noticed that for the first 5-6 hours after an
>> application re-load, Full GC will run every 2 minutes, pausing
>> the application for anywhere between 5 and 20 seconds. After 5-6
>> hours the full GC will no longer run until tomcat is reloaded.
>> This behavior is repeatable.
>> 
>> We discounted our EhCache settings as a potential issue because
>> tomcat can be killed without this issue showing up.  It's not until
>> we actually reload the WAR with changes (clearing the work
>> directory) and restart tomcat that this issue shows up.  Traffic
>> level isn't a factor; we'll go through our peak hours with no
>> problem.
>
>All good things to know.
>
>> The servers are all dual quad core with 32GB of RAM running Centos
>> 5 and tomcat 6.0.18.  We've attempted every GC setting change
>> suggested on the net and by developers and system
>> administrators.
>
>Did you try looking at what the cause might be, or just blindly apply
>suggestions to see if something would stick?

Initially we looked closely at it, trying to solve the problem; as time wore
on we tried anything.  This has been going on for months now.


>
>> Our java opts are being played with on a daily basis; however, here
>> are the settings that correspond to the sample GC log below:
>> 
>> 
>> -server -Xmx27g -Xms27g  -XX:+DisableExplicitGC
>> -XX:+UseConcMarkSweepGC -XX:+PrintTenuringDistribution
>> -Dsun.rmi.dgc.client.gcInterval=90
>> -Dsun.rmi.dgc.server.gcInterval=90 -XX:NewSize=8g
>> -XX:SurvivorRatio=16 -verbose:gc -XX:+PrintGCTimeStamps
>> -XX:+PrintGCDetails
>
>A new size of 8GiB seems high to me, but I've never run with a total
>heap size of anything nearly as large as 27GiB, so maybe that's a
>reasonable size. My opinion here is probably irrelevant.
>
>> 215.501: [GC 215.504: [ParNew Desired survivor size 238583808
>> bytes, new threshold 15 (max 15) - age   1:   50457968 bytes,
>> 50457968 total : 7456799K->111048K(7922624K), 0.0617110 secs]
>> 8614906K->1269155K(27845568K), 0.0661400 secs] [Times: user=0.68
>> sys=0.00, real=0.07 secs]
>
>Odd that user+sys > real. Hmm.
>
>[...]
>
>> 316.456: [GC 316.459: [ParNew Desired survivor size 238583808
>> bytes, new threshold 15 (max 15) - age   1:   41430416 bytes,
>> 41430416 total - age   3:   22728376 bytes,   64158792 total - age
>> 5:   19599960 bytes,   83758752 total - age   6:   21847616 bytes,
>> 105606368 total - age   7:   27667592 bytes,  133273960 total - age
>> 8:  10904 bytes,  133284864 total - age   9:   31824256 bytes,
>> 165109120 total : 7650333K->215213K(7922624K), 0.1332630 secs]
>> 8808440K->1373320K(27845568K), 0.1380590 secs] [Times: user=1.45
>> sys=0.01, real=0.14 secs]
>
>1.45 seconds for a full-GC isn't anywhere near the 5s-20s claimed
>above. Have you been able to actually instrument a 5s-20s GC-pause?
>Or, are you saying that your webapp appears to stall for that long and
>those observances coincide with full-GC runs?
>
>> 343.376: [Full GC 343.378: [CMS: 1158107K->1312570K(19922944K),
>> 3.4129290 secs] 2884580K->1312570K(27845568K), [CMS Perm :
>> 83964K->47203K(83968K)], 3.4168600 secs] [Times: user=3.87
>> sys=0.02, real=3.41 secs]
>
>Also relatively short.

They fluctuate. 

>
>> **Last Full GC**
>> 
>> 20517.892: [GC 20517.898: [ParNew Desired survivor size 238583808
>> bytes, new threshold 15 (max 15) - age   1:   33948208 bytes,
>> 33948208 total - age   2:  88280 bytes,   34036488 total - age
>> 3:   19872472 bytes,   53908960 total - age   4:   16072608 bytes,
>> 69981568 total - age   5:   15718712 bytes,   85700280 total - age
>> 6:   15771016 bytes,  101471296 total - age   7:   16895976 bytes,
>> 118367272 total - age   8:   24233728 bytes,  142601000 total :
>> 7618727K->200950K(7922624K), 0.1728420 secs]
>> 16794482K->9376705K(27845568K), 0.1822350 secs] [Times: user=2.21
>> sys=0.01, real=0.18 secs]
>
>Note that the above is not a "Full GC". This one /is/:
>
>> 20526.469: [Full GC 20526.475: [CMS:
>> 9175755K->9210800K(19922944K), 33.1161300 secs]
>> 13632232K->9210800K(27845568K), [CMS Perm :
>> 83967K->53332K(83968K)], 33.1254170 secs] [Times: user=33.12
>> sys=0.02, real=33.12 secs]
>
>Okay, that one's a doozy.
>
>It's weird that it looks like after the full GC, the heap has *lost*
>some space ("CMS: 9175755K->9210800K"). Can you watch the logs and
>take a heap dump (which may take a LONG time with such a big heap, so
>be prepared for another long pause) after one of these full GCs?
>
>It would be better to get a heap dump both before AND after the full
>GC, but that can ruin your performance if you enable those automated
>GC options.

We can run this after our next code release in two days.

>
>> Any help would be greatly appreciated.
>

Re: tomcat full GC every 2 minutes for first 6 hours after war reload

2012-05-29 Thread Christopher Schultz

Kevin,

(FYI these are always a PITA to diagnose...)

On 5/29/12 11:03 AM, Kevin wrote:
> We have a cluster of tomcat servers being used on a very high
> volume website. We've noticed that for the first 5-6 hours after an
> application re-load, Full GC will run every 2 minutes, pausing
> the application for anywhere between 5 and 20 seconds. After 5-6
> hours the full GC will no longer run until tomcat is reloaded.
> This behavior is repeatable.
> 
> We discounted our EhCache settings as a potential issue because
> tomcat can be killed without this issue showing up.  It's not until
> we actually reload the WAR with changes (clearing the work
> directory) and restart tomcat that this issue shows up.  Traffic
> level isn't a factor; we'll go through our peak hours with no
> problem.

All good things to know.

> The servers are all dual quad core with 32GB of RAM running Centos
> 5 and tomcat 6.0.18.  We've attempted every GC setting change
> suggested on the net and by developers and system
> administrators.

Did you try looking at what the cause might be, or just blindly apply
suggestions to see if something would stick?

> Our java opts are being played with on a daily basis; however, here
> are the settings that correspond to the sample GC log below:
> 
> 
> -server -Xmx27g -Xms27g  -XX:+DisableExplicitGC
> -XX:+UseConcMarkSweepGC -XX:+PrintTenuringDistribution
> -Dsun.rmi.dgc.client.gcInterval=90 
> -Dsun.rmi.dgc.server.gcInterval=90 -XX:NewSize=8g
> -XX:SurvivorRatio=16 -verbose:gc -XX:+PrintGCTimeStamps
> -XX:+PrintGCDetails

A new size of 8GiB seems high to me, but I've never run with a total
heap size of anything nearly as large as 27GiB, so maybe that's a
reasonable size. My opinion here is probably irrelevant.
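
As a sanity check on those numbers: with -XX:NewSize=8g and
-XX:SurvivorRatio=16, each survivor space is 8g/18, roughly 455MiB, and the
"Desired survivor size 238583808 bytes" in the logs is about half of that,
which matches the default TargetSurvivorRatio of 50%.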

> 215.501: [GC 215.504: [ParNew Desired survivor size 238583808
> bytes, new threshold 15 (max 15) - age   1:   50457968 bytes,
> 50457968 total : 7456799K->111048K(7922624K), 0.0617110 secs] 
> 8614906K->1269155K(27845568K), 0.0661400 secs] [Times: user=0.68
> sys=0.00, real=0.07 secs]

Odd that user+sys > real. Hmm.

[...]

> 316.456: [GC 316.459: [ParNew Desired survivor size 238583808
> bytes, new threshold 15 (max 15) - age   1:   41430416 bytes,
> 41430416 total - age   3:   22728376 bytes,   64158792 total - age
> 5:   19599960 bytes,   83758752 total - age   6:   21847616 bytes,
> 105606368 total - age   7:   27667592 bytes,  133273960 total - age
> 8:  10904 bytes,  133284864 total - age   9:   31824256 bytes,
> 165109120 total : 7650333K->215213K(7922624K), 0.1332630 secs] 
> 8808440K->1373320K(27845568K), 0.1380590 secs] [Times: user=1.45
> sys=0.01, real=0.14 secs]

1.45 seconds for a full-GC isn't anywhere near the 5s-20s claimed
above. Have you been able to actually instrument a 5s-20s GC-pause?
Or, are you saying that your webapp appears to stall for that long and
those observances coincide with full-GC runs?

> 343.376: [Full GC 343.378: [CMS: 1158107K->1312570K(19922944K),
> 3.4129290 secs] 2884580K->1312570K(27845568K), [CMS Perm :
> 83964K->47203K(83968K)], 3.4168600 secs] [Times: user=3.87
> sys=0.02, real=3.41 secs]

Also relatively short.

> **Last Full GC**
> 
> 20517.892: [GC 20517.898: [ParNew Desired survivor size 238583808
> bytes, new threshold 15 (max 15) - age   1:   33948208 bytes,
> 33948208 total - age   2:  88280 bytes,   34036488 total - age
> 3:   19872472 bytes,   53908960 total - age   4:   16072608 bytes,
> 69981568 total - age   5:   15718712 bytes,   85700280 total - age
> 6:   15771016 bytes,  101471296 total - age   7:   16895976 bytes,
> 118367272 total - age   8:   24233728 bytes,  142601000 total :
> 7618727K->200950K(7922624K), 0.1728420 secs] 
> 16794482K->9376705K(27845568K), 0.1822350 secs] [Times: user=2.21 
> sys=0.01, real=0.18 secs]

Note that the above is not a "Full GC". This one /is/:

> 20526.469: [Full GC 20526.475: [CMS:
> 9175755K->9210800K(19922944K), 33.1161300 secs]
> 13632232K->9210800K(27845568K), [CMS Perm : 
> 83967K->53332K(83968K)], 33.1254170 secs] [Times: user=33.12
> sys=0.02, real=33.12 secs]

Okay, that one's a doozy.

It's weird that it looks like after the full GC, the heap has *lost*
some space ("CMS: 9175755K->9210800K"). Can you watch the logs and
take a heap dump (which may take a LONG time with such a big heap, so
be prepared for another long pause) after one of these full GCs?

It would be better to get a heap dump both before AND after the full
GC, but that can ruin your performance if you enable those automated
GC options.
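
A one-off dump can also be taken by hand with jmap, assuming a Sun JDK
(<pid> is a placeholder for the Tomcat process id; note that the "live"
option itself triggers a full GC and the JVM is paused while the dump is
written):

  jmap -dump:live,format=b,file=/path/with/space/heap.hprof <pid>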

> Any help would be greatly appreciated.

Are you populating huge caches on startup or anything like that?
Generational GC performance is related to the number of surviving
objects that have to be moved from one generation to the next, so if
you have a huge population of objects surviving each collection one
after the other (which is what I might expect if you had an enormous
cache being filled all at once) then your performance will suck for a
while until those objects are tenured enough to avoid periodic
collections.
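
For what it's worth, the tenuring distributions you are already logging
point the same way: in the 20517.892 ParNew above, about 136MB of objects
(142601000 bytes across ages 1-8) are still live in the survivor spaces
and get copied on every young collection, while the old generation sits
at around 9GB.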