Re: Long ParNew GC pauses - even when young generation is small

2014-07-07 Thread Shawn Heisey
On 7/7/2014 10:22 AM, aferdous wrote:
> Hi Shawn - I was just wondering how did you resolve this issue in the end. We
> are seeing the same issue with our platform (similar heap size) and updater
> volume.
>
> It would be nice if you could provide us with your final findings/configs.

I use the following GC tuning parameters now, with FAR better results:

http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning

The G1 collector performed worse for me than untuned CMS.  The average
collection was awesome, but when it finally did do a stop-the-world,
those were even longer.

Thanks,
Shawn



Re: Long ParNew GC pauses - even when young generation is small

2014-07-07 Thread aferdous
Hi Shawn - I was just wondering how did you resolve this issue in the end. We
are seeing the same issue with our platform (similar heap size) and updater
volume.

It would be nice if you could provide us with your final findings/configs.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Long-ParNew-GC-pauses-even-when-young-generation-is-small-tp4031110p4145938.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Long ParNew GC pauses - even when young generation is small

2013-01-20 Thread giltene
If you believe the logs, using -XX:+PrintGCApplicationStoppedTime is probably
the easiest way to avoid having to try to parse pause times from various
formats. But remember, GC logs can [often unintentionally] lie (I've seen
them under-report by multi-second gaps).

If you want to actually measure your JVM pauses (GC or others), you can use
something like jHiccup (http://www.azulsystems.com/jHiccup). It is a free
(as in beer) and public domain (CC0) tool that will show you any
blip/glitch/hiccup that you jvm experiences while running your application,
and report it in both time based and detailed percentile form. What jHccup
shows you is a "best-case" response time for your applicartion as it runs
(the response time the application would have shown if completed everything
as zero work).

It's near-trivial to add jHiccup to your environment (as either a java agent
or wrapper script). It would be interesting to see the percentile histograms
(jHiccup's .hgrm text output) for your environment.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Long-ParNew-GC-pauses-even-when-young-generation-is-small-tp4031110p4034934.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Long ParNew GC pauses - even when young generation is small

2013-01-20 Thread giltene
> I don't see any info on your website about pricing, so I can't make any 
> decisions about whether it would be right for me.  Can you give me 
> long-term pricing information?

As is the case with much of enterprise software (including getting a
supported version of Oracle HotSpot), this is a sales-person conversation
that we'd be happy to have. You can ask for someone to contact you about
this right on the site, or if you want, you can contact me at gil 
azulsystems  com and I'll make sure we get you the information you
need.

> Chances are that once I inform management of the cost, it'd never fly.

You may be surprised. You seem to assume that Zing is expensive for some
reason, while it's probably on par or cheaper than other supported JVMs for
this sort of thing. It's certainly "flown with management" for others
running into the exact same problems with both Solr and Lucene. Saved them
both time and money in the process of forever removing GC headaches.
 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Long-ParNew-GC-pauses-even-when-young-generation-is-small-tp4031110p4034932.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Long ParNew GC pauses - even when young generation is small

2013-01-20 Thread Shawn Heisey
Unfortunately, G1 on Java 6 was a bust.  Several times GC pauses made my 
load balancer think the server was down, just like with CMS/ParNew.


Either there's something about my production query patterns that doesn't 
get along with any of the garbage collection methods, or I need to 
upgrade to Java 7.


I have tried lowering my max heap before.  That results in OOM problems 
when I do full-import with DIH.




On 1/20/2013 2:13 PM, Markus Jelsma wrote:

Hi Shawn,

Although our heap spaces are much less than yours (256M for 2x 2.5GB cores per 
node) we saw decreased throughput and higher latency with G1 on Java 6. You can 
also expect higher CPU consumption. You can check it very well with VisualVM 
attached.

Looking forward to your results.

Markus



-Original message-

From:Shawn Heisey 
Sent: Sun 20-Jan-2013 21:48
To: solr-user@lucene.apache.org
Subject: Re: Long ParNew GC pauses - even when young generation is small

On 1/18/2013 10:07 PM, Shawn Heisey wrote:

I may try the G1 collector with Java 6 in production, since I am on the
newest Oracle version.


I am giving this a try on my secondary server set.  An encouraging note:
The -XX:+UnlockExperimentalVMOptions option is no longer required to use
the G1 collector, at least on version 6u38.

Thanks,
Shawn






Re: Long ParNew GC pauses - even when young generation is small

2013-01-20 Thread Shawn Heisey

On 1/20/2013 2:13 PM, Markus Jelsma wrote:

Hi Shawn,

Although our heap spaces are much less than yours (256M for 2x 2.5GB cores per 
node) we saw decreased throughput and higher latency with G1 on Java 6. You can 
also expect higher CPU consumption. You can check it very well with VisualVM 
attached.

Looking forward to your results.


I don't have any really good test tools developed for testing throughput 
and latency.  I have some less-than-ideal tools for other purposes that 
I might be able to adapt.


Throughput is not a major issue for us - query volume is quite low.  I 
would be mildly surprised by 5 queries per second.  I don't have much of 
an idea of queries per second over the short term - the numbers 
available in 3.5 are limited.


As for latency, early indications from an old SOLR-1972 patch suggest 
that the QTime values might be a little higher.  The primary server 
stats (using CMS/ParNew) are over 1 million queries, and the secondary 
server stats (using G1) so far are only about 5000 queries.  The QTime 
values are steadily dropping as the number of queries goes up.


Here's a status page that gathers all the stats.  Chain A is using 
CMS/ParNew and is no longer receiving queries.  All the queries are now 
going to chain B, which is using G1.


http://dl.dropbox.com/u/97770508/g1-vs-cms-stats.png

The server CPU utilization graph doesn't have enough information yet to 
make any determination, but what little data is visible suggests that 
CPU may be higher.  The secondary servers also have slightly slower CPUs 
than the primary servers.  I was forced to make concessions on later 
purchases to keep the cost down.


Thanks,
Shawn



RE: Long ParNew GC pauses - even when young generation is small

2013-01-20 Thread Markus Jelsma
Hi Shawn,

Although our heap spaces are much less than yours (256M for 2x 2.5GB cores per 
node) we saw decreased throughput and higher latency with G1 on Java 6. You can 
also expect higher CPU consumption. You can check it very well with VisualVM 
attached. 

Looking forward to your results.

Markus

 
 
-Original message-
> From:Shawn Heisey 
> Sent: Sun 20-Jan-2013 21:48
> To: solr-user@lucene.apache.org
> Subject: Re: Long ParNew GC pauses - even when young generation is small
> 
> On 1/18/2013 10:07 PM, Shawn Heisey wrote:
> > I may try the G1 collector with Java 6 in production, since I am on the
> > newest Oracle version.
> 
> I am giving this a try on my secondary server set.  An encouraging note: 
> The -XX:+UnlockExperimentalVMOptions option is no longer required to use 
> the G1 collector, at least on version 6u38.
> 
> Thanks,
> Shawn
> 
> 


Re: Long ParNew GC pauses - even when young generation is small

2013-01-20 Thread Shawn Heisey

On 1/18/2013 10:07 PM, Shawn Heisey wrote:

I may try the G1 collector with Java 6 in production, since I am on the
newest Oracle version.


I am giving this a try on my secondary server set.  An encouraging note: 
The -XX:+UnlockExperimentalVMOptions option is no longer required to use 
the G1 collector, at least on version 6u38.


Thanks,
Shawn



Re: Long ParNew GC pauses - even when young generation is small

2013-01-20 Thread Shawn Heisey

On 1/20/2013 11:33 AM, Shawn Heisey wrote:

On 1/18/2013 10:07 PM, Shawn Heisey wrote:

On my dev 4.1 server with Java 7u11, I am using the G1 collector with a
max pause target of 1500ms.  I was thinking that this collector was
producing long pauses too, but after reviewing the gc log with a closer
eye, I see that there are lines that specifically say "pause" ... and
all of THOSE lines are below half a second except one that took 1.4
seconds.  Does that mean that it's actually meeting the target, or are
the other lines that show quite long time values indicative of a
problem?  If only the lines that explicitly say "pause" are the ones I
need to worry about, then it looks like G1 is the clear winner.


Here's a paste showing a command and its output.  I included "remark" in
the grep because I saw a presentation saying that remark in G1 is
stop-the-world:

http://pastie.org/private/vygpvtjzicsl8uztg3drw

None of the matching log lines get close to my 5 second pain point.  If
I check the entire unfiltered log for lines that exceed 3 seconds, I do
find a few, but only one of them says "pause" and it's far enough below
the 5 second level that it probably would not cause a problem:

http://pastie.org/private/wcessvbrditextxmoapksq

Here's the perl script used in the two outputs above:

http://pastie.org/private/itu9hbgiwugdjtmy3yg8g


Here's the full gc log for anyone that feels compelled to fully investigate:

http://dl.dropbox.com/u/97770508/gc.log

Thanks,
Shawn



Re: Long ParNew GC pauses - even when young generation is small

2013-01-20 Thread Shawn Heisey

On 1/18/2013 10:07 PM, Shawn Heisey wrote:

On my dev 4.1 server with Java 7u11, I am using the G1 collector with a
max pause target of 1500ms.  I was thinking that this collector was
producing long pauses too, but after reviewing the gc log with a closer
eye, I see that there are lines that specifically say "pause" ... and
all of THOSE lines are below half a second except one that took 1.4
seconds.  Does that mean that it's actually meeting the target, or are
the other lines that show quite long time values indicative of a
problem?  If only the lines that explicitly say "pause" are the ones I
need to worry about, then it looks like G1 is the clear winner.


Here's a paste showing a command and its output.  I included "remark" in 
the grep because I saw a presentation saying that remark in G1 is 
stop-the-world:


http://pastie.org/private/vygpvtjzicsl8uztg3drw

None of the matching log lines get close to my 5 second pain point.  If 
I check the entire unfiltered log for lines that exceed 3 seconds, I do 
find a few, but only one of them says "pause" and it's far enough below 
the 5 second level that it probably would not cause a problem:


http://pastie.org/private/wcessvbrditextxmoapksq

Here's the perl script used in the two outputs above:

http://pastie.org/private/itu9hbgiwugdjtmy3yg8g

The log was gathered during a full-import of six large shards, over 12 
million docs each.  The import took 7 hours.  I had the patches for 
LUCENE-4599 (Compressed TermVectors) applied to Solr 4.1 at the time.


What I'd like to know is whether a 'concurrent-mark-end' line indicates 
stop-the-world or not.  I suspect that it is done while the application 
is working.  If this is right, then I think I have found the right GC 
settings:


-XX:+UseG1GC -XX:MaxGCPauseMillis=1500 -XX:GCPauseIntervalMillis=4000

My production servers have more total memory, more CPU cores, and much 
faster I/O than the dev server where I have been running these tests, 
but they both use the same 8GB java heap.  One last question: Should I 
be worried about using the G1 collector on Oracle Java 6u38, which was 
released at the same time as 7u11?  This *might* be a good opportunity 
to upgrade to java 7 in production, actually.  I have two completely 
independent index chains, I could upgrade the secondary.


If anyone has any suggestions for my GC parsing perl script, or a knows 
about a much more functional replacement, let me know.


Thanks,
Shawn



Re: Long ParNew GC pauses - even when young generation is small

2013-01-18 Thread Shawn Heisey

On 1/18/2013 8:37 PM, Mark Miller wrote:


On Jan 6, 2013, at 5:41 PM, Shawn Heisey  wrote:


Clarification of my question and my goals:

What I *want* is for all GC pauses to be half a second or less.


I'd try working with the concurrent, low pause collector. Any of the stop the 
world collectors mixed with a large heap will likely mean a few second pauses 
at least at some points. A well tuned concurrent collector will never step the 
world in most situations.

-XX:+UseConcMarkSweepGC

I wrote an article that might be useful a while back: 
http://searchhub.org/2011/03/27/garbage-collection-bootcamp-1-0/


Mark,

I have been using that collector.  When I had a very large young 
generation (NewRatio=1), most of the really long collections were 
ParNew.  When I lowered the young generation size drastically, the 
overall situation got slightly better.  Unfortunately there are still 
long pauses, but now they are CMS.  I wrote a handy little perl script 
to parse a GC log and spit out a compact listing of every line that 
takes longer than half a second.


On my dev 4.1 server with Java 7u11, I am using the G1 collector with a 
max pause target of 1500ms.  I was thinking that this collector was 
producing long pauses too, but after reviewing the gc log with a closer 
eye, I see that there are lines that specifically say "pause" ... and 
all of THOSE lines are below half a second except one that took 1.4 
seconds.  Does that mean that it's actually meeting the target, or are 
the other lines that show quite long time values indicative of a 
problem?  If only the lines that explicitly say "pause" are the ones I 
need to worry about, then it looks like G1 is the clear winner.


My production servers are version 3.5 with Java 6u38.

After reading your bootcamp and consulting a few other guides, this was 
going to be my next step:


-Xms1024
-Xmx8192
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:NewRatio=3
-XX:MaxTenuringThreshold=8

I may try the G1 collector with Java 6 in production, since I am on the 
newest Oracle version.  I am very interested in knowing whether I need 
to worry about G1 log entries that don't say "pause" in them.  Below is 
an excerpt from a G1 log.  Notice how only a few of the lines actually 
say "pause" ... in addition to these here that say (young) there are 
some pause lines that say (mixed):


http://fpaste.org/U0aQ/

Thanks,
Shawn



Re: Long ParNew GC pauses - even when young generation is small

2013-01-18 Thread Mark Miller

On Jan 6, 2013, at 5:41 PM, Shawn Heisey  wrote:

> Clarification of my question and my goals:
> 
> What I *want* is for all GC pauses to be half a second or less.

I'd try working with the concurrent, low pause collector. Any of the stop the 
world collectors mixed with a large heap will likely mean a few second pauses 
at least at some points. A well tuned concurrent collector will never step the 
world in most situations.

-XX:+UseConcMarkSweepGC

I wrote an article that might be useful a while back: 
http://searchhub.org/2011/03/27/garbage-collection-bootcamp-1-0/

- Mark

Re: Long ParNew GC pauses - even when young generation is small

2013-01-18 Thread Shawn Heisey

On 1/18/2013 12:40 PM, giltene wrote:

<< product bragging alert >>

We've got people running Solr on the Zing JVM at various places for exactly
this reason. A key side effect of running on Zing is the complete
elimination of GC effects, with no code changes or tuning needed.

So instead of wanting pauses of half a second or less, and settling for
pauses of 2 seconds or less (per your message), you can instead actually run
on a JVM with a GC behavior that drops noise to below 20 msec with Solr. And
you can get this the day you turn Zing on, without needing to know any more
about tuning, or having to make the various interplays or tradeoffs around
things like Parnew sizing, heap sizing, occupancy thresholds, and the 75
other flags that may strongly affect your user experience.


I don't see any info on your website about pricing, so I can't make any 
decisions about whether it would be right for me.  Can you give me 
long-term pricing information?


Chances are that once I inform management of the cost, it'd never fly.

Does anyone know how to get good GC pause characteristics with Solr and 
the latest Oracle Java 7?


Thanks,
Shawn



Re: Long ParNew GC pauses - even when young generation is small

2013-01-18 Thread giltene
<< product bragging alert >>

We've got people running Solr on the Zing JVM at various places for exactly
this reason. A key side effect of running on Zing is the complete
elimination of GC effects, with no code changes or tuning needed.
   
So instead of wanting pauses of half a second or less, and settling for
pauses of 2 seconds or less (per your message), you can instead actually run
on a JVM with a GC behavior that drops noise to below 20 msec with Solr. And
you can get this the day you turn Zing on, without needing to know any more
about tuning, or having to make the various interplays or tradeoffs around
things like Parnew sizing, heap sizing, occupancy thresholds, and the 75
other flags that may strongly affect your user experience.
 
-- Gil. (CTO, Azul Systems).



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Long-ParNew-GC-pauses-even-when-young-generation-is-small-tp4031110p4034645.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Long ParNew GC pauses - even when young generation is small

2013-01-06 Thread Shawn Heisey

On 1/6/2013 9:14 AM, Shawn Heisey wrote:

I have now reduced my Eden size from about 4GB to under 256MB (max heap
is still 8GB), but the long pauses are still happening on my Solr 3.5.0
servers, and they are still ParNew collections.


Clarification of my question and my goals:

What I *want* is for all GC pauses to be half a second or less.  What I 
can settle for is all GC pauses to be two seconds or less.  My load 
balancer's health check interval and timeout are both five seconds.  Is 
there any viable GC config that can achieve this with an 8GB heap?  I 
need the heap to be that big in order for full-imports to succeed.  Any 
smaller and a full-import takes forever due to constant GC.


I'm now doing some experimentation with not specifying any GC options at 
all, which results in jconsole showing PS MarkSweep and PS Scavenge for 
the GC methods.  PS Scavenge collections seem to meet my half second 
requirement, but the (very infrequent) PS MarkSweep collections tend to 
be slightly above 3 seconds.  It's Sunday, and traffic is low, so I must 
try putting some load on it to verify.


From what I remember back when I first started with Solr, the default 
GC options lead to elevated total GC times during full-import index 
creation, but I can arrange for the load balancer to ignore servers that 
are doing a full-import.  If the default options work well for normal 
operation, and a full-import isn't very slow, that might be worth 
looking into.


Thanks,
Shawn



Re: Long ParNew GC pauses - even when young generation is small

2013-01-06 Thread Shawn Heisey

On 1/6/2013 9:14 AM, Shawn Heisey wrote:

What other information do I need to provide to track down what's happening?


Some additional info - part of a GC log showing a pause of over 8 
seconds (second entry), followed by *much* shorter pauses:


2013-01-06T01:52:02.653-0700: 41837.979: [GC 41837.979: [ParNew: 
235968K->26176K(235968K), 2.8210460 secs] 7053020K->6877758K(7565932K) 
icms_dc=0 , 2.8212100 secs] [Times: user=2.55 sys=2.74, real=2.82 secs]
2013-01-06T01:52:26.262-0700: 41861.588: [GC 41861.588: [ParNew: 
235968K->26176K(235968K), 8.1470560 secs] 7087550K->6968424K(7565932K) 
icms_dc=0 , 8.1472350 secs] [Times: user=0.98 sys=7.89, real=8.14 secs]
2013-01-06T01:53:40.047-0700: 41935.372: [GC 41935.373: [ParNew: 
235946K->26176K(235968K), 1.7302800 secs] 7178194K->6992434K(7565932K) 
icms_dc=0 , 1.7304450 secs] [Times: user=0.49 sys=1.64, real=1.73 secs]
2013-01-06T01:54:57.482-0700: 42012.808: [GC 42012.808: [ParNew: 
235968K->22501K(235968K), 1.6947300 secs] 7202226K->7009700K(7565932K) 
icms_dc=0 , 1.6948950 secs] [Times: user=0.34 sys=1.63, real=1.70 secs]
2013-01-06T01:55:42.503-0700: 42057.829: [GC 42057.829: [ParNew: 
232293K->15609K(235968K), 0.0275450 secs] 7219492K->7002809K(7565932K) 
icms_dc=0 , 0.0277000 secs] [Times: user=0.19 sys=0.00, real=0.03 secs]
2013-01-06T01:56:34.795-0700: 42110.121: [GC 42110.121: [ParNew: 
225401K->26176K(235968K), 0.0427490 secs] 7212601K->7014742K(7565932K) 
icms_dc=0 , 0.0429060 secs] [Times: user=0.22 sys=0.00, real=0.05 secs]
2013-01-06T01:57:35.874-0700: 42171.200: [GC 42171.200: [ParNew: 
235968K->19469K(235968K), 0.1146220 secs] 7224534K->7015687K(7565932K) 
icms_dc=0 , 0.1147830 secs] [Times: user=0.39 sys=0.08, real=0.12 secs]
2013-01-06T01:58:49.338-0700: 42244.664: [GC 42244.664: [ParNew: 
229261K->24056K(235968K), 0.0462400 secs] 7225479K->7020273K(7565932K) 
icms_dc=0 , 0.0463990 secs] [Times: user=0.22 sys=0.00, real=0.05 secs]


Thanks,
Shawn