Re: Long ParNew GC pauses - even when young generation is small
On 7/7/2014 10:22 AM, aferdous wrote: > Hi Shawn - I was just wondering how did you resolve this issue in the end. We > are seeing the same issue with our platform (similar heap size) and updater > volume. > > It would be nice if you could provide us with your final findings/configs. I use the following GC tuning parameters now, with FAR better results: http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning The G1 collector performed worse for me than untuned CMS. The average collection was awesome, but when it finally did do a stop-the-world, those were even longer. Thanks, Shawn
Re: Long ParNew GC pauses - even when young generation is small
Hi Shawn - I was just wondering how did you resolve this issue in the end. We are seeing the same issue with our platform (similar heap size) and updater volume. It would be nice if you could provide us with your final findings/configs. -- View this message in context: http://lucene.472066.n3.nabble.com/Long-ParNew-GC-pauses-even-when-young-generation-is-small-tp4031110p4145938.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Long ParNew GC pauses - even when young generation is small
If you believe the logs, using -XX:+PrintGCApplicationStoppedTime is probably the easiest way to avoid having to try to parse pause times from various formats. But remember, GC logs can [often unintentionally] lie (I've seen them under-report by multi-second gaps). If you want to actually measure your JVM pauses (GC or others), you can use something like jHiccup (http://www.azulsystems.com/jHiccup). It is a free (as in beer) and public domain (CC0) tool that will show you any blip/glitch/hiccup that you jvm experiences while running your application, and report it in both time based and detailed percentile form. What jHccup shows you is a "best-case" response time for your applicartion as it runs (the response time the application would have shown if completed everything as zero work). It's near-trivial to add jHiccup to your environment (as either a java agent or wrapper script). It would be interesting to see the percentile histograms (jHiccup's .hgrm text output) for your environment. -- View this message in context: http://lucene.472066.n3.nabble.com/Long-ParNew-GC-pauses-even-when-young-generation-is-small-tp4031110p4034934.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Long ParNew GC pauses - even when young generation is small
> I don't see any info on your website about pricing, so I can't make any > decisions about whether it would be right for me. Can you give me > long-term pricing information? As is the case with much of enterprise software (including getting a supported version of Oracle HotSpot), this is a sales-person conversation that we'd be happy to have. You can ask for someone to contact you about this right on the site, or if you want, you can contact me at gil azulsystems com and I'll make sure we get you the information you need. > Chances are that once I inform management of the cost, it'd never fly. You may be surprised. You seem to assume that Zing is expensive for some reason, while it's probably on par or cheaper than other supported JVMs for this sort of thing. It's certainly "flown with management" for others running into the exact same problems with both Solr and Lucene. Saved them both time and money in the process of forever removing GC headaches. -- View this message in context: http://lucene.472066.n3.nabble.com/Long-ParNew-GC-pauses-even-when-young-generation-is-small-tp4031110p4034932.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Long ParNew GC pauses - even when young generation is small
Unfortunately, G1 on Java 6 was a bust. Several times GC pauses made my load balancer think the server was down, just like with CMS/ParNew. Either there's something about my production query patterns that doesn't get along with any of the garbage collection methods, or I need to upgrade to Java 7. I have tried lowering my max heap before. That results in OOM problems when I do full-import with DIH. On 1/20/2013 2:13 PM, Markus Jelsma wrote: Hi Shawn, Although our heap spaces are much less than yours (256M for 2x 2.5GB cores per node) we saw decreased throughput and higher latency with G1 on Java 6. You can also expect higher CPU consumption. You can check it very well with VisualVM attached. Looking forward to your results. Markus -Original message- From:Shawn Heisey Sent: Sun 20-Jan-2013 21:48 To: solr-user@lucene.apache.org Subject: Re: Long ParNew GC pauses - even when young generation is small On 1/18/2013 10:07 PM, Shawn Heisey wrote: I may try the G1 collector with Java 6 in production, since I am on the newest Oracle version. I am giving this a try on my secondary server set. An encouraging note: The -XX:+UnlockExperimentalVMOptions option is no longer required to use the G1 collector, at least on version 6u38. Thanks, Shawn
Re: Long ParNew GC pauses - even when young generation is small
On 1/20/2013 2:13 PM, Markus Jelsma wrote: Hi Shawn, Although our heap spaces are much less than yours (256M for 2x 2.5GB cores per node) we saw decreased throughput and higher latency with G1 on Java 6. You can also expect higher CPU consumption. You can check it very well with VisualVM attached. Looking forward to your results. I don't have any really good test tools developed for testing throughput and latency. I have some less-than-ideal tools for other purposes that I might be able to adapt. Throughput is not a major issue for us - query volume is quite low. I would be mildly surprised by 5 queries per second. I don't have much of an idea of queries per second over the short term - the numbers available in 3.5 are limited. As for latency, early indications from an old SOLR-1972 patch suggest that the QTime values might be a little higher. The primary server stats (using CMS/ParNew) are over 1 million queries, and the secondary server stats (using G1) so far are only about 5000 queries. The QTime values are steadily dropping as the number of queries goes up. Here's a status page that gathers all the stats. Chain A is using CMS/ParNew and is no longer receiving queries. All the queries are now going to chain B, which is using G1. http://dl.dropbox.com/u/97770508/g1-vs-cms-stats.png The server CPU utilization graph doesn't have enough information yet to make any determination, but what little data is visible suggests that CPU may be higher. The secondary servers also have slightly slower CPUs than the primary servers. I was forced to make concessions on later purchases to keep the cost down. Thanks, Shawn
RE: Long ParNew GC pauses - even when young generation is small
Hi Shawn, Although our heap spaces are much less than yours (256M for 2x 2.5GB cores per node) we saw decreased throughput and higher latency with G1 on Java 6. You can also expect higher CPU consumption. You can check it very well with VisualVM attached. Looking forward to your results. Markus -Original message- > From:Shawn Heisey > Sent: Sun 20-Jan-2013 21:48 > To: solr-user@lucene.apache.org > Subject: Re: Long ParNew GC pauses - even when young generation is small > > On 1/18/2013 10:07 PM, Shawn Heisey wrote: > > I may try the G1 collector with Java 6 in production, since I am on the > > newest Oracle version. > > I am giving this a try on my secondary server set. An encouraging note: > The -XX:+UnlockExperimentalVMOptions option is no longer required to use > the G1 collector, at least on version 6u38. > > Thanks, > Shawn > >
Re: Long ParNew GC pauses - even when young generation is small
On 1/18/2013 10:07 PM, Shawn Heisey wrote: I may try the G1 collector with Java 6 in production, since I am on the newest Oracle version. I am giving this a try on my secondary server set. An encouraging note: The -XX:+UnlockExperimentalVMOptions option is no longer required to use the G1 collector, at least on version 6u38. Thanks, Shawn
Re: Long ParNew GC pauses - even when young generation is small
On 1/20/2013 11:33 AM, Shawn Heisey wrote: On 1/18/2013 10:07 PM, Shawn Heisey wrote: On my dev 4.1 server with Java 7u11, I am using the G1 collector with a max pause target of 1500ms. I was thinking that this collector was producing long pauses too, but after reviewing the gc log with a closer eye, I see that there are lines that specifically say "pause" ... and all of THOSE lines are below half a second except one that took 1.4 seconds. Does that mean that it's actually meeting the target, or are the other lines that show quite long time values indicative of a problem? If only the lines that explicitly say "pause" are the ones I need to worry about, then it looks like G1 is the clear winner. Here's a paste showing a command and its output. I included "remark" in the grep because I saw a presentation saying that remark in G1 is stop-the-world: http://pastie.org/private/vygpvtjzicsl8uztg3drw None of the matching log lines get close to my 5 second pain point. If I check the entire unfiltered log for lines that exceed 3 seconds, I do find a few, but only one of them says "pause" and it's far enough below the 5 second level that it probably would not cause a problem: http://pastie.org/private/wcessvbrditextxmoapksq Here's the perl script used in the two outputs above: http://pastie.org/private/itu9hbgiwugdjtmy3yg8g Here's the full gc log for anyone that feels compelled to fully investigate: http://dl.dropbox.com/u/97770508/gc.log Thanks, Shawn
Re: Long ParNew GC pauses - even when young generation is small
On 1/18/2013 10:07 PM, Shawn Heisey wrote: On my dev 4.1 server with Java 7u11, I am using the G1 collector with a max pause target of 1500ms. I was thinking that this collector was producing long pauses too, but after reviewing the gc log with a closer eye, I see that there are lines that specifically say "pause" ... and all of THOSE lines are below half a second except one that took 1.4 seconds. Does that mean that it's actually meeting the target, or are the other lines that show quite long time values indicative of a problem? If only the lines that explicitly say "pause" are the ones I need to worry about, then it looks like G1 is the clear winner. Here's a paste showing a command and its output. I included "remark" in the grep because I saw a presentation saying that remark in G1 is stop-the-world: http://pastie.org/private/vygpvtjzicsl8uztg3drw None of the matching log lines get close to my 5 second pain point. If I check the entire unfiltered log for lines that exceed 3 seconds, I do find a few, but only one of them says "pause" and it's far enough below the 5 second level that it probably would not cause a problem: http://pastie.org/private/wcessvbrditextxmoapksq Here's the perl script used in the two outputs above: http://pastie.org/private/itu9hbgiwugdjtmy3yg8g The log was gathered during a full-import of six large shards, over 12 million docs each. The import took 7 hours. I had the patches for LUCENE-4599 (Compressed TermVectors) applied to Solr 4.1 at the time. What I'd like to know is whether a 'concurrent-mark-end' line indicates stop-the-world or not. I suspect that it is done while the application is working. If this is right, then I think I have found the right GC settings: -XX:+UseG1GC -XX:MaxGCPauseMillis=1500 -XX:GCPauseIntervalMillis=4000 My production servers have more total memory, more CPU cores, and much faster I/O than the dev server where I have been running these tests, but they both use the same 8GB java heap. One last question: Should I be worried about using the G1 collector on Oracle Java 6u38, which was released at the same time as 7u11? This *might* be a good opportunity to upgrade to java 7 in production, actually. I have two completely independent index chains, I could upgrade the secondary. If anyone has any suggestions for my GC parsing perl script, or a knows about a much more functional replacement, let me know. Thanks, Shawn
Re: Long ParNew GC pauses - even when young generation is small
On 1/18/2013 8:37 PM, Mark Miller wrote: On Jan 6, 2013, at 5:41 PM, Shawn Heisey wrote: Clarification of my question and my goals: What I *want* is for all GC pauses to be half a second or less. I'd try working with the concurrent, low pause collector. Any of the stop the world collectors mixed with a large heap will likely mean a few second pauses at least at some points. A well tuned concurrent collector will never step the world in most situations. -XX:+UseConcMarkSweepGC I wrote an article that might be useful a while back: http://searchhub.org/2011/03/27/garbage-collection-bootcamp-1-0/ Mark, I have been using that collector. When I had a very large young generation (NewRatio=1), most of the really long collections were ParNew. When I lowered the young generation size drastically, the overall situation got slightly better. Unfortunately there are still long pauses, but now they are CMS. I wrote a handy little perl script to parse a GC log and spit out a compact listing of every line that takes longer than half a second. On my dev 4.1 server with Java 7u11, I am using the G1 collector with a max pause target of 1500ms. I was thinking that this collector was producing long pauses too, but after reviewing the gc log with a closer eye, I see that there are lines that specifically say "pause" ... and all of THOSE lines are below half a second except one that took 1.4 seconds. Does that mean that it's actually meeting the target, or are the other lines that show quite long time values indicative of a problem? If only the lines that explicitly say "pause" are the ones I need to worry about, then it looks like G1 is the clear winner. My production servers are version 3.5 with Java 6u38. After reading your bootcamp and consulting a few other guides, this was going to be my next step: -Xms1024 -Xmx8192 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:NewRatio=3 -XX:MaxTenuringThreshold=8 I may try the G1 collector with Java 6 in production, since I am on the newest Oracle version. I am very interested in knowing whether I need to worry about G1 log entries that don't say "pause" in them. Below is an excerpt from a G1 log. Notice how only a few of the lines actually say "pause" ... in addition to these here that say (young) there are some pause lines that say (mixed): http://fpaste.org/U0aQ/ Thanks, Shawn
Re: Long ParNew GC pauses - even when young generation is small
On Jan 6, 2013, at 5:41 PM, Shawn Heisey wrote: > Clarification of my question and my goals: > > What I *want* is for all GC pauses to be half a second or less. I'd try working with the concurrent, low pause collector. Any of the stop the world collectors mixed with a large heap will likely mean a few second pauses at least at some points. A well tuned concurrent collector will never step the world in most situations. -XX:+UseConcMarkSweepGC I wrote an article that might be useful a while back: http://searchhub.org/2011/03/27/garbage-collection-bootcamp-1-0/ - Mark
Re: Long ParNew GC pauses - even when young generation is small
On 1/18/2013 12:40 PM, giltene wrote: << product bragging alert >> We've got people running Solr on the Zing JVM at various places for exactly this reason. A key side effect of running on Zing is the complete elimination of GC effects, with no code changes or tuning needed. So instead of wanting pauses of half a second or less, and settling for pauses of 2 seconds or less (per your message), you can instead actually run on a JVM with a GC behavior that drops noise to below 20 msec with Solr. And you can get this the day you turn Zing on, without needing to know any more about tuning, or having to make the various interplays or tradeoffs around things like Parnew sizing, heap sizing, occupancy thresholds, and the 75 other flags that may strongly affect your user experience. I don't see any info on your website about pricing, so I can't make any decisions about whether it would be right for me. Can you give me long-term pricing information? Chances are that once I inform management of the cost, it'd never fly. Does anyone know how to get good GC pause characteristics with Solr and the latest Oracle Java 7? Thanks, Shawn
Re: Long ParNew GC pauses - even when young generation is small
<< product bragging alert >> We've got people running Solr on the Zing JVM at various places for exactly this reason. A key side effect of running on Zing is the complete elimination of GC effects, with no code changes or tuning needed. So instead of wanting pauses of half a second or less, and settling for pauses of 2 seconds or less (per your message), you can instead actually run on a JVM with a GC behavior that drops noise to below 20 msec with Solr. And you can get this the day you turn Zing on, without needing to know any more about tuning, or having to make the various interplays or tradeoffs around things like Parnew sizing, heap sizing, occupancy thresholds, and the 75 other flags that may strongly affect your user experience. -- Gil. (CTO, Azul Systems). -- View this message in context: http://lucene.472066.n3.nabble.com/Long-ParNew-GC-pauses-even-when-young-generation-is-small-tp4031110p4034645.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Long ParNew GC pauses - even when young generation is small
On 1/6/2013 9:14 AM, Shawn Heisey wrote: I have now reduced my Eden size from about 4GB to under 256MB (max heap is still 8GB), but the long pauses are still happening on my Solr 3.5.0 servers, and they are still ParNew collections. Clarification of my question and my goals: What I *want* is for all GC pauses to be half a second or less. What I can settle for is all GC pauses to be two seconds or less. My load balancer's health check interval and timeout are both five seconds. Is there any viable GC config that can achieve this with an 8GB heap? I need the heap to be that big in order for full-imports to succeed. Any smaller and a full-import takes forever due to constant GC. I'm now doing some experimentation with not specifying any GC options at all, which results in jconsole showing PS MarkSweep and PS Scavenge for the GC methods. PS Scavenge collections seem to meet my half second requirement, but the (very infrequent) PS MarkSweep collections tend to be slightly above 3 seconds. It's Sunday, and traffic is low, so I must try putting some load on it to verify. From what I remember back when I first started with Solr, the default GC options lead to elevated total GC times during full-import index creation, but I can arrange for the load balancer to ignore servers that are doing a full-import. If the default options work well for normal operation, and a full-import isn't very slow, that might be worth looking into. Thanks, Shawn
Re: Long ParNew GC pauses - even when young generation is small
On 1/6/2013 9:14 AM, Shawn Heisey wrote: What other information do I need to provide to track down what's happening? Some additional info - part of a GC log showing a pause of over 8 seconds (second entry), followed by *much* shorter pauses: 2013-01-06T01:52:02.653-0700: 41837.979: [GC 41837.979: [ParNew: 235968K->26176K(235968K), 2.8210460 secs] 7053020K->6877758K(7565932K) icms_dc=0 , 2.8212100 secs] [Times: user=2.55 sys=2.74, real=2.82 secs] 2013-01-06T01:52:26.262-0700: 41861.588: [GC 41861.588: [ParNew: 235968K->26176K(235968K), 8.1470560 secs] 7087550K->6968424K(7565932K) icms_dc=0 , 8.1472350 secs] [Times: user=0.98 sys=7.89, real=8.14 secs] 2013-01-06T01:53:40.047-0700: 41935.372: [GC 41935.373: [ParNew: 235946K->26176K(235968K), 1.7302800 secs] 7178194K->6992434K(7565932K) icms_dc=0 , 1.7304450 secs] [Times: user=0.49 sys=1.64, real=1.73 secs] 2013-01-06T01:54:57.482-0700: 42012.808: [GC 42012.808: [ParNew: 235968K->22501K(235968K), 1.6947300 secs] 7202226K->7009700K(7565932K) icms_dc=0 , 1.6948950 secs] [Times: user=0.34 sys=1.63, real=1.70 secs] 2013-01-06T01:55:42.503-0700: 42057.829: [GC 42057.829: [ParNew: 232293K->15609K(235968K), 0.0275450 secs] 7219492K->7002809K(7565932K) icms_dc=0 , 0.0277000 secs] [Times: user=0.19 sys=0.00, real=0.03 secs] 2013-01-06T01:56:34.795-0700: 42110.121: [GC 42110.121: [ParNew: 225401K->26176K(235968K), 0.0427490 secs] 7212601K->7014742K(7565932K) icms_dc=0 , 0.0429060 secs] [Times: user=0.22 sys=0.00, real=0.05 secs] 2013-01-06T01:57:35.874-0700: 42171.200: [GC 42171.200: [ParNew: 235968K->19469K(235968K), 0.1146220 secs] 7224534K->7015687K(7565932K) icms_dc=0 , 0.1147830 secs] [Times: user=0.39 sys=0.08, real=0.12 secs] 2013-01-06T01:58:49.338-0700: 42244.664: [GC 42244.664: [ParNew: 229261K->24056K(235968K), 0.0462400 secs] 7225479K->7020273K(7565932K) icms_dc=0 , 0.0463990 secs] [Times: user=0.22 sys=0.00, real=0.05 secs] Thanks, Shawn