Hi, just curious: was there any resolution to this?
-- Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 8. feb. 2011, at 03.40, Markus Jelsma wrote:

> Do you have GC logging enabled? tail -f the log file and you'll see what CMS
> is telling you. Tuning the occupancy fraction of the tenured generation to a
> lower value than the default, and telling the JVM to use only your value to
> initiate a collection, can help a lot. The same goes for sizing the young
> generation and sometimes the survivor ratio.
>
> Consult the HotSpot CMS settings and the young-generation (or new) sizes.
> They are very important.
>
> If you have multiple slaves under the same load you can easily try different
> configurations. Keeping an eye on the nodes with a tool like JConsole while
> tailing the GC log at the same time will help a lot. Don't forget to send
> updates and frequent commits or you won't be able to replay. I've never seen
> a Solr instance go down under heavy load without commits, but they tend to
> behave badly when commits occur under heavy load with long cache-warming
> times (and heap consumption).
>
> You might also be suffering from memory fragmentation, which is bad and can
> lead to failure. You can configure the JVM to force a compaction before a
> GC; that's nice, but it does consume CPU time.
>
> A query of death can, in theory, also happen when you sort on a very large
> dataset that isn't optimized; in this case the maxDoc value is too high.
>
> Anyway, try some settings, monitor the nodes, and please report your
> findings.
>
>> On Mon, Feb 07, 2011 at 02:06:00PM +0100, Markus Jelsma said:
>>> Heap usage can spike after a commit. Existing caches are still in use and
>>> new caches are being generated and/or autowarmed. Can you confirm this
>>> is the case?
>>
>> We see spikes after replication, which I suspect is, as you say, because
>> of the ensuing commit.
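For reference, the knobs Markus mentions (GC logging, CMS initiating occupancy, young-generation and survivor sizing) are all plain HotSpot JVM flags. A minimal sketch of what such a startup line might look like on a Java 6-era JVM; the heap sizes and the 70% fraction are illustrative values, not recommendations, and should be tuned per node while watching the GC log:

```shell
# Illustrative HotSpot flags only -- sizes and fractions are examples,
# not recommendations. Tune per node and watch the resulting gc.log.
java -server \
  -Xms2g -Xmx2g \
  -XX:NewSize=512m -XX:MaxNewSize=512m -XX:SurvivorRatio=4 \
  -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:/var/log/solr/gc.log \
  -jar start.jar
```

`-XX:+UseCMSInitiatingOccupancyOnly` is what makes the JVM "only use your value" rather than its own heuristics, and `-Xloggc` gives you the file to tail -f. The log path and heap figures here are assumptions for the example.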
>>
>> What we seem to have found is that when we weren't using the concurrent
>> GC, stop-the-world GC runs would kill the app. Now that we're using CMS
>> we occasionally find ourselves in situations where the app still has
>> memory "left over" but the load on the machine spikes, the GC duty cycle
>> goes to 100, and the app never recovers.
>>
>> Restarting usually helps, but sometimes we have to take the machine out
>> of the load balancer, wait for a number of minutes, and then put it back
>> in.
>>
>> We're working on two hypotheses.
>>
>> Firstly, we're CPU-bound somehow, and at some point we cross some
>> threshold and GC or something else is just unable to keep up. So
>> while it looks like instantaneous death of the app, it's actually
>> gradual resource exhaustion, where the definition of "gradual" is "a
>> very short period of time" (as opposed to some cataclysmic infinite-loop
>> bug somewhere).
>>
>> Either that or... Secondly, there's some sort of Query of Death that
>> kills machines. We just haven't found it yet, even when replaying logs.
>>
>> Or some combination of both. Or other things. It's maddeningly
>> frustrating.
>>
>> We've also got to try deploying a custom solr.war and using
>> MMapDirectory to see if that helps with anything.
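For anyone finding this thread later: switching the index to MMapDirectory is a solrconfig.xml change once a directory factory for it is available (in the Solr 1.4 era that generally meant patching in your own, hence the custom solr.war above; later 3.x releases ship one). A sketch of the config fragment, assuming a Solr version that provides `solr.MMapDirectoryFactory`:

```xml
<!-- solrconfig.xml: use memory-mapped index files instead of the default
     directory implementation. Requires a Solr build that ships
     solr.MMapDirectoryFactory (3.x); on 1.4 you would patch in your own. -->
<directoryFactory name="DirectoryFactory"
                  class="solr.MMapDirectoryFactory"/>
```

Note that mmap moves index reads off the Java heap into the OS page cache, so it changes the heap/RAM balance you're tuning above.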