Do you have GC logging enabled? Run tail -f on the log file and you'll see 
what CMS is telling you. Tuning the occupation fraction of the tenured 
generation to a value lower than the default, and telling the JVM to use only 
your value to initiate a collection, can help a lot. The same goes for sizing 
the young generation and, sometimes, the survivor ratio.
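
Something along these lines could be a starting point; the exact values are 
only guesses that depend on your heap size and load, and the start command 
and log path are just illustrative:

  java -Xms4096m -Xmx4096m \
    -XX:+UseConcMarkSweepGC \
    -XX:CMSInitiatingOccupancyFraction=70 \
    -XX:+UseCMSInitiatingOccupancyOnly \
    -Xmn512m -XX:SurvivorRatio=6 \
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
    -Xloggc:/var/log/solr/gc.log \
    -jar start.jar

CMSInitiatingOccupancyFraction together with UseCMSInitiatingOccupancyOnly 
controls when CMS kicks in, -Xmn and SurvivorRatio size the young generation, 
and the remaining flags just write the GC log you can tail.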

Consult the HotSpot CMS settings and young generation (or new) sizes. They are 
very important.

If you have multiple slaves under the same load you can easily try different 
configurations. Keeping an eye on the nodes with a tool like JConsole while 
tailing the GC log at the same time will help a lot. Don't forget to send 
updates and frequent commits as well, or you won't be able to replay the real 
scenario. I've never seen a Solr instance go down under heavy load without 
commits, but they do tend to behave badly when commits occur under heavy load 
with long cache warming times (and the heap consumption that comes with them).
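
If the nodes are remote you can expose JMX for JConsole with something like 
this (the port number is arbitrary, and don't leave authentication switched 
off on a network you don't trust):

  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=18983 \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.authenticate=false

Then point JConsole at host:18983 and watch the heap and GC charts next to 
the tail of the GC log.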

You might also be suffering from memory fragmentation; that is bad and can 
lead to promotion failures. You can configure the JVM to force a compaction 
on full collections, which is nice, but it does consume CPU time.
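
If I remember correctly the relevant flags are along these lines (check the 
defaults for your JVM version, they have changed over time):

  -XX:+UseCMSCompactAtFullCollection \
  -XX:CMSFullGCsBeforeCompaction=1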

A query of death can, in theory, also happen when you sort on a very large 
dataset in an index that isn't optimized; in that case the maxDoc value is 
too high.

Anyway, try some settings and monitor the nodes and please report your 
findings.

> On Mon, Feb 07, 2011 at 02:06:00PM +0100, Markus Jelsma said:
> > Heap usage can spike after a commit. Existing caches are still in use and
> > new caches are being generated and/or auto warmed. Can you confirm this
> > is the case?
> 
> We see spikes after replication which I suspect is, as you say, because
> of the ensuing commit.
> 
> What we seem to have found is that when we weren't using the Concurrent
> GC stop-the-world gc runs would kill the app. Now that we're using CMS
> we occasionally find ourselves in situations where the app still has
> memory "left over" but the load on the machine spikes, the GC duty cycle
> goes to 100 and the app never recovers.
> 
> Restarting usually helps but sometimes we have to take the machine out
> of the load balancer, wait for a number of minutes and then put it back
> in.
> 
> We're working on two hypotheses:
> 
> Firstly - we're CPU bound somehow and at some point we cross some
> threshold and GC or something else is just unable to keep up. So
> whilst it looks like instantaneous death of the app it's actually
> gradual resource exhaustion where the definition of 'gradual' is 'a very
> short period of time' (as opposed to some cataclysmic infinite loop bug
> somewhere).
> 
> Either that or ... Secondly - there's some sort of Query Of Death that
> kills machines. We just haven't found it yet, even when replaying logs.
> 
> Or some combination of both. Or other things. It's maddeningly
> frustrating.
> 
> We've also got to try deploying a custom solr.war and try using the
> MMapDirectory to see if that helps with anything.
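
For what it's worth: on Solr versions that ship MMapDirectoryFactory you may 
not need a custom war at all; it can be selected in solrconfig.xml with 
something like:

  <directoryFactory name="DirectoryFactory"
                    class="solr.MMapDirectoryFactory"/>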
