Do you have GC logging enabled? tail -f the log file and you'll see what CMS is telling you. Tuning the occupancy fraction of the tenured generation to a lower value than the default, and telling the JVM to use only your value to initiate a collection, can help a lot. The same goes for sizing the young generation and sometimes the survivor ratio.
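For reference, the sort of flags I mean look roughly like this. This is only a sketch: the log path, young generation size, occupancy percentage and survivor ratio below are placeholders to tune against your own GC logs, not recommendations.

```shell
# Sketch of HotSpot options for GC logging plus CMS tuning.
# All values below (path, 70, 512m, 4) are placeholders -- measure first.
JAVA_OPTS="\
  -verbose:gc -Xloggc:/var/log/solr/gc.log \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly \
  -Xmn512m \
  -XX:SurvivorRatio=4"
```

`-XX:+UseCMSInitiatingOccupancyOnly` is the piece that tells the JVM to use only your occupancy value instead of its own adaptive heuristic; without it, `CMSInitiatingOccupancyFraction` is just a starting hint.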
Consult the HotSpot CMS settings and the young generation (or new) sizes. They are very important. If you have multiple slaves under the same load you can easily try different configurations. Keeping an eye on the nodes with a tool like JConsole while tailing the GC log will help a lot. Don't forget to send updates and commit frequently, or you won't be able to replay. I've never seen a Solr instance go down under heavy load without commits, but they tend to behave badly when commits occur under heavy load with long cache warming times (and the heap consumption that goes with them). You might also be suffering from memory fragmentation; this is bad and can lead to failure. You can configure the JVM to force a compaction before a GC, which is nice, but it does consume CPU time. A query of death can, in theory, also happen when you sort on a very large dataset that isn't optimized; in that case the maxDoc value is too high. Anyway, try some settings, monitor the nodes, and please report your findings.

> On Mon, Feb 07, 2011 at 02:06:00PM +0100, Markus Jelsma said:
> > Heap usage can spike after a commit. Existing caches are still in use and
> > new caches are being generated and/or auto warmed. Can you confirm this
> > is the case?
>
> We see spikes after replication which I suspect is, as you say, because
> of the ensuing commit.
>
> What we seem to have found is that when we weren't using the Concurrent
> GC, stop-the-world GC runs would kill the app. Now that we're using CMS
> we occasionally find ourselves in situations where the app still has
> memory "left over" but the load on the machine spikes, the GC duty cycle
> goes to 100 and the app never recovers.
>
> Restarting usually helps but sometimes we have to take the machine out
> of the load balancer, wait for a number of minutes and then put it back
> in.
> We're working on two hypotheses.
>
> Firstly - we're CPU bound somehow and at some point we cross some
> threshold and GC or something else is just unable to keep up. So
> whilst it looks like instantaneous death of the app it's actually
> gradual resource exhaustion, where the definition of 'gradual' is 'a very
> short period of time' (as opposed to some cataclysmic infinite loop bug
> somewhere).
>
> Either that or ... Secondly - there's some sort of Query of Death that
> kills machines. We just haven't found it yet, even when replaying logs.
>
> Or some combination of both. Or other things. It's maddeningly
> frustrating.
>
> We've also got to try deploying a custom solr.war and try using
> MMapDirectory to see if that helps with anything.
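On the MMapDirectory point: depending on your Solr version you may not need a custom solr.war at all. Newer Solr releases let you swap the directory implementation via the directoryFactory hook in solrconfig.xml; this is a sketch, and you should check that MMapDirectoryFactory is actually shipped in the version you run (on older releases the custom-war route is still required):

```xml
<!-- Sketch only: works where solr.MMapDirectoryFactory is available;
     verify against your Solr version before relying on it. -->
<directoryFactory name="DirectoryFactory"
                  class="solr.MMapDirectoryFactory"/>
```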