Hi,

Just curious, was there any resolution to this?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 8. feb. 2011, at 03.40, Markus Jelsma wrote:

> Do you have GC logging enabled? Run tail -f on the log file and you'll
> see what CMS is telling you. Tuning the occupancy fraction of the
> tenured generation to a lower value than the default, and telling the
> JVM to only use your value to initiate a collection, can help a lot. The
> same goes for sizing the young generation and sometimes the survivor
> ratio.
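As a concrete sketch of the settings being described above: the flag names are standard 2011-era HotSpot options, but every value here is purely illustrative, not a recommendation.

```shell
# GC logging: tail -f gc.log to watch what CMS is doing.
JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log"
# Start CMS when the tenured generation is 70% full, and only then
# (rather than letting the JVM pick its own trigger point):
JAVA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"
# Explicit young (new) generation size and survivor ratio:
JAVA_OPTS="$JAVA_OPTS -Xmn512m -XX:SurvivorRatio=8"
java $JAVA_OPTS -jar start.jar
```

Tune the occupancy fraction and young-generation size against your own heap behaviour; sensible values depend entirely on index size, cache configuration, and commit frequency.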
> 
> Consult the HotSpot CMS settings and young generation (or new) sizes.
> They are very important.
> 
> If you have multiple slaves under the same load you can easily try
> different configurations. Keeping an eye on the nodes with a tool like
> JConsole while tailing the GC log at the same time will help a lot.
> Don't forget to send updates and frequent commits too, or your replay
> won't be realistic. I've never seen a Solr instance go down under heavy
> load without commits, but they tend to behave badly when commits occur
> under heavy load with long cache warming times (and heap consumption).
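A minimal way to watch a node while replaying load, alongside JConsole: jstat ships with the JDK; the pid and sampling interval below are placeholders.

```shell
# Heap occupancy per generation and GC counts/times for a running Solr JVM.
# <pid> is the Solr process id; 1000 = sample every second.
jstat -gcutil <pid> 1000

# In another terminal, follow the GC log enabled via -Xloggc:
tail -f gc.log
```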
> 
> You might also be suffering from memory fragmentation; this is bad and
> can lead to failure. You can configure the JVM to force a compaction
> before a GC, which is nice, but it does consume CPU time.
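The compaction behaviour he's referring to maps, on CMS-era HotSpot, to roughly the flags below; verify them against your specific JVM version before relying on them.

```shell
# Compact the tenured generation on full collections, trading CPU time
# for reduced fragmentation (illustrative; CMS-era HotSpot only):
JAVA_OPTS="$JAVA_OPTS -XX:+UseCMSCompactAtFullCollection \
  -XX:CMSFullGCsBeforeCompaction=1"
```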
> 
> A query of death can, in theory, also happen when you sort on a very
> large dataset that isn't optimized; in that case the maxDoc value is too
> high.
> 
> Anyway, try some settings and monitor the nodes and please report your 
> findings.
> 
>> On Mon, Feb 07, 2011 at 02:06:00PM +0100, Markus Jelsma said:
>>> Heap usage can spike after a commit. Existing caches are still in use and
>>> new caches are being generated and/or auto warmed. Can you confirm this
>>> is the case?
>> 
>> We see spikes after replication which I suspect is, as you say, because
>> of the ensuing commit.
>> 
>> What we seem to have found is that when we weren't using the concurrent
>> GC, stop-the-world GC runs would kill the app. Now that we're using CMS
>> we occasionally find ourselves in situations where the app still has
>> memory "left over" but the load on the machine spikes, the GC duty cycle
>> goes to 100% and the app never recovers.
>> 
>> Restarting usually helps but sometimes we have to take the machine out
>> of the load balancer, wait for a number of minutes and then put it back
>> in.
>> 
>> We're working on two hypotheses:
>> 
>> Firstly - we're CPU bound somehow, and at some point we cross some
>> threshold where GC or something else is just unable to keep up. So
>> whilst it looks like instantaneous death of the app, it's actually
>> gradual resource exhaustion where the definition of 'gradual' is 'a
>> very short period of time' (as opposed to some cataclysmic
>> infinite-loop bug somewhere).
>> 
>> Either that or ... Secondly - there's some sort of Query Of Death that
>> kills machines. We just haven't found it yet, even when replaying logs.
>> 
>> Or some combination of both. Or other things. It's maddeningly
>> frustrating.
>> 
>> We've also got to try deploying a custom solr.war and using the
>> MMapDirectory to see if that helps with anything.
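For readers following along later: in Solr releases from around 3.1 onward the custom-war step became unnecessary, because the directory implementation is selectable in solrconfig.xml. A hypothetical sketch (element name as in later Solr versions; check the docs for your release):

```xml
<!-- Illustrative only: select MMapDirectory in solrconfig.xml -->
<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>
```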
