Re: entire farm fails at the same time with OOM issues

Ken Krugler Wed, 01 Dec 2010 09:41:24 -0800


On Nov 30, 2010, at 5:16pm, Robert Petersen wrote:

What would I do with the heap dump though?  Run one of those java heap
analyzers looking for memory leaks or something?  I have no experience
with those.  I saw there was a bug fix in solr 1.4.1 for a 100 byte memory
leak occurring on each commit, but it would take thousands of commits to
make that add up to anything right?

Typically when I run out of memory in Solr, it's during an index update, when the new index searcher is getting warmed up.

Looking at the heap often shows ways to reduce memory requirements, e.g. you'll see a really big chunk used for a sorted field.

See http://wiki.apache.org/solr/SolrCaching and http://wiki.apache.org/solr/SolrPerformanceFactors for more details.
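
For a quick first look before opening a full dump, a class histogram of the running JVM can already hint at what dominates the heap. The helper below is only a sketch: the function name top_heap_classes is made up for illustration, and it assumes a JDK (not just a JRE) is installed so that jmap is on the PATH.

```shell
# Hypothetical helper: show the classes using the most heap in a running JVM.
# Assumes jmap from a JDK is on PATH and is run as the user that owns the JVM.
top_heap_classes() {
    pid="$1"
    count="${2:-20}"
    if [ -z "$pid" ]; then
        echo "usage: top_heap_classes <jvm-pid> [rows]" >&2
        return 2
    fi
    # jmap -histo prints one row per class (rank, instances, bytes, class name),
    # largest shallow size first. The first three lines are header lines.
    jmap -histo "$pid" | head -n "$((count + 3))"
}
```

Calling top_heap_classes with the Tomcat process id prints the top 20 classes by shallow size. The .hprof file written after an OOM can then be opened in a heap analyzer such as jhat (bundled with JDK 6) or Eclipse Memory Analyzer to see what is holding on to those objects.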


-- Ken


-----Original Message-----
From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Tuesday, November 30, 2010 3:12 PM
To: solr-user@lucene.apache.org
Subject: Re: entire farm fails at the same time with OOM issues

Hi Robert,

I'd recommend launching Tomcat with -XX:+HeapDumpOnOutOfMemoryError
and -XX:HeapDumpPath=<path to where you want the file to go>, so then
you have something to look at versus a Gedankenexperiment :)
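
One way those two flags might be passed to Tomcat is through CATALINA_OPTS, for example in a bin/setenv.sh file. This is a minimal sketch, assuming a Tomcat version whose catalina.sh sources bin/setenv.sh; the directory /var/tmp/solr-heapdumps and the HEAP_DUMP_DIR variable are made up for illustration.

```shell
# Hypothetical dump directory. It must already exist, be writable by the
# Tomcat user, and have room for a file roughly the size of the Java heap.
HEAP_DUMP_DIR="${HEAP_DUMP_DIR:-/var/tmp/solr-heapdumps}"

# Append the two flags to whatever CATALINA_OPTS already holds.
# When HeapDumpPath names a directory, the JVM writes java_pid<pid>.hprof into it.
CATALINA_OPTS="${CATALINA_OPTS:-} -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${HEAP_DUMP_DIR}"
export CATALINA_OPTS
```

The flags only take effect after Tomcat is restarted.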

-- Ken

On Nov 30, 2010, at 3:04pm, Robert Petersen wrote:

Greetings, we are running one master and four slaves of our multicore
solr setup.  We just served searches for our catalog of 8 million
products with this farm during black Friday and cyber Monday, our
busiest days of the year, and the servers did not break a sweat!  Index
size is about 28GB.

However, twice now recently during a time of low load we have had a fire
drill where I have seen tomcat/solr fail and become unresponsive after
some OOM heap errors.  Solr wouldn't even serve up its admin pages.
I've had to go in and manually knock tomcat out of memory and then
restart it.  These solr slaves are load balanced and the load balancers
always probe the solr slaves so if they stop serving up searches they
are automatically removed from the load balancer.  When all four fail at
the same time we have an issue!

My question is this.  Why in the world would all of my slaves, after
running fine for some days, suddenly all at the exact same minute
experience OOM heap errors and go dead?  The load balancer kicks them
all out at the same time each time.  Each slave only talks to the master
and not to each other, but the master shows no errors in the logs at all.
Something must be triggering this though.  The only other odd thing I
saw in the logs was after the first OOM errors were recorded, the slaves
started occasionally not being able to get to the master.

This behavior makes me a little nervous...    =:-o  eek!





Environment:  Lucid Imagination distro of Solr 1.4 on Tomcat

Platform: RHEL with Sun JRE 1.6.0_18 on dual quad xeon machines with
64GB memory etc etc


--------------------------------------------
<http://ken-blog.krugler.org>
+1 530-265-2225






--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g


