Did you add the garbage collection JVM options I suggested?

-XX:+UseG1GC -XX:MaxGCPauseMillis=50

Guido.

On 09/12/13 16:33, Patrick O'Lone wrote:
Unfortunately, in a test environment, this happens in version 4.4.0 of
Solr as well.

I was trying to locate the release notes for 3.6.x, but it is too old. If I
were you I would update to 3.6.2 (from 3.6.1); since it is a minor release,
it shouldn't affect you. Locate the release notes and see whether something
that is affecting you got fixed. I would also think about moving to 4.x,
which is quite stable and fast.

Like anything with Java and concurrency, it will only get better (and
faster) as version numbers climb and the concurrency frameworks become more
reliable, standard, and stable.

Regards,

Guido.

On 09/12/13 15:07, Patrick O'Lone wrote:
I have a new question about this issue - I create filter queries of
the form:

fq=start_time:[* TO NOW/5MINUTE]

This is used to restrict the set of documents to only items that have a
start time within the next 5 minutes. Most of my indexes have millions
of documents, with few documents that start sometime in the future.
Nearly all of my queries include this filter. Would it cause every other
search thread to block until the filter query is re-cached every 5
minutes, and if so, is there a better way to do it? Thanks for any
continued help with this issue!
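For what it's worth, the behaviour of NOW/5MINUTE can be sketched in plain Java: it rounds the evaluation time down to a 5-minute boundary, so every query issued inside the same window produces an identical filter string and can reuse the same filterCache entry; the entry only has to be rebuilt when the window rolls over. A minimal illustration of that bucketing (java.time used here purely for demonstration; Solr does its own date math internally):

```java
import java.time.Instant;

public class DateMathBucket {
    // Round an instant down to its 5-minute boundary, the way NOW/5MINUTE does.
    static Instant floorTo5Minutes(Instant t) {
        long bucketMillis = 5 * 60 * 1000L;
        return Instant.ofEpochMilli((t.toEpochMilli() / bucketMillis) * bucketMillis);
    }

    public static void main(String[] args) {
        Instant a = Instant.parse("2013-12-09T15:07:10Z");
        Instant b = Instant.parse("2013-12-09T15:09:59Z");
        Instant c = Instant.parse("2013-12-09T15:10:00Z");

        // a and b fall in the same window -> identical filter, cache hit.
        System.out.println(floorTo5Minutes(a).equals(floorTo5Minutes(b))); // true
        // c starts a new window -> new filter string, cache rebuild.
        System.out.println(floorTo5Minutes(b).equals(floorTo5Minutes(c))); // false
    }
}
```

So at most one rebuild per 5-minute window is expected; the question of whether other threads block during that rebuild is separate, and in 3.6.1 the synchronization in the field/filter cache internals is the likely place to look.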

We have a webapp running with a very large heap (24GB) and we have had
no problems with it AFTER we enabled the new GC that is meant to eventually
replace the CMS GC. Note that you need a sufficiently recent Java 6 update
(I couldn't find the exact number, but the latest should cover it) to be able to use it:

1. Remove all the GC options you currently have, and...
2. Replace them with "-XX:+UseG1GC -XX:MaxGCPauseMillis=50"

As a test, of course. You can read more in the following (and
interesting) article. We also have Solr running with these options - no
more pauses or heap size hitting the sky.

Don't get bored reading the first (and small) introduction page of the
article; pages 2 and 3 will make a lot of sense:
http://www.drdobbs.com/jvm/g1-javas-garbage-first-garbage-collector/219401061



HTH,

Guido.

On 26/11/13 21:59, Patrick O'Lone wrote:
We do perform a lot of sorting - on multiple fields, in fact. We have
different kinds of Solr configurations - our news searches do little
with regards to faceting but sort heavily, while our classified ad
searches make heavy use of faceting. I might try reducing the JVM
memory some, along with the amount of perm generation, as suggested
earlier. It feels like a GC issue, and loading the cache just happens
to be the victim of a stop-the-world event at the worst possible time.

My gut instinct is that your heap size is way too high. Try
decreasing it to like 5-10G. I know you say it uses more than that,
but that just seems bizarre unless you're doing something like
faceting and/or sorting on every field.

-Michael

-----Original Message-----
From: Patrick O'Lone [mailto:pol...@townnews.com]
Sent: Tuesday, November 26, 2013 11:59 AM
To: solr-user@lucene.apache.org
Subject: Solr 3.6.1 stalling with high CPU and blocking on field cache

I've been tracking a problem in our Solr environment for a while with
periodic stalls of Solr 3.6.1. I'm running into a wall on ideas to
try and thought I might get some insight from others on this list.

The load on the server is normally anywhere between 1-3. It's an
8-core machine with 40GB of RAM. I have about 25GB of index data that
is replicated to this server every 5 minutes. It's taking about 200
connections per second and roughly every 5-10 minutes it will stall
for about 30 seconds to a minute. The stall causes the load to go to
as high as 90. It is all CPU bound in user space - all cores go to
99% utilization (spinlock?). When doing a thread dump, the following
line is blocked in all running Tomcat threads:

org.apache.lucene.search.FieldCacheImpl$Cache.get (FieldCacheImpl.java:230)

Looking at the source code in 3.6.1, that call sits inside a
synchronized block, which blocks all other threads and causes the
backlog. I've tried to correlate these events to the replication
events - but even with replication disabled, this still happens. We
run multiple data centers using Solr, and when I compared garbage
collection between them I noted that the old generation is collected
very differently in this data center versus the others. Here, the old
generation is collected in one massive collection event (several
gigabytes' worth); the other data center is more saw-toothed and
collects only 500MB-1GB at a time. Here are my parameters to java
(the same in all environments):
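That thread-dump signature is exactly what a synchronized get-or-compute cache produces under contention: while one thread populates an entry, every other thread that needs the same entry parks on the monitor. A minimal sketch of the pattern (hypothetical names, not the actual FieldCacheImpl code) showing that the value is computed once while all other callers block until it is ready:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class SyncCacheDemo {
    static final Map<String, Object> cache = new HashMap<>();
    static final AtomicInteger computations = new AtomicInteger();

    // Every reader funnels through one monitor, as in the blocked
    // FieldCacheImpl$Cache.get frame above; a slow compute stalls all callers.
    static synchronized Object get(String key) {
        Object v = cache.get(key);
        if (v == null) {
            computations.incrementAndGet();
            sleep(100); // simulate an expensive field un-inversion
            v = "uninverted:" + key;
            cache.put(key, v);
        }
        return v;
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    // Start nThreads readers of the same key and return how many computes ran.
    static int runDemo(int nThreads) {
        Thread[] threads = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            threads[i] = new Thread(() -> get("start_time"));
            threads[i].start();
        }
        for (Thread t : threads) {
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return computations.get();
    }

    public static void main(String[] args) {
        // All 8 threads contend on one monitor; the entry is computed exactly once.
        System.out.println("computations = " + runDemo(8)); // prints: computations = 1
    }
}
```

With 200 requests/second, even a short recompute behind such a monitor is enough to pile up hundreds of blocked threads, which matches the load spikes described.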

/usr/java/jre/bin/java \
-verbose:gc \
-XX:+PrintGCDetails \
-server \
-Dcom.sun.management.jmxremote \
-XX:+UseConcMarkSweepGC \
-XX:+UseParNewGC \
-XX:+CMSIncrementalMode \
-XX:+CMSParallelRemarkEnabled \
-XX:+CMSIncrementalPacing \
-XX:NewRatio=3 \
-Xms30720M \
-Xmx30720M \
-Djava.endorsed.dirs=/usr/local/share/apache-tomcat/endorsed \
-classpath /usr/local/share/apache-tomcat/bin/bootstrap.jar \
-Dcatalina.base=/usr/local/share/apache-tomcat \
-Dcatalina.home=/usr/local/share/apache-tomcat \
-Djava.io.tmpdir=/tmp \
org.apache.catalina.startup.Bootstrap start

I've tried a few GC option changes from this (we've been running this
way for a couple of years now) - primarily removing CMS incremental
mode, since we have 8 cores and remarks on the internet suggest it is
only for smaller SMP setups. Removing it did not fix anything.

I've considered that the heap is way too large (30GB out of 40GB) and
may not leave enough memory for mmap operations (MMap appears to be
used in the field cache). Based on active memory utilization in Java,
it seems like I might be able to reduce the heap to 22GB safely - but
I'm not sure if that will help with the CPU issues.

I think the field cache is used for sorting and faceting. I've started to
investigate facet.method, but from what I can tell, this doesn't
influence sorting at all - only facet queries. I've tried setting
useFilterForSortQuery, and it seems to require less field cache, but it
doesn't address the stalling issues.
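One mitigation sometimes suggested for 3.x is to pay the per-searcher field-cache population cost on a warming thread rather than in live request threads, by registering warming queries in solrconfig.xml that exercise the common sort and filter (element names from the stock solrconfig.xml; the queries below are hypothetical examples to adapt):

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- hypothetical warming queries: exercise the date filter and the sort field -->
    <lst>
      <str name="q">*:*</str>
      <str name="fq">start_time:[* TO NOW/5MINUTE]</str>
      <str name="sort">start_time desc</str>
    </lst>
  </arr>
</listener>

Note this only moves the per-searcher cost (e.g. un-inverting start_time for sorting) off the request path when a new searcher opens; the 5-minute filter rollover would still be recomputed by whichever request thread first crosses the boundary.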

Is there something I am overlooking? Perhaps the system is becoming
oversubscribed in terms of resources? Thanks for any help that is
offered.

--
Patrick O'Lone
Director of Software Development
TownNews.com

E-mail ... pol...@townnews.com
Phone .... 309-743-0809
Fax ...... 309-743-0830




