> Nope, no OOM errors.

That's a good start!

> Insanity count is 0 and fieldCache has 12 entries. We do use some boosting
> functions.
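If you want to keep an eye on those fieldCache entries between restarts
without attaching a profiler, the stats page should expose them. Something
like this ought to work (host and port assumed from the default Jetty setup,
and the stat names are from memory, so double-check them):

curl -s http://localhost:8983/solr/admin/stats.jsp | egrep "entries_count|insanity_count"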
> Btw, I am monitoring output via jconsole with 8gb of ram and it still goes
> to 8gb every 20 seconds or so, gc runs, falls down to 1gb.

Hmm, maybe the garbage collector takes up a lot of CPU time. Could you check
your garbage collector log? It can be enabled with some JVM options:

JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails
-Xloggc:/var/log/tomcat6/gc.log"

Also, what JVM version are you using and what are your other JVM settings?
Are Xms and Xmx set to the same value? I see you're using the throughput
collector. You might want to use CMS (the low-pause collector) because it
runs partially concurrently and causes fewer stop-the-world pauses.

http://download.oracle.com/javase/6/docs/technotes/guides/vm/cms-6.html

Again, this may not be the issue ;)
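To make it concrete, here is a JAVA_OPTS sketch of what I mean; the heap
sizes are placeholders, so tune them to your machines:

JAVA_OPTS="$JAVA_OPTS -Xms4000m -Xmx4000m \
 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
 -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails \
 -Xloggc:/var/log/tomcat6/gc.log"

Once the log has some data, counting the full collections is a quick first
check:

grep -c "Full GC" /var/log/tomcat6/gc.log

If that number climbs fast while the load is high, GC overhead is a likely
culprit.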
> Btw, our current revision was just a random choice but up until two weeks
> ago it has been rock-solid so we have been reluctant to update to another
> version. Would you recommend upgrading to latest trunk?

I don't know what changes have been made since your revision. Please consult
CHANGES.txt for that.

> > It might not have anything to do with memory at all but I'm just asking.
> > There may be a bug in your revision causing this.
> >
> > > Anyway, Xmx was 4000m, we tried increasing it to 8000m but did not get
> > > any improvement in load. I can try monitoring with jconsole with 8 gigs
> > > of heap to see if it helps.
> > >
> > > > Cheers,
> > > >
> > > > > Hello everyone,
> > > > >
> > > > > First of all, here is our Solr setup:
> > > > >
> > > > > - Solr nightly build 986158
> > > > > - Running Solr inside the default Jetty that comes with the Solr
> > > > >   build
> > > > > - 1 write-only master, 4 read-only slaves (quad core 5640 with 24gb
> > > > >   of RAM)
> > > > > - Index replicated (on optimize) to slaves via Solr Replication
> > > > > - Size of index is around 2.5gb
> > > > > - No incremental writes; the index is created from scratch (delete
> > > > >   old documents -> commit new documents -> optimize) every 6 hours
> > > > > - Avg # of requests per second is around 60 (for a single slave)
> > > > > - Avg time per request is around 25ms (before having problems)
> > > > > - Load on each slave is around 2
> > > > >
> > > > > We have been using this setup for months without any problem.
> > > > > However, last week we started to experience very weird performance
> > > > > problems like:
> > > > >
> > > > > - Avg time per request increased from 25ms to 200-300ms (even
> > > > >   higher if we don't restart the slaves)
> > > > > - Load on each slave increased from 2 to 15-20 (Solr uses
> > > > >   400%-600% CPU)
> > > > >
> > > > > When we profile Solr we see two very strange things:
> > > > >
> > > > > 1 - This is the jconsole output:
> > > > >
> > > > > https://skitch.com/meralan/rwwcf/mail-886x691
> > > > >
> > > > > As you see, gc runs every 10-15 seconds and collects more than 1gb
> > > > > of memory. (Actually, if you wait more than 10 minutes you see
> > > > > spikes up to 4gb consistently.)
> > > > >
> > > > > 2 - This is the newrelic output:
> > > > >
> > > > > https://skitch.com/meralan/rwwci/solr-requests-solr-new-relic-rpm
> > > > >
> > > > > As you see, Solr spends a ridiculously long time in the
> > > > > SolrDispatchFilter.doFilter() method.
> > > > >
> > > > > Apart from these, when we clean the index directory, re-replicate,
> > > > > and restart each slave one by one, we see some relief in the
> > > > > system, but after some time the servers start to melt down again.
> > > > > Although deleting the index and re-replicating doesn't solve the
> > > > > problem, we think these problems are somehow related to
> > > > > replication, because the symptoms started after a replication and
> > > > > the system temporarily heals itself after a fresh replication. I
> > > > > also see lucene-write.lock files on the slaves (we don't have
> > > > > write.lock files on the master), which I think we shouldn't see.
> > > > >
> > > > > If anyone can give any sort of ideas, we will appreciate it.
> > > > >
> > > > > Regards,
> > > > > Dogacan Guney
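One more thing about those lucene-write.lock files: the write lock is taken
by an IndexWriter, so on read-only slaves they are indeed suspicious. A quick
way to spot leftovers (the index path is a placeholder, point it at your data
directory):

find /path/to/solr/data -name "*write.lock"

If they reappear right after a replication finishes, that would support your
feeling that replication is involved.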