Hello everyone,

First of all, here is our Solr setup:

- Solr nightly build 986158
- Running Solr inside the default Jetty that ships with the Solr build
- 1 write-only master, 4 read-only slaves (quad-core 5640 with 24GB of RAM)
- Index replicated (on optimize) to the slaves via Solr replication
- Size of the index is around 2.5GB
- No incremental writes; the index is rebuilt from scratch (delete old documents -> commit new documents -> optimize) every 6 hours (a sketch of the rebuild job and our replication config is in the P.S. below)
- Avg number of requests per second is around 60 (for a single slave)
- Avg time per request is around 25ms (before the problems started)
- Load on each slave is around 2

We have been using this setup for months without any problems. However, last week we started to experience very weird performance problems:

- Avg time per request increased from 25ms to 200-300ms (even higher if we don't restart the slaves)
- Load on each slave increased from 2 to 15-20 (Solr uses 400%-600% CPU)

When we profile Solr we see two very strange things:

1 - This is the jconsole output: https://skitch.com/meralan/rwwcf/mail-886x691
As you can see, GC runs every 10-15 seconds and collects more than 1GB of memory. (If you watch for more than 10 minutes you see spikes of up to 4GB consistently.)

2 - This is the New Relic output: https://skitch.com/meralan/rwwci/solr-requests-solr-new-relic-rpm
As you can see, Solr spends a ridiculously long time in the SolrDispatchFilter.doFilter() method.

Apart from these, when we clean the index directory, re-replicate, and restart the slaves one by one, the system recovers, but after some time the servers start to melt down again. Although deleting the index and re-replicating doesn't solve the problem for good, we think these problems are somehow related to replication: the symptoms first appeared right after a replication, and the system recovers (temporarily) after we re-replicate. I also see lucene-write.lock files on the slaves (we don't have write.lock files on the master), which I think we shouldn't see.

If anyone can give us any sort of ideas, we would appreciate it.

Regards,
Dogacan Guney
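P.S. In case the details help: the 6-hourly rebuild job on the master boils down to the steps below. This is a minimal SolrJ sketch, not our actual code; the host name and loadDocuments() are placeholders for our real environment.

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import java.util.Collections;

    public class RebuildIndex {
        public static void main(String[] args) throws Exception {
            // All writes go to the write-only master; the slaves only replicate.
            SolrServer master = new CommonsHttpSolrServer("http://master:8983/solr");

            // 1. Delete all old documents.
            master.deleteByQuery("*:*");

            // 2. Add the new documents (loadDocuments() stands in for our real source).
            for (SolrInputDocument doc : loadDocuments()) {
                master.add(doc);
            }

            // 3. Commit, then optimize. Replication to the slaves is triggered
            //    by the optimize (replicateAfter=optimize on the master).
            master.commit();
            master.optimize();
        }

        private static Iterable<SolrInputDocument> loadDocuments() {
            return Collections.emptyList(); // placeholder
        }
    }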
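The replication handlers in solrconfig.xml look roughly like this (reconstructed from memory, so treat the exact values such as the poll interval as approximate):

On the master:

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">optimize</str>
      </lst>
    </requestHandler>

On each slave:

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://master:8983/solr/replication</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>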
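Finally, if plain GC logs are easier to read than the jconsole screenshot, the same pattern shows up when we start Jetty with standard HotSpot GC logging, along these lines (the heap size here is a placeholder, not our real setting):

    java -Xmx8g \
         -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
         -Xloggc:gc.log \
         -jar start.jar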