Thanks Emir. The index is split evenly between the two shards, each at approx
35gb. The total number of documents is around 11 million, which should be
distributed equally between the two shards, so roughly 5.5 million per shard.
So each core should take about 3gb of heap for a full filter cache. I'm not
sure I follow the "multiply it by the number of replicas" part: shouldn't each
replica have its own 3gb cache? Also, based on the SPM graphs, the filter
cache size during the outages peaked at around 1.5 million.
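
For what it's worth, this is the back-of-the-envelope calculation behind that
3gb figure (assuming each filterCache entry is stored as a full bitset of
maxDoc/8 bytes, which I understand to be the worst case):

    11,000,000 docs / 2 shards               ~= 5,500,000 docs per core
    5,500,000 bits / 8                       ~= 0.69 MB per cached filter entry
    0.69 MB * 4096 entries (current size)    ~= 2.8 GB per core, fully warmed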

The majority of our queries depend heavily on a few implicit filters plus
user-selected ones, and reducing the filter cache size to the current value of
4096 has hurt performance. Earlier (in 5.5), I had a max cache size of 10,000
(running on a 15gb heap), which produced a 95% hit rate. After the memory
issues started on 6.6, I reduced it to the current value, which dropped the
hit rate to about 25%. At one point I tried going even lower:
<filterCache class="solr.FastLRUCache" size="256" initialSize="256" autowarmCount="0"/>
That still didn't help, which is when I decided to go for machines with more RAM.
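
One thing I'm considering trying next is bounding the cache by memory rather
than by entry count. I believe solr.LRUCache supports a maxRamMB attribute
(I'm not certain FastLRUCache does in 6.6, so please correct me if this is
wrong); a sketch of what I have in mind:

    <!-- sketch only: maxRamMB evicts by estimated RAM usage;
         the 1024 figure is just an example, not a recommendation -->
    <filterCache class="solr.LRUCache"
                 size="10000"
                 initialSize="512"
                 autowarmCount="0"
                 maxRamMB="1024"/>

The idea is to keep a large entry count for the hit rate while still capping
what the cache can take from the heap.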

What I've noticed is that the heap sits consistently around the 22-23gb mark,
of which G1 Old Gen takes close to 13gb and G1 Eden Space around 6gb, with the
rest shared by G1 Survivor Space, Metaspace and the Code Cache.
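
For reference, this is the kind of detailed GC logging I could enable to
cross-check those numbers from the monitoring graphs (Java 8 style flags via
GC_LOG_OPTS in solr.in.sh; the log path is just a placeholder for my setup):

    # assumed Java 8 options; adjust the path for the actual install
    GC_LOG_OPTS="-Xloggc:/var/solr/logs/solr_gc.log \
      -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps \
      -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution \
      -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M"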

This issue has been bothering me, as I seem to be running out of tuning
options. What I can see from the monitoring tool is that the surge period saw
around 400 requests/hr with about 40 docs/sec being indexed. Is that really a
high volume of load for a 6-node cluster with 16 CPUs / 64gb RAM? What other
options should I be looking into?

The other thing I'm still confused about is why recovery fails even after the
memory has been freed up.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
