Hi Philippa,
My guess would be that you are running some heavy queries (faceting, deep
paging, large pages), have a high query load (can you give a bit more
detail about the load?), or have misconfigured caches. Do you query the
entire index, or do you have query routing?
You have big machines, so you might consider running two Solr instances on
each node (each with a smaller heap) and splitting shards so that queries
can be parallelized further, resources are better utilized, and each heap
is smaller for GC to manage.
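A rough sketch of that two-instances-per-node layout on Solr 5.x; the ports, heap sizes, home directories, collection/shard names, and ZooKeeper address below are illustrative placeholders, not your actual values:

```shell
# Two Solr 5.x instances on one node, each with a smaller heap.
# All paths, ports, sizes, and hosts here are placeholders.
bin/solr start -c -p 8983 -m 16g -s /var/solr/node1 -z zk1:2181,zk2:2181,zk3:2181
bin/solr start -c -p 8984 -m 16g -s /var/solr/node2 -z zk1:2181,zk2:2181,zk3:2181

# Then split an existing shard via the Collections API so the pieces
# can be spread across the instances:
curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1"
```

Two 16 GB heaps are generally much friendlier to G1 pause times than one 45 GB heap, at the cost of a little extra per-instance overhead.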
Regards,
Emir
On 08.12.2015 10:49, philippa griggs wrote:
Hello Erick,
Thanks for your reply.
We have one collection and are writing documents to it all the time; the
rate peaks at around 2,500 documents per minute and dips to around 250 per
minute, and the document size varies. Each node holds around 55,000,000
documents with a data size of 43 GB, located on a 200 GB drive.
Each node has 122 GB of memory; the heap size is currently set to 45 GB,
although we have plans to increase this to 50 GB.
The heap settings we are using are:
-XX:+UseG1GC
-XX:+ParallelRefProcEnabled
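For reference, a sketch of how flags like these are typically set in solr.in.sh on Solr 5.x, with GC logging added so that long pauses can be diagnosed after the fact; the heap size and log path are placeholders:

```shell
# Illustrative solr.in.sh fragment (Solr 5.x / Java 8); values are placeholders.
SOLR_HEAP="45g"
GC_TUNE="-XX:+UseG1GC -XX:+ParallelRefProcEnabled"
# GC logging makes minute-long pauses visible and timestamped:
GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/solr/logs/solr_gc.log"
```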
Please let me know if you need any more information.
Philippa
________________________________________
From: Erick Erickson <erickerick...@gmail.com>
Sent: 07 December 2015 16:53
To: solr-user
Subject: Re: Solr 5.2.1 Most solr nodes in a cluster going down at once.
Tell us a bit more.
Are you adding documents to your collections or adding more
collections? Solr is a balancing act between the number of docs you
have on each node and the memory you have allocated. If you're
continually adding docs to Solr, you'll eventually run out of memory
and/or hit big GC pauses.
How much memory are you allocating to Solr? How much physical memory
do you have? Etc.
Best,
Erick
On Mon, Dec 7, 2015 at 8:37 AM, philippa griggs
<philippa.gri...@hotmail.co.uk> wrote:
Hello,
I'm using:
Solr 5.2.1, 10 shards each with a replica (20 nodes in total)
ZooKeeper 3.4.6
About half a year ago we upgraded to Solr 5.2.1, and since then we have been
experiencing a 'wipe out' effect where, all of a sudden, most if not all nodes
go down. Sometimes they recover by themselves, but more often than not we
have to step in and restart nodes.
Nothing in the logs jumps out as the problem. With the latest wipe out we
noticed that 10 of the 20 nodes had garbage collection pauses of over one
minute, all at the same time, with heap usage spiking in some cases to 80%.
We also noticed that the number of selects run on the Solr cluster increased
just before the wipe out.
Increasing the heap size seems to help for a while, but then it starts
happening again, so it's more of a delay than a fix. Our GC settings are
-XX:+UseG1GC and -XX:+ParallelRefProcEnabled.
With our previous version of Solr (4.10.0) this didn't happen. We had
nodes/shards go down, but it was contained; with the new version they all
seem to go down at around the same time. We can't keep increasing the
heap size and would like to solve this issue rather than delay it.
Has anyone experienced something similar?
Is there a difference between the two versions around the recovery process?
Does anyone have any suggestions for a fix?
Many thanks
Philippa
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/