Hi Rick,
Do you see any errors in the logs? Do you have a monitoring tool? If so, check
the heap and GC metrics around the time the incident happened. The heap is not
large, but a major GC could still cause a pause long enough to trigger a
snowball effect and leave the node in recovery state.
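If you have no monitoring in place, a rough way to look for long pauses is to scan the GC log directly. The sketch below assumes a Java 8 JVM with -XX:+PrintGCApplicationStoppedTime enabled, so the log contains "Total time for which application threads were stopped" lines; the function name and the 1 second threshold are only illustrative choices.

```python
import re

# Line printed by HotSpot (Java 8) when -XX:+PrintGCApplicationStoppedTime is on.
# Assumption: your GC log uses this format; adjust the pattern if it does not.
STOPPED = re.compile(
    r"Total time for which application threads were stopped: ([0-9.]+) seconds"
)


def long_pauses(lines, threshold_s=1.0):
    """Return (line_prefix, pause_seconds) for every pause >= threshold_s.

    line_prefix is whatever precedes the message on that line, typically
    the date stamp and JVM uptime if those options are enabled.
    """
    hits = []
    for line in lines:
        match = STOPPED.search(line)
        if match is None:
            continue
        pause = float(match.group(1))
        if pause >= threshold_s:
            hits.append((line[:match.start()].strip(), pause))
    return hits
```

You could pass it the open GC log file (any iterable of lines works) and compare the reported times with the moment the replica went into recovery.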
What indexing rate do you observe? Why is max warming searchers set to 5 (is
that what you meant by autowarmingsearchers?) when you commit only every 5
minutes? Did you increase it because you saw errors with the default of 2?
Maybe you commit after every bulk?
Do you see similar behaviour when you only index, without running queries?
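
To make the "commit every bulk" question concrete: a bulk update that leaves commits to autocommit carries no commit=true parameter. Below is a minimal sketch of such a request, built but not sent. The base URL, collection name and helper name are hypothetical; the /update handler and the commitWithin parameter are standard Solr, but please compare against what your indexing processes actually send.

```python
import json
import urllib.parse
import urllib.request


def build_update_request(base_url, collection, docs, commit_within_ms=None):
    """Build (but do not send) a JSON bulk update with no explicit commit.

    base_url and collection are placeholders for your own setup.
    If commit_within_ms is given, Solr is asked to make the documents
    visible within that many milliseconds instead of on every request.
    """
    url = f"{base_url}/{collection}/update"
    if commit_within_ms is not None:
        url += "?" + urllib.parse.urlencode({"commitWithin": commit_within_ms})
    return urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

If your indexers add commit=true (or call commit from the client) after each batch of 500, then 5 to 7 processes would open new searchers far more often than the 5 minute autocommit suggests.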

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 4 Nov 2017, at 05:15, Rick Dig <teram...@gmail.com> wrote:
> 
> hello all,
> we are trying to run solrcloud 6.6 in a production setting.
> here's our config and issue
> 1) 3 nodes, 1 shard, replication factor 3
> 2) all nodes are 16GB RAM, 4 core
> 3) Our production load is about 2000 requests per minute
> 4) index is fairly small, index size is around 400 MB with 300k documents
> 5) autocommit is currently set to 5 minutes (even though ideally we would
> like a smaller interval).
> 6) the jvm runs with 8 gb Xms and Xmx with CMS gc.
> 7) all of this runs perfectly ok when indexing isn't happening. as soon as
> we start "nrt" indexing one of the follower nodes goes down within 10 to 20
> minutes. from this point on the nodes never recover unless we stop
> indexing.  the master usually is the last one to fall.
> 8) there are maybe 5 to 7 processes indexing at the same time with document
> batch sizes of 500.
> 9) maxRambuffersizeMB is 100, autowarmingsearchers is 5,
> 10) no cpu and / or oom issues that we can see.
> 11) cpu load does go fairly high 15 to 20 at times.
> any help or pointers appreciated
> 
> thanks
> rick
