On 6/2/2018 1:49 AM, solrnoobie wrote:
> Our team is having problems with our production setup in AWS.
> Our current setup is:
> - Dockerized solr nodes behind an ELB
Putting Solr behind a load balancer is a pretty normal thing to do.
> - zookeeper with exhibitor in a docker container (3 of these)
I don't know anything about exhibitor. You'd need to discuss that with
a zookeeper expert.
> - solr talks to a zookeeper through an ELB (should we even do this? we did
> this for recovery purposes, so if there are better ways to handle this,
> please describe them in your reply)
Definitely not. ZK is designed for fault tolerance *without* a load
balancer. Solr gets configured with all the ZK servers and will connect
to all of them at the same time. Every ZK server in the ensemble is
configured with a list of all ZK servers, and they communicate with each
other directly. Putting a load balancer in the mix can *cause*
problems, it won't solve them.
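As a sketch of what that looks like (host names here are placeholders, and the optional chroot is omitted), Solr gets the full ensemble in its ZK connection string, and each ZooKeeper server's zoo.cfg lists every member:

```shell
# solr.in.sh -- point Solr at every ZooKeeper server directly, no ELB:
ZK_HOST="zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181"

# zoo.cfg on each ZooKeeper server -- the ensemble lists all of its members:
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```

With that in place, Solr's ZK client handles server failures itself; there is nothing for an ELB to add.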
> - There are scripts on the zk nodes and solr nodes to monitor and restart
> docker containers if they go down.
In general it's probably not a good idea to restart Solr automatically
if it goes down. Instances of Solr crashing are EXTREMELY rare. I
bet if you asked the zookeeper project about automatically restarting
their software, they would tell you the same thing.
There is one relatively common scenario where Solr *will* stop running:
If Java experiences an OutOfMemoryError exception. Solr is designed to
kill itself when OOME is thrown, because program operation is completely
unpredictable after OOME. Stopping all operation is the only safe thing
to do.
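For reference, recent Solr versions wire this up through a JVM option added by the start script, which runs a kill script when OOME is thrown. The exact paths and port below are illustrative, not what your install will necessarily use:

```shell
# Illustrative: bin/solr passes an OnOutOfMemoryError hook to the JVM so the
# whole process dies rather than limping along after OOME (paths will vary):
-XX:OnOutOfMemoryError="/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs"
```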
This is why it's a bad idea to restart Solr after OOME: Encountering
that exception is caused by a resource shortage. Usually it's Java heap
memory that's run out, but there are other resource shortages that lead
to that exception. Once Solr gets back online and begins handling
load, the same resource shortage is almost certain to recur. It
could happen repeatedly, leading to a
constant restart cycle that becomes a stability nightmare. Instead of
immediately restarting, the resource shortage problem must be fixed.
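The usual fix is to raise whatever limit ran out rather than restart. For Java heap, that means solr.in.sh; the value below is only an example, and the right number depends entirely on your index and query load:

```shell
# solr.in.sh -- example only; size the heap for YOUR index, don't copy this:
SOLR_HEAP="4g"
```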
> So in production, solr nodes sometimes go down and will be restarted by the
> scripts. During recovery, some shards won't have a leader and because of
> that, indexing won't work. Adding replicas will also sometimes yield
> multiple replicas on the same node, a lot more than we want (we added
> one and got eight at one time).
Often when one node dies because of a resource shortage, it can cause
the other nodes to take on more load and then *also* die because of the
same kind of resource shortage. Outages on multiple servers in quick
succession can be one reason for having recovery problems. One thing
you might need to do when the cloud becomes unstable is to shut down all
Solr servers and then start them back up one at a time and make sure
that everything on that server has recovered before starting another one.
One thing that's important to say again: Except for when Solr is killed
by its own OOM killer or by the OOM killer in the operating system, Solr
basically NEVER crashes. I'm not saying that it can't happen, but I've
only ever seen it in cases where the server hardware was failing.
If the OOM killer in the OS is responsible for Solr stopping, then your
Solr logfile will not record any exceptions. When the OS-level OOM
killer is triggered, it's usually an indication that a serious mistake
has been made in choosing Solr's max heap size.
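If you suspect the OS-level killer, the kernel log will say so explicitly. A typical check (log location and tooling vary by distro):

```shell
# Look for OOM-killer activity in the kernel log:
dmesg | grep -i "out of memory"
grep -i "oom-killer" /var/log/syslog
```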
It's hard to say exactly why you might end up with shards that don't
have a leader. Check your solr logfiles for error messages. I will say
that the automatic restarts you've described could be a big part of the
problem.
Just so you know, Solr tends to want there to be a LOT of memory
available. The amount required for good performance is sometimes
shocking to users. Here's a page that describes some Solr performance
problems, and tries to explain that memory is typically the resource
that's at the root of most of those problems:
https://wiki.apache.org/solr/SolrPerformanceProblems
Thanks,
Shawn