On 6/2/2018 1:49 AM, solrnoobie wrote:
Our team is having problems with our production setup in AWS.

Our current setup is:
- Dockerized solr nodes behind an ELB

Putting Solr behind a load balancer is a pretty normal thing to do.

- zookeeper with exhibitor in a docker container (3 of this set)

I don't know anything about exhibitor.  You'd need to discuss that with a zookeeper expert.

- solr talks to a zookeeper through an ELB (should we even do this? we did
this for recovery purposes so if there are better ways to handle this,
please describe it in your reply)

Definitely not.  ZK is designed for fault tolerance *without* a load balancer.  Solr gets configured with the addresses of all the ZK servers, and its ZooKeeper client fails over between them on its own.  Every ZK server in the ensemble is configured with a list of all ZK servers, and they communicate with each other directly.  Putting a load balancer in the mix can *cause* problems; it won't solve them.
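To make "configured with all the ZK servers" concrete, here is a minimal sketch that builds the comma-separated connection string Solr expects. The hostnames and the /solr chroot are placeholders, not values from this thread, and the helper name build_zk_host is made up for illustration.

```python
def build_zk_host(servers, chroot=""):
    """Join ZooKeeper servers into a single connection string.

    Each entry is "host" or "host:port"; 2181 (the usual ZooKeeper
    client port) is filled in when the port is omitted. The optional
    chroot is appended once, after the last server.
    """
    parts = []
    for server in servers:
        host, _, port = server.partition(":")
        parts.append(f"{host}:{port or 2181}")
    return ",".join(parts) + chroot


# Hypothetical ensemble: list every ZK server directly, no load balancer.
zk_host = build_zk_host(
    ["zk1.example.com", "zk2.example.com:2181", "zk3.example.com"],
    chroot="/solr",
)
print(zk_host)
```

The resulting string is what goes in ZK_HOST in solr.in.sh, or after the -z option of bin/solr start.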

- There are scripts on the zk nodes and solr nodes to monitor and restart docker
containers if they go down.

In general it's probably not a good idea to restart Solr automatically if it goes down.  Incidents where Solr crashes are EXTREMELY rare.  I bet if you asked the zookeeper project about automatically restarting their software, they would tell you the same thing.

There is one relatively common scenario where Solr *will* stop running:  when Java throws an OutOfMemoryError (OOME). Solr is designed to kill itself when OOME is thrown, because program operation is completely unpredictable after OOME. Stopping all operation is the only safe thing to do.

This is why it's a bad idea to restart Solr after OOME: That exception is caused by a resource shortage. Usually it's Java heap memory that's run out, but other resource shortages can lead to the same exception.  Once Solr gets back online and begins handling load, the same resource shortage is almost certain to happen again.  It could happen repeatedly, leading to a constant restart cycle that becomes a stability nightmare.  Instead of immediately restarting, the resource shortage problem must be fixed.
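If a monitoring script stays in place, it could at least refuse to restart after an OOME. The sketch below assumes that Solr's OOM killer script leaves a file named solr_oom_killer-<port>-<timestamp>.log in the Solr logs directory; that naming is an assumption to verify against your Solr version, and the function names are made up for illustration.

```python
from pathlib import Path


def oom_kill_logs(log_dir):
    """Return OOM-killer log files found in log_dir, sorted by name.

    The "solr_oom_killer-*.log" pattern is an assumption about how the
    OOM killer script names its output; check it on your installation.
    """
    return sorted(Path(log_dir).glob("solr_oom_killer-*.log"))


def should_auto_restart(log_dir):
    """Allow an automatic restart only when no OOM-killer log exists.

    If one is present, the resource shortage has to be fixed by a person
    first, otherwise the node is likely to die again under load.
    """
    return not oom_kill_logs(log_dir)
```

A script using this would need someone to move or delete the old OOM log once the underlying shortage has been dealt with.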

So in production, Solr nodes sometimes go down and are restarted by the
scripts. During recovery, some shards won't have a leader and because of
that, indexing won't work. Adding replicas will also sometimes result in
multiple replicas on the same node, many more than we wanted (we added
one and got eight at one time).

Often when one node dies because of a resource shortage, it can cause the other nodes to take on more load and then *also* die because of the same kind of resource shortage.  Outages on multiple servers in quick succession can be one reason for having recovery problems.  One thing you might need to do when the cloud becomes unstable is to shut down all Solr servers and then start them back up one at a time, making sure that everything on each server has recovered before starting the next one.
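One way to check that "everything on that server has recovered" is to ask the Collections API for CLUSTERSTATUS and look at the state of every replica hosted on the node that was just started. This is a rough sketch, not a tested procedure: the base URL and node name are placeholders, and the function names are made up for illustration.

```python
import json
from urllib.request import urlopen


def fetch_cluster_status(base_url):
    """Fetch CLUSTERSTATUS as a dict; base_url looks like http://host:8983/solr."""
    url = base_url.rstrip("/") + "/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def unrecovered_replicas(cluster_status, node_name):
    """List (collection, shard, replica) on node_name whose state is not "active"."""
    pending = []
    collections = cluster_status["cluster"]["collections"]
    for coll_name, coll in collections.items():
        for shard_name, shard in coll["shards"].items():
            for replica_name, replica in shard["replicas"].items():
                on_node = replica["node_name"] == node_name
                if on_node and replica["state"] != "active":
                    pending.append((coll_name, shard_name, replica_name))
    return pending
```

Start the next server only once unrecovered_replicas() returns an empty list for the one just started, for example by polling fetch_cluster_status() every few seconds.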

One thing that's important to say again: Except for when Solr is killed by its own OOM killer or by the OOM killer in the operating system, Solr basically NEVER crashes.  I'm not saying that it can't happen, but I've only ever seen it in cases where the server hardware was failing.

If the OOM killer in the OS is responsible for Solr stopping, then your Solr logfile will not record any exceptions. When the OS-level OOM killer is triggered, it's usually an indication that a serious mistake has been made in choosing Solr's max heap size.
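Since the Solr log stays silent in that case, the place to look is the kernel log. The sketch below scans kernel log lines for OOM-killer entries that name a java process. The "Killed process <pid> (<name>)" wording is an assumption about the kernel's message format, which varies between kernel versions, so adjust the pattern to what your systems actually print.

```python
import re

# Assumed message shape, e.g. "... Killed process 1234 (java) ...".
KILLED = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def oom_killed_java(kernel_log_lines):
    """Return the PIDs of java processes the OS OOM killer reported killing."""
    pids = []
    for line in kernel_log_lines:
        match = KILLED.search(line)
        if match and match.group(2) == "java":
            pids.append(int(match.group(1)))
    return pids
```

The lines can come from the output of dmesg or journalctl -k, split on newlines.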

It's hard to say exactly why you might end up with shards that don't have a leader.  Check your solr logfiles for error messages.  I will say that the automatic restarts you've described could be a big part of the problem.
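Alongside the logs, the cluster state itself shows which shards are currently leaderless. A minimal sketch, assuming the input is the parsed JSON from /solr/admin/collections?action=CLUSTERSTATUS&wt=json, where the leader replica of each shard carries "leader": "true". The function name is made up for illustration.

```python
def shards_without_leader(cluster_status):
    """Return (collection, shard) pairs in which no replica is marked leader."""
    missing = []
    collections = cluster_status["cluster"]["collections"]
    for coll_name, coll in collections.items():
        for shard_name, shard in coll["shards"].items():
            has_leader = any(
                # The flag is normally the string "true"; str() also
                # tolerates a JSON boolean.
                str(replica.get("leader", "false")).lower() == "true"
                for replica in shard["replicas"].values()
            )
            if not has_leader:
                missing.append((coll_name, shard_name))
    return missing
```

This reads only the recorded cluster state. A replica can still be flagged as leader while its node is down, so treat an empty result as a hint, not as proof that every shard is healthy.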

Just so you know, Solr tends to want there to be a LOT of memory available.  The amount required for good performance is sometimes shocking to users.  Here's a page that describes some Solr performance problems, and tries to explain that memory is typically the resource that's at the root of most of those problems:

https://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn
