On 6/2/2018 1:49 AM, solrnoobie wrote:
> Our team is having problems with our production setup in AWS.
> Our current setup is:
> - Dockerized solr nodes behind an ELB
Putting Solr behind a load balancer is a pretty normal thing to do.
> - zookeeper with exhibitor in a docker container (3 of these)
I don't know anything about exhibitor. You'd need to discuss that with
a zookeeper expert.
> - solr talks to a zookeeper through an ELB (should we even do this? we did
> this for recovery purposes, so if there are better ways to handle this,
> please describe them in your reply)
Definitely not. ZK is designed for fault tolerance *without* a load
balancer. Solr gets configured with all the ZK servers and will connect
to all of them at the same time. Every ZK server in the ensemble is
configured with a list of all ZK servers, and they communicate with each
other directly. Putting a load balancer in the mix can *cause*
problems, it won't solve them.
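As a sketch of what that looks like (host names here are placeholders, and the optional chroot is omitted), Solr gets the full ensemble in its ZK connection string, and each ZooKeeper server's zoo.cfg lists every member:

```shell
# solr.in.sh -- point Solr at every ZooKeeper server directly, no ELB:
ZK_HOST="zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181"

# zoo.cfg on each ZooKeeper server -- the ensemble lists all of its members:
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```

With that in place, Solr's ZK client handles server failures itself; there is nothing for an ELB to add.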
> - There are scripts on the zk nodes and solr nodes to monitor and restart
> docker containers if they go down.
In general it's probably not a good idea to restart Solr automatically
if it goes down. Instances of Solr crashing are EXTREMELY rare. I
bet if you asked the zookeeper project about automatically restarting
their software, they would tell you the same thing.
There is one relatively common scenario where Solr *will* stop running:
If Java experiences an OutOfMemoryError exception. Solr is designed to
kill itself when OOME is thrown, because program operation is completely
unpredictable after OOME. Stopping all operation is the only safe thing
to do.
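For reference, recent Solr versions wire this up through a JVM option added by the start script, which runs a kill script when OOME is thrown. The exact paths and port below are illustrative, not what your install will necessarily use:

```shell
# Illustrative: bin/solr passes an OnOutOfMemoryError hook to the JVM so the
# whole process dies rather than limping along after OOME (paths will vary):
-XX:OnOutOfMemoryError="/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs"
```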
This is why it's a bad idea to restart Solr after OOME: Encountering
that exception is caused by a resource shortage. Usually it's Java heap
memory that's run out, but there are other resource shortages that lead
to that exception. Once Solr gets back online and begins handling
load, the same resource shortage is almost certain to recur. It
could happen repeatedly, leading to a
constant restart cycle that becomes a stability nightmare. Instead of
immediately restarting, the resource shortage problem must be fixed.
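The usual fix is to raise whatever limit ran out rather than restart. For Java heap, that means solr.in.sh; the value below is only an example, and the right number depends entirely on your index and query load:

```shell
# solr.in.sh -- example only; size the heap for YOUR index, don't copy this:
SOLR_HEAP="4g"
```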
> So in production, solr nodes sometimes go down and will be restarted by the
> scripts. During recovery, some shards won't have a leader and because of
> that, indexing won't work. Adding replicas will also sometimes yield
> multiple replicas on the same node, a lot more than we want (we added
> one and got eight at one time).
Often when one node dies because of a resource shortage, it can cause
the other nodes to take on more load and then *also* die because of the
same kind of resource shortage. Outages on multiple servers in quick
succession can be one reason for having recovery problems. One thing
you might need to do when the cloud becomes unstable is to shut down all
Solr servers and then start them back up one at a time and make sure
that everything on that server has recovered before starting another one.
One thing that's important to say again: Except for when Solr is killed
by its own OOM killer or by the OOM killer in the operating system, Solr
basically NEVER crashes. I'm not saying that it can't happen, but I've
only ever seen it in cases where the server hardware was failing.
If the OOM killer in the OS is responsible for Solr stopping, then your
Solr logfile will not record any exceptions. When the OS-level OOM
killer is triggered, it's usually an indication that a serious mistake
has been made in choosing Solr's max heap size.
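If you suspect the OS-level killer, the kernel log will say so explicitly. A typical check (log location and tooling vary by distro):

```shell
# Look for OOM-killer activity in the kernel log:
dmesg | grep -i "out of memory"
grep -i "oom-killer" /var/log/syslog
```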
It's hard to say exactly why you might end up with shards that don't
have a leader. Check your solr logfiles for error messages. I will say
that the automatic restarts you've described could be a big part of the
problem.
Just so you know, Solr tends to want there to be a LOT of memory
available. The amount required for good performance is sometimes
shocking to users. Here's a page that describes some Solr performance
problems, and tries to explain that memory is typically the resource
that's at the root of most of those problems:
https://wiki.apache.org/solr/SolrPerformanceProblems
Thanks,
Shawn