Re: Solr Cloud (6.6.3), Zookeeper(3.4.10) and ELB's

Shawn Heisey Sat, 02 Jun 2018 08:32:28 -0700

On 6/2/2018 5:20 AM, solrnoobie wrote:

Thank you for pointing out our error in having an ELB on top of a zookeeper.
We did this so that we could recover a node if it goes down without the need
to have a rolling restart of the solr nodes. I guess we will try an elastic
IP instead because part of our requirement is that it should automatically
spawn an EC2 instance with a zk node if for some reason one instance fails.
I guess this way we still won't need to restart our solr nodes and still
replace the zknode(s) behind an elastic IP?

ZK servers and clients make TCP connections to all of the servers in their config, and if things are working right, don't ever close those connections. If you put a load balancer in there, it can REALLY confuse the system making the connection.

If you have three ZK servers and one of them fails, all the clients and remaining servers should be able to deal with this, and when the server comes back, they should deal with that too. If that doesn't happen, it might be a bug in ZK and the ZK project will treat it seriously.
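One way to see which ensemble members are actually answering at a given moment is to send each server ZooKeeper's `ruok` four-letter command directly, bypassing any load balancer. The following is only a minimal sketch: the function names are made up for illustration, port 2181 is assumed as the client port, and depending on the ZK version and its `4lw.commands.whitelist` setting, `ruok` may need to be enabled on the server.

```python
import socket


def zk_four_letter(host, port=2181, command="ruok", timeout=3.0):
    """Send a ZooKeeper four-letter command and return the reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def ensemble_status(servers):
    """Map each "host:port" entry to True/False (replied imok or not),
    or to an error string if the server could not be reached."""
    status = {}
    for server in servers:
        host, _, port = server.partition(":")
        try:
            reply = zk_four_letter(host, int(port or 2181))
            status[server] = reply == "imok"
        except OSError as exc:  # refused, timed out, DNS failure, ...
            status[server] = f"unreachable: {exc}"
    return status
```

Calling `ensemble_status(["zk1.example.com:2181", "zk2.example.com:2181", "zk3.example.com:2181"])` (placeholder hostnames) from a Solr node at regular intervals would give a simple record of when individual ZK servers stopped answering.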

ZK version 3.4.x, which is the current stable release and what Solr ships with, cannot dynamically add or remove servers, so spinning up a brand new ZK server is not going to work out. To add or remove a server in the ZK cluster, *EVERY* client and server is going to need to be manually reconfigured and restarted.
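To make the static membership concrete: in 3.4.x the ensemble is listed as `server.N=host:quorumPort:electionPort` lines in each server's `zoo.cfg`, and Solr is pointed at the same hosts through its `zkHost` string. The sketch below derives a `zkHost` value from `zoo.cfg` text. The helper name, the example hostnames and the `/solr` chroot are assumptions for illustration, and it assumes every server uses the same `clientPort`.

```python
def zk_host_from_zoo_cfg(cfg_text, chroot=""):
    """Build a Solr zkHost string from the text of a zoo.cfg file."""
    client_port = "2181"  # conventional ZooKeeper client port, used if unset
    hosts = []
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "clientPort":
            client_port = value
        elif key.startswith("server."):
            # server.N=host:quorumPort:electionPort -> keep only the host
            hosts.append(value.split(":")[0])
    return ",".join(f"{h}:{client_port}" for h in hosts) + chroot


# Hypothetical three-node ensemble; hostnames are placeholders.
EXAMPLE_CFG = """\
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
"""

print(zk_host_from_zoo_cfg(EXAMPLE_CFG, chroot="/solr"))
```

Because both the `server.N` lines and the `zkHost` string name each server individually, replacing a failed server is simplest when the replacement comes up under the same hostname or address as the one it replaces.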

Dynamic ensemble membership is available in ZK 3.5.x, which is currently in beta. If I had to guess about when Solr will upgrade, I would say it will happen on the second or third stable 3.5.x release, so there is enough time to be sure the software really is battle-tested. ZK has a *VERY* slow release cycle, so I am expecting this to take several months. The upgrade is not going to happen in Solr 6.x, though. Expect it in a later 7.x release or maybe 8.0.

I'm guessing we are experiencing problems with leader election because the
solr nodes can't maintain a tcp connection with the zknodes but I don't have
a way of proving that so our team can't really pitch this to our architect.
I hope someone here can help me with this since it has been a problem for a
LONG time now and we are getting a lot of flak from the other stakeholders
because of this.

I hope what I've said above is helpful. I think that eliminating load balancer usage for ZK and automatic service restart will help. If you ARE experiencing situations where the services die or stop responding, chances are really good that you are running into OOME. If that is what's happening, you will need to figure out what resource is short and make more of that resource available. It's usually Java heap memory, but it could be other things like the inability to start a new thread because the OS has a low limit on the number of processes that a user is allowed to start. You'll have to check logs to see exactly what went wrong. If the logfile doesn't show anything, then the OS might have decided to kill the process for its own reasons, which should be in the system log.
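As a rough aid for that log check, the sketch below counts lines that look like the usual out-of-memory symptoms. The patterns are message fragments the JVM and the Linux kernel commonly write; the category labels and function names are invented for illustration, and the list is not exhaustive.

```python
import re

# First matching pattern wins, so the specific messages come before the
# generic OutOfMemoryError catch-all. Labels are illustrative only.
PATTERNS = [
    ("heap exhausted", re.compile(r"OutOfMemoryError: Java heap space")),
    ("thread limit",
     re.compile(r"OutOfMemoryError: unable to create new native thread")),
    ("gc overhead", re.compile(r"OutOfMemoryError: GC overhead limit exceeded")),
    ("other OOME", re.compile(r"OutOfMemoryError")),
    ("kernel OOM killer", re.compile(r"Out of memory: Kill(ed)? process")),
]


def classify(line):
    """Return the label of the first pattern found in the line, or None."""
    for label, pattern in PATTERNS:
        if pattern.search(line):
            return label
    return None


def scan(lines):
    """Count matching lines per label across an iterable of log lines."""
    counts = {}
    for line in lines:
        label = classify(line)
        if label:
            counts[label] = counts.get(label, 0) + 1
    return counts
```

To use it, open the Solr log and the system log (their locations depend on the installation) and pass each file object to `scan`. A "thread limit" hit points at the OS process limit rather than heap size, and a "kernel OOM killer" hit in the system log means the OS ended the process itself.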

The Solr 6.6.3 version is not a bad choice. The latest 6.x release is 6.6.4, but the problem fixed in 6.6.4 is probably not affecting you. There's a lot of work in 7.x for SolrCloud stability, but a major version upgrade is not something to treat lightly, and should only be something that you attempt if it is *ALREADY* what you plan to do.


Thanks,
Shawn
