On 6/2/2018 5:20 AM, solrnoobie wrote:
> Thank you for pointing out our error in having an ELB on top of ZooKeeper. We did this so that we could recover a node if it goes down without the need for a rolling restart of the Solr nodes. I guess we will try an elastic IP instead, because part of our requirement is that it should automatically spawn an EC2 instance with a ZK node if for some reason one instance fails. I guess this way we still won't need to restart our Solr nodes and can still replace the ZK node(s) behind an elastic IP?
ZK servers and clients make TCP connections to all of the servers in their config, and if things are working right, don't ever close those connections. If you put a load balancer in there, it can REALLY confuse the system making the connection.
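As a rough illustration of what "all of the servers in their config" means on the Solr side, here is a minimal Python sketch that builds a zkHost-style connection string naming every ZK server directly, rather than a single load balancer address. The hostnames, the /solr chroot, and the build_zk_host helper are all made up for the example.

```python
def build_zk_host(servers, port=2181, chroot=""):
    """Return a ZooKeeper connection string that lists every server.

    Format: host1:port,host2:port,host3:port[/chroot]
    """
    if not servers:
        raise ValueError("at least one ZK server is required")
    if chroot and not chroot.startswith("/"):
        raise ValueError("chroot must start with '/'")
    hosts = ",".join(f"{host}:{port}" for host in servers)
    return hosts + chroot


# Hypothetical hostnames; substitute the real addresses of your ZK servers.
zk_host = build_zk_host(
    ["zk1.example.com", "zk2.example.com", "zk3.example.com"],
    chroot="/solr",
)
print(zk_host)
```

The point is only that the client is handed the whole list and manages its own connections; nothing sits in between.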
If you have three ZK servers and one of them fails, all the clients and remaining servers should be able to deal with this, and when the server comes back, they should deal with that too. If that doesn't happen, it might be a bug in ZK and the ZK project will treat it seriously.
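If you want some evidence about whether each server is reachable on its own, one option is to send ZooKeeper's four-letter "ruok" command straight to every server's client port. A server that is running answers "imok". What follows is a minimal sketch, assuming those commands are enabled on your servers and that 2181 is the client port. The zk_ruok and check_ensemble helpers are hypothetical names, and a reply of "imok" only says the process is up, not that it has joined the quorum.

```python
import socket


def zk_ruok(host, port=2181, timeout=2.0):
    """Send 'ruok' to one ZK server; return True only if it replies 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            chunks = []
            while True:
                data = sock.recv(64)
                if not data:  # the server closes the connection after replying
                    break
                chunks.append(data)
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
    return b"".join(chunks) == b"imok"


def check_ensemble(servers, port=2181):
    """Probe every server directly and return {host: reachable}."""
    return {host: zk_ruok(host, port) for host in servers}
```

Running check_ensemble from each Solr node against the real ZK addresses, with no load balancer in the path, should show which servers that node can actually reach.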
ZK version 3.4.x, which is the current stable release and what Solr ships with, cannot dynamically add or remove servers, so spinning up a brand new ZK server is not going to work out. To add or remove a server in the ZK cluster, *EVERY* client and server is going to need to be manually reconfigured and restarted.
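To show why this is a manual job: in 3.4.x the ensemble membership is a static list of server.N lines in zoo.cfg on every server, and every client carries its own list of servers as well. The sketch below just renders that list. The hostnames and the ensemble_lines helper are made up, and 2888/3888 are the ports conventionally shown in the ZK documentation.

```python
def ensemble_lines(hosts, peer_port=2888, election_port=3888):
    """Render the static 'server.N=host:peer:election' lines for zoo.cfg."""
    return [
        f"server.{server_id}={host}:{peer_port}:{election_port}"
        for server_id, host in enumerate(hosts, start=1)
    ]


# Hypothetical three-node ensemble.
three = ensemble_lines(["zk1.example.com", "zk2.example.com", "zk3.example.com"])

# Adding a fourth server changes the list that every node must carry.
four = ensemble_lines(
    ["zk1.example.com", "zk2.example.com", "zk3.example.com", "zk4.example.com"]
)

print("\n".join(three))
```

The extra line in the second list has to be written into the config on every existing server, and the clients need the new address too, which is where the reconfigure-and-restart work comes from.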
Dynamic ensemble membership is available in ZK 3.5.x, which is currently in beta. If I had to guess about when Solr will upgrade, I would say it will happen on the second or third stable 3.5.x release, so there is enough time to be sure the software really is battle-tested. ZK has a *VERY* slow release cycle, so I am expecting this to take several months. The upgrade is not going to happen in Solr 6.x, though. Expect it in a later 7.x release or maybe 8.0.
> I'm guessing we are experiencing problems with leader election because the Solr nodes can't maintain a TCP connection with the ZK nodes, but I don't have a way of proving that, so our team can't really pitch this to our architect. I hope someone here can help me with this since it has been a problem for a LONG time now and we are getting a lot of flak from the other stakeholders because of this.
I hope what I've said above is helpful. I think that eliminating load balancer usage for ZK and automatic service restart will help. If you ARE experiencing situations where the services die or stop responding, chances are really good that you are running into OOME (Java's OutOfMemoryError). If that is what's happening, you will need to figure out what resource is short and make more of that resource available. It's usually Java heap memory, but it could be other things like the inability to start a new thread because the OS has a low limit on the number of processes that a user is allowed to start. You'll have to check logs to see exactly what went wrong. If the log file doesn't show anything, then the OS might have decided to kill the process for its own reasons, which should be in the system log.
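For the log checking, something like the following minimal sketch can do a first pass. It looks for the Java exception name and for the messages the Linux kernel writes when its OOM killer acts. The find_oom_evidence helper is a hypothetical name, the patterns are my assumption about what is worth matching, and the exact kernel wording varies between kernel versions.

```python
import re

# Patterns assumed to be worth flagging; adjust for your JVM and kernel.
PATTERNS = {
    "java_oom": re.compile(r"java\.lang\.OutOfMemoryError"),
    "kernel_oom_killer": re.compile(
        r"Out of memory: Kill(ed)? process|invoked oom-killer"
    ),
}


def find_oom_evidence(lines):
    """Return (line_number, label, text) for every line matching a pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label, line.rstrip()))
    return hits
```

Feed it the lines of the Solr and ZK logs first, then the system log. A java_oom hit normally says which resource ran out, for example "Java heap space" or "unable to create new native thread", while a kernel_oom_killer hit means the OS killed the process itself.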
Solr 6.6.3 is not a bad choice. The latest 6.x release is 6.6.4, but the problem fixed in 6.6.4 is probably not affecting you. There's a lot of work in 7.x for SolrCloud stability, but a major version upgrade is not something to treat lightly, and should only be something that you attempt if it is *ALREADY* what you plan to do.
Thanks, Shawn