Thanks Erick, you've been very helpful. One other question: is it reasonable to upgrade ZooKeeper in place under an existing SOLR deployment? I see that SOLR-12727 appears to be verified against SOLR 7, modulo some test issues. For SOLR 6.6, would upgrading ZooKeeper to that version be advisable, or would you say it would be risky? Of course I'll stage it in a test environment first, but it's hard to get the full story from that alone...
Thanks!

On Thu, Dec 13, 2018 at 7:09 PM Erick Erickson <erickerick...@gmail.com> wrote:
> bq. will the leader still report that there were two followers, even
> if one of them bounced
>
> I really can't say; I took the ZK folks at their word and upgraded.
>
> I would think that restarting your ZK nodes should reestablish that
> they are all talking to each other; you may need to restart your
> Solr instances to see it take effect.
>
> Sorry I can't be more help,
> Erick
>
> On Thu, Dec 13, 2018 at 3:15 PM Stephen Lewis Bianamara
> <stephen.bianam...@gmail.com> wrote:
> >
> > Thanks for the help, Erick.
> >
> > This is an external ZooKeeper, running on three AWS instances
> > separate from the instances hosting SOLR. I think I have some more
> > insight based on the bug you sent and some more log crawling.
> >
> > In October we had an instance retirement, wherein the instance was
> > automatically stopped and restarted. We verified on that instance
> > that echo ruok | nc localhost <<PORT>> returned imok. But I just
> > looked at that node with echo mntr | nc localhost <<PORT>>, and it
> > appears never to have served a request! The first time I ran it,
> > there was 1 packet sent/received; the next time, 2 of each; the
> > next time, three... It is reporting exactly the number of times I
> > have run echo mntr | nc localhost <<PORT>> :) The other two
> > machines each show millions of packets sent/received. It's quite
> > strange, because the leader ZooKeeper now reports 2 synced
> > followers, yet I wonder why that node has never served a request
> > if that's true. Quite bizarre.
> >
> > The three instances talk over internal DNS. I'm not totally sure
> > whether the private IP of the instance changed after its
> > stop/start; I have seen this both change and not change on AWS,
> > and I'm not sure what controls whether a stop/start changes the
> > private IP.
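One quick check on the private-IP question: compare what each ensemble hostname resolves to now against the addresses the peers logged before the restart. A minimal sketch; it only shows current resolution, not what a long-running ZK process has cached, and the hostnames must be substituted with the real ones:

```shell
# Sketch: print the address a hostname currently resolves to, for
# comparison against the IPs the peers were connected to before the
# stop/start. 'localhost' below is a stand-in; substitute the real
# ensemble hostnames, e.g. zookeeper-1.dns.domain.foo.
resolve_ip() {
  getent hosts "$1" | awk '{ print $1; exit }'
}

for host in localhost; do
  printf '%s -> %s\n' "$host" "$(resolve_ip "$host")"
done
```

If a resolved address differs from what a peer last logged, that points toward the stale-resolution behavior described in SOLR-12727.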
> > But I wonder if we can rule anything out; in the case of the DNS
> > bug SOLR-12727
> > <https://issues.apache.org/jira/browse/SOLR-12727>, will the
> > leader still report that there were two followers, even if one of
> > them bounced?
> >
> > Finally, this log line appears on the ZooKeeper machine and looks
> > like the first sign of trouble: "Unexpected exception causing
> > shutdown while sock still open". My guess is that our ZK cluster
> > has a failed quorum in some way, likely from SOLR-12727, but the
> > leader still thinks the other node is a follower. So I wonder what
> > the fix for this situation is. Is it to stop and restart the other
> > two ZooKeeper processes one by one?
> >
> > Thanks a bunch,
> > Stephen
> >
> > On Thu, Dec 13, 2018 at 8:10 AM Erick Erickson
> > <erickerick...@gmail.com> wrote:
> > >
> > > "Updates are disabled" means that at least two of your three ZK
> > > nodes are unreachable, which is worrisome.
> > >
> > > First:
> > > That error is coming from Solr, but whether it's a Solr issue or
> > > a ZK issue is ambiguous. It might be explained by the ZK nodes
> > > being under heavy load. Question: is this an external ZK
> > > ensemble? If so, what kind of load are those machines under? If
> > > you're using the embedded ZK, then stop-the-world GC could cause
> > > this.
> > >
> > > Second:
> > > Yes, increasing timeouts is one of the tricks, but tracking down
> > > why the response is so slow would be indicated in either case. I
> > > don't have much confidence in this solution in this case,
> > > though. Losing quorum indicates something else as the culprit.
> > >
> > > Third:
> > > Not quite. The whole point of specifying the ensemble is that
> > > the ZK client is smart enough to continue to function as long as
> > > quorum is present. So it is _not_ the case that all the ZK
> > > instances need to be reachable.
> > >
> > > On that topic, did you bounce your ZK servers or change them in
> > > any other way?
> > > There's a known ZK issue when you reconfigure live ZK ensembles;
> > > see https://issues.apache.org/jira/browse/SOLR-12727
> > >
> > > Fourth:
> > > See above.
> > >
> > > HTH,
> > > Erick
> > >
> > > On Wed, Dec 12, 2018 at 11:06 PM Stephen Lewis Bianamara
> > > <stephen.bianam...@gmail.com> wrote:
> > > >
> > > > Hello SOLR Community!
> > > >
> > > > I have a SOLR cluster which recently hit the error "Cannot
> > > > talk to ZooKeeper - Updates are disabled." (full error below).
> > > > I'm running Solr 6.6.2 and ZooKeeper 3.4.6. The first time
> > > > this happened, we replaced a node within our cluster. The
> > > > second time, we followed the advice in this post
> > > > <http://lucene.472066.n3.nabble.com/Cannot-talk-to-ZooKeeper-Updates-are-disabled-Solr-6-3-0-td4311582.html>
> > > > and just restarted the SOLR service, which resolved the issue.
> > > > I traced this down (at least the second time) to this message:
> > > > "WARN (zkCallback-4-thread-31-processing-n:<<IP>>:<<PORT>>_solr)
> > > > [ ] o.a.s.c.c.ConnectionManager Watcher
> > > > org.apache.solr.common.cloud.ConnectionManager@4586a480 name:
> > > > ZooKeeperConnection
> > > > Watcher:zookeeper-1.dns.domain.foo:1234,zookeeper-2.dns.domain.foo:1234,zookeeper-3.dns.domain.foo:1234
> > > > got event WatchedEvent state:Disconnected type:None path:null
> > > > path: null type: None".
> > > >
> > > > I'm wondering a few things. First, can you help me understand
> > > > what this error means in this context? Did the ZooKeepers
> > > > themselves experience an issue, or just the SOLR node trying
> > > > to talk to them? There was only one SOLR node affected, which
> > > > was the leader, so all writes stopped. Is there any way to
> > > > trace this to a specific resource limitation? Our ZK cluster
> > > > looks to be under rather low utilization, but perhaps I'm
> > > > missing something.
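One way to check whether the ZK nodes themselves saw trouble is the four-letter-word commands used elsewhere in this thread; the mntr output (tab-separated key/value lines) can be summarized per node to spot a member that answers ruok but never actually serves requests. A minimal sketch; the hostname in the comment is a placeholder:

```shell
# Summarize ZooKeeper 'mntr' output (tab-separated key/value lines) as
# "state packets_sent packets_received", to spot a node that answers
# 'ruok' but shows a near-zero packet count.
summarize_mntr() {
  awk -F'\t' '
    $1 == "zk_server_state"     { state = $2 }
    $1 == "zk_packets_sent"     { sent = $2 }
    $1 == "zk_packets_received" { recv = $2 }
    END { print state, sent, recv }'
}

# Demo against captured sample output; against a live node, use:
#   echo mntr | nc zookeeper-1.dns.domain.foo <<PORT>> | summarize_mntr
printf 'zk_server_state\tfollower\nzk_packets_sent\t12\nzk_packets_received\t34\n' \
  | summarize_mntr
# -> follower 12 34
```

Run against each ensemble member, a healthy follower should show packet counts in the same order of magnitude as its peers.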
> > > >
> > > > Second, what steps can I take to make the SOLR-ZooKeeper
> > > > interaction more fault tolerant in general? It seems to me we
> > > > might want to (a) increase the ZooKeeper syncLimit to provide
> > > > more flexibility within the ZK quorum, but this would only
> > > > help if the issue was truly on the ZK side. We could also
> > > > increase the tolerance on the SOLR side of things; would this
> > > > be controlled via zkClientTimeout? Any other thoughts?
> > > >
> > > > Third, is there a more fault-tolerant ZK connection string
> > > > than listing out all three ZK nodes? I *think*, and please
> > > > correct me if I'm wrong, that this requires all three ZK nodes
> > > > to report healthy for the SOLR node to consider the connection
> > > > healthy. Is that true? Or maybe listing all three only
> > > > requires that a 2/3 quorum be maintained. If connection health
> > > > is based on quorum, is moving a busy cluster to 5 nodes for a
> > > > 3/5 quorum desirable? Any other recommendations to make this
> > > > healthier?
> > > >
> > > > Fourth, is any of the fault tolerance in this area improved in
> > > > later SOLR/ZooKeeper versions?
> > > >
> > > > Finally, this looks to be connected to this Jira issue
> > > > <https://issues.apache.org/jira/browse/SOLR-3274>? The issue
> > > > doesn't appear to be very actionable, unfortunately, but it
> > > > seems people have wondered about this before. Are there any
> > > > plans in the works to allow for recovery? We found our ZK
> > > > cluster was healthy and restarting the SOLR service fixed the
> > > > issue, so it seems a reasonable feature to add auto-recovery
> > > > on the SOLR side when the ZK cluster returns to healthy. Would
> > > > you agree?
> > > >
> > > > Thanks for your help!!
> > > > Stephen
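For reference on the second and third questions in the quoted message: the Solr-side tolerance is the ZK client session timeout (zkClientTimeout, settable as ZK_CLIENT_TIMEOUT in solr.in.sh), and listing all ensemble members in ZK_HOST only requires a quorum of them (2 of 3) to be reachable, not every node. A sketch with illustrative values, the hostnames and port being placeholders taken from the logs in this thread:

```shell
# solr.in.sh sketch: illustrative values, not tuned recommendations.
# ZK_HOST lists all ensemble members; Solr's ZK client stays healthy
# as long as it can reach a quorum of them, not all three.
ZK_HOST="zookeeper-1.dns.domain.foo:1234,zookeeper-2.dns.domain.foo:1234,zookeeper-3.dns.domain.foo:1234"

# Session timeout (ms) for Solr's ZooKeeper client; maps to the
# zkClientTimeout setting in solr.xml.
ZK_CLIENT_TIMEOUT="30000"
```

Raising ZK_CLIENT_TIMEOUT papers over slow responses but not a lost quorum, which matches the caution in the quoted reply.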