The ZK ensemble appears to be OK. It is the Solr-related stuff that is borked. 
There are 110 items in /overseer/collection-queue-work/, which doesn’t seem 
healthy.
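
For anyone who wants to check the same thing over HTTP rather than by browsing ZooKeeper, here is a minimal Python sketch of the two Collections API calls mentioned in this thread (OVERSEERSTATUS and ADDROLE). The host, port, and node name are hypothetical placeholders. The response field names are the ones I recall from the Collections API documentation, so verify them against your Solr version. Since OVERSEERSTATUS timed out on this cluster, the fetch uses an explicit timeout and returns None on any failure.

```python
import json
import urllib.parse
import urllib.request


def collections_api_url(base_url, action, **params):
    """Build a Solr Collections API URL for the given action."""
    query = {"action": action, "wt": "json"}
    query.update(params)
    return base_url.rstrip("/") + "/admin/collections?" + urllib.parse.urlencode(query)


def summarize_overseer_status(payload):
    """Pick the leader and queue sizes out of an OVERSEERSTATUS response dict.

    The key names are assumed from the Collections API docs; missing keys
    come back as None instead of raising.
    """
    keys = (
        "leader",
        "overseer_queue_size",
        "overseer_work_queue_size",
        "overseer_collection_queue_size",
    )
    return {key: payload.get(key) for key in keys}


def fetch_overseer_status(base_url, timeout=30):
    """Return the summary dict, or None if the request fails or times out."""
    url = collections_api_url(base_url, "OVERSEERSTATUS")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return summarize_overseer_status(json.load(resp))
    except (OSError, ValueError):
        # OSError covers URLError, connection errors, and socket timeouts;
        # ValueError covers a body that is not valid JSON.
        return None


if __name__ == "__main__":
    base = "http://localhost:8983/solr"  # hypothetical Solr node
    print(fetch_overseer_status(base))
    # URL for nominating a preferred overseer (node name is a placeholder):
    print(collections_api_url(base, "ADDROLE", role="overseer", node="host1:8983_solr"))
```

A None result, or a missing leader in the summary, would be consistent with the no-overseer state described further down the thread, but it does not say anything about the root cause.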

If it is really hosed, I’ll shut down all the nodes, clean out the files in 
Zookeeper and start over.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 22, 2019, at 8:53 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> Good luck. This assumes that your ZK ensemble is healthy, of course...
> 
>> On May 22, 2019, at 8:23 AM, Walter Underwood <wun...@wunderwood.org> wrote:
>> 
>> Thanks, we’ll try that. Bouncing one Solr node won’t fix it; we already 
>> did a rolling restart yesterday.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On May 22, 2019, at 8:21 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>>> 
>>> Walter:
>>> 
>>> I have no idea what the root cause is here; this really shouldn’t happen. 
>>> But the Overseer role (and I’m assuming you’re talking about Solr’s Overseer) is 
>>> assigned similarly to a shard leader: the same election process happens. 
>>> All the election nodes are ephemeral ZK nodes.
>>> 
>>> Solr’s Overseer is _not_ fixed to a particular Solr node, although you can 
>>> assign a preferred role of Overseer in those (rare) cases where there are 
>>> so many state changes for ZooKeeper that it’s advisable for them to run on 
>>> a dedicated machine.
>>> 
>>> Overseer assignment is automatic. This should work:
>>> 1> shut everything down, Solr and ZooKeeper
>>> 2> start your ZooKeepers and let them all get in sync with each other
>>> 3> start your Solr nodes. It might take 3 minutes or more to bring up the 
>>> first Solr node; there’s up to a 180-second delay if leaders are not 
>>> easily findable.
>>> 
>>> That should cause Solr to elect an overseer, probably the first Solr node 
>>> to come up.
>>> 
>>> It _might_ work to bounce just one Solr node: seeing the Overseer election 
>>> queue empty, it may elect itself. That said, the Overseer election queue 
>>> won’t contain the rest of the Solr nodes as it should, so if that works 
>>> you should probably bounce the rest of the Solr servers one by one to 
>>> restore the proper election queue.
>>> 
>>> Not a fix for the root cause, of course, but it should get things operating 
>>> again. I’ll add that, to my recollection, I haven’t seen this happen in 
>>> the field.
>>> 
>>> Best,
>>> Erick
>>> 
>>>> On May 21, 2019, at 9:04 PM, Will Martin <wmar...@urgent.ly> wrote:
>>>> 
>>>> Worked with Fusion and Zookeeper at GSA for 18 months: admin role.
>>>> 
>>>> Before blowing it away, you could try:
>>>> 
>>>> - identify a candidate node, with a snapshot you think is old enough
>>>> to be robust.
>>>> - clean the data on the other zk nodes.
>>>> - bring up the chosen node and wait for it to settle [wish I could remember
>>>> why I called what I saw that]
>>>> - bring up the other nodes one at a time. let each one fully sync as a
>>>> follower of the new leader.
>>>> - they should each in turn request the snapshot from the lead. then you
>>>> have
>>>> 
>>>> : align your collections with the ensemble. and for the life of me i can't
>>>> remember there being anything particularly tricky about that with fusion ,
>>>> which means I can't remember what I did... or have it doc'd at home. ;-)
>>>> 
>>>> 
>>>> Will Martin
>>>> DEVOPS ENGINEER
>>>> 540.454.9565
>>>> 
>>>> 8609 WESTWOOD CENTER DR, SUITE 475
>>>> VIENNA, VA 22182
>>>> geturgently.com
>>>> 
>>>> 
>>>> On Tue, May 21, 2019 at 11:40 PM Walter Underwood <wun...@wunderwood.org>
>>>> wrote:
>>>> 
>>>>> Yes, please. I have the logs from each of the Zookeepers.
>>>>> 
>>>>> We are running 3.4.12.
>>>>> 
>>>>> wunder
>>>>> Walter Underwood
>>>>> wun...@wunderwood.org
>>>>> http://observer.wunderwood.org/  (my blog)
>>>>> 
>>>>>> On May 21, 2019, at 6:49 PM, Will Martin <wmar...@urgent.ly> wrote:
>>>>>> 
>>>>>> Walter. Can I cross-post to zk-dev?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Will Martin
>>>>>> DEVOPS ENGINEER
>>>>>> 540.454.9565
>>>>>> 
>>>>>> 8609 WESTWOOD CENTER DR, SUITE 475
>>>>>> VIENNA, VA 22182
>>>>>> geturgently.com <http://geturgently.com/>
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On May 21, 2019, at 9:26 PM, Will Martin <wmar...@urgent.ly> wrote:
>>>>>>> 
>>>>>>> +1
>>>>>>> 
>>>>>>> Will Martin
>>>>>>> DEVOPS ENGINEER
>>>>>>> 540.454.9565
>>>>>>> 
>>>>>>> 8609 WESTWOOD CENTER DR, SUITE 475
>>>>>>> VIENNA, VA 22182
>>>>>>> geturgently.com <http://geturgently.com/>
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, May 21, 2019 at 7:39 PM Walter Underwood <wun...@wunderwood.org> wrote:
>>>>>>> ADDROLE times out after 180 seconds. This seems to be an unrecoverable
>>>>>>> state for the cluster, so that is a pretty serious bug.
>>>>>>> 
>>>>>>> wunder
>>>>>>> Walter Underwood
>>>>>>> wun...@wunderwood.org
>>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>>> 
>>>>>>>> On May 21, 2019, at 4:10 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>>>>>>>> 
>>>>>>>> We have a 6.6.2 cluster in prod that appears to have no overseer. In
>>>>>>>> /overseer_elect on ZK, there is an election folder, but no leader
>>>>>>>> document. An OVERSEERSTATUS request fails with a timeout.
>>>>>>>> 
>>>>>>>> I’m going to try ADDROLE, but I’d be delighted to hear any other
>>>>>>>> ideas. We’ve diverted all the traffic to the backing cluster, so we can
>>>>>>>> blow this one away and rebuild.
>>>>>>>> 
>>>>>>>> Looking at the Zookeeper logs, I see a few instances of network
>>>>>>>> failures across all three nodes.
>>>>>>>> 
>>>>>>>> wunder
>>>>>>>> Walter Underwood
>>>>>>>> wun...@wunderwood.org
>>>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>> 
>> 
> 
