Re: Replicas stuck in DOWN state

David Smiley Thu, 29 Apr 2021 15:18:16 -0700

I think SolrCloud ought to make a conditional state change based on the
ZooKeeper version of live_nodes.  Thus a request to change the node's state
would fail if the request included an old state version.  In this case the
client would re-fetch the state and then either retry or decide, based on
the new information, that the change is no longer necessary.
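
A minimal sketch of that idea, assuming a toy in-memory stand-in rather than
real Solr or ZooKeeper code. The names ToyClusterState, BadVersionError,
set_state_if_version and mark_down are all hypothetical and only illustrate
the compare-version-then-write pattern:

```python
from dataclasses import dataclass, field


class BadVersionError(Exception):
    """The write carried a version that no longer matches the current one."""


@dataclass
class ToyClusterState:
    """Toy stand-in for cluster state; not the Solr or ZooKeeper API."""
    live_nodes: set = field(default_factory=set)
    live_nodes_version: int = 0  # bumped on every live_nodes change
    node_state: dict = field(default_factory=dict)

    def node_up(self, node):
        self.live_nodes.add(node)
        self.live_nodes_version += 1

    def node_down(self, node):
        self.live_nodes.discard(node)
        self.live_nodes_version += 1

    def read_live_nodes(self):
        # Return a copy of the live set together with its version.
        return set(self.live_nodes), self.live_nodes_version

    def set_state_if_version(self, node, state, expected_version):
        # Conditional write: refuse if live_nodes changed since the caller looked.
        if expected_version != self.live_nodes_version:
            raise BadVersionError(
                f"expected {expected_version}, current {self.live_nodes_version}"
            )
        self.node_state[node] = state


def mark_down(cluster, node, observed_version, max_attempts=3):
    """Try to mark `node` DOWN; return False if the request was dropped."""
    version = observed_version
    for _ in range(max_attempts):
        try:
            cluster.set_state_if_version(node, "DOWN", version)
            return True
        except BadVersionError:
            # Stale view: re-fetch and reconsider with the new information.
            live, version = cluster.read_live_nodes()
            if node in live:
                return False  # node is back up, so DOWN is no longer wanted
    raise RuntimeError("gave up after repeated version conflicts")
```

In this sketch, a DOWN request created while the node was down but processed
after the node restarted is rejected on the version check and then dropped,
so the node's state is left alone. If the node is still absent from
live_nodes after the re-fetch, the request is retried with the fresh version.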


I noticed there's now a JIRA issue:
https://issues.apache.org/jira/browse/SOLR-15386 so I'll comment there as
well.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, Apr 28, 2021 at 5:15 AM Ilan Ginzburg <ilans...@gmail.com> wrote:

> That's possible. It does break some tests, but more importantly it will
> likely not cover all cases (for example, a node coming up during the
> massive ZK update).
>
> On Wed, Apr 28, 2021 at 11:08 AM Jan Høydahl <janhoy-apa...@cominvent.com>
> wrote:
>
>> Could the Overseer do a simple live_nodes check before executing the
>> DOWNNODE message? If the node has a more recent entry in live_nodes than
>> the DOWNNODE msg, then drop it? Not sure if this is at all possible.
>>
>> Jan
>>
>> On 28 Apr 2021, at 10:18, Ilan Ginzburg <ilans...@gmail.com> wrote:
>>
>> When a SolrCloud node goes down and back up in relatively rapid sequence
>> (not unusual in Public Cloud environments), it appears possible that the
>> DOWNNODE cluster state change message gets processed (or completes
>> processing) after the node has restarted.
>>
>> This delayed execution will then mark replicas on that node as DOWN, and
>> to my knowledge no existing mechanism will bring them back to ACTIVE
>> without manual intervention.
>>
>> There are simple code changes to significantly reduce that race window
>> but eliminating it completely seems challenging.
>>
>> Opinions?
>>
>> Ilan
>>
>>
>>
