Re: Replicas stuck in DOWN state

Ilan Ginzburg Wed, 28 Apr 2021 02:15:03 -0700

That's possible. It does break some tests, but more importantly it will
likely not cover all cases (for example, a node coming up during the
massive ZK update).
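
To make the proposed check concrete, here is a minimal sketch in Python. It
is not Solr's Overseer code: DownNodeMessage, should_apply_down_node and the
integer "ordering stamps" are hypothetical names, and the sketch assumes that
a live_nodes entry and a queued DOWNNODE message can be ordered against each
other (for example by ZooKeeper creation zxid).

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DownNodeMessage:
    """Hypothetical stand-in for a queued DOWNNODE cluster state message."""
    node_name: str
    enqueued_at: int  # assumed ordering stamp of the queued message


def should_apply_down_node(msg: DownNodeMessage, live_nodes: Dict[str, int]) -> bool:
    """Return True if the DOWNNODE message should still be executed.

    live_nodes maps a node name to the ordering stamp of its current
    live_nodes entry (assumed comparable with msg.enqueued_at).
    """
    registered_at = live_nodes.get(msg.node_name)
    if registered_at is None:
        # Node is not live: marking its replicas DOWN is correct.
        return True
    # Node is live. Drop the message if the node registered after the
    # message was enqueued, i.e. the message predates the restart.
    return registered_at < msg.enqueued_at


if __name__ == "__main__":
    live = {"node1:8983_solr": 120}
    stale = DownNodeMessage("node1:8983_solr", enqueued_at=100)
    print(should_apply_down_node(stale, live))  # False: message is dropped
```

The check is made once, before the update is written, so a node that
registers after the check but before the update completes is still marked
DOWN.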


On Wed, 28 Apr 2021 at 11:08, Jan Høydahl <[email protected]> wrote:

> Could the Overseer do a simple live_nodes check before executing the
> DOWNNODE message? If the node has a more recent entry in live_nodes than
> the DOWNNODE message, then drop it? Not sure if this is possible at all.
>
> Jan
>
> On 28 Apr 2021 at 10:18, Ilan Ginzburg <[email protected]> wrote:
>
> When a SolrCloud node goes down and back up in relatively rapid sequence
> (not unusual in public cloud environments), it appears possible that the
> DOWNNODE cluster state change message gets processed (or completes
> processing) after the node has restarted.
>
> This delayed execution will then mark replicas on that node as DOWN, and
> to my knowledge no existing mechanism will bring them back to ACTIVE
> without manual intervention.
>
> There are simple code changes to significantly reduce that race window but
> eliminating it completely seems challenging.
>
> Opinions?
>
> Ilan
