Could the Overseer do a simple live_nodes check before executing the DOWNNODE 
message? If the node has a more recent entry in live_nodes than the DOWNNODE 
message, drop the message. Not sure whether this is at all possible.
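
A minimal sketch of such a check, under stated assumptions: it supposes that
both the live_nodes entry and the queued DOWNNODE message carry a comparable
creation order (for example the ZooKeeper creation zxid, `czxid`, of their
znodes). The class `DownNodeGuard`, the method `shouldApply`, and the map of
node name to creation zxid are hypothetical names for illustration, not
existing Solr APIs.

```java
import java.util.Map;

/** Hypothetical guard, not part of Solr: decides whether a DOWNNODE message is stale. */
public final class DownNodeGuard {

    private DownNodeGuard() {}

    /**
     * @param nodeName            node the DOWNNODE message refers to
     * @param messageZxid         assumed creation order of the queued DOWNNODE message
     * @param liveNodeCreatedZxid assumed creation order of each current live_nodes entry
     * @return true if the message should still be applied, false if it should be dropped
     */
    public static boolean shouldApply(String nodeName,
                                      long messageZxid,
                                      Map<String, Long> liveNodeCreatedZxid) {
        Long created = liveNodeCreatedZxid.get(nodeName);
        if (created == null) {
            // No live_nodes entry: the node is still down, so apply the message.
            return true;
        }
        // An entry created after the message was queued means the node has
        // restarted since then; the message is stale and should be dropped.
        return created < messageZxid;
    }
}
```

Note that this is a check-then-act sequence: the node could still restart
between the check and the cluster state write, so a check like this would
narrow the race window rather than close it.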

Jan

> On 28 Apr 2021, at 10:18, Ilan Ginzburg <ilans...@gmail.com> wrote:
> 
> When a SolrCloud node goes down and back up in relatively rapid sequence (not 
> unusual in Public Cloud environments), it appears possible that the DOWNNODE 
> cluster state change message gets processed (or completes processing) after 
> the node has restarted.
> 
> This delayed execution will then mark replicas on that node as DOWN, and to 
> my knowledge no existing mechanism will bring them back to ACTIVE without 
> manual intervention.
> 
> There are simple code changes to significantly reduce that race window but 
> eliminating it completely seems challenging.
> 
> Opinions?
> 
> Ilan
