Could the Overseer do a simple live_nodes check before executing the DOWNNODE message? If the node has a more recent entry in live_nodes than the DOWNNODE message, then drop the message. Not sure whether this is possible at all.
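To make the idea concrete, here is a rough sketch of the staleness check, not actual Solr code. `DownNodeGuard` and `shouldApply` are hypothetical names. I am also assuming that the Overseer can get an ordering value for both the queued message and the node's live_nodes entry, for example the ZooKeeper creation zxid of each znode (`Stat.getCzxid()`). Whether those values are readily available at that point in the Overseer is exactly the part I am unsure about.

```java
import java.util.OptionalLong;

/** Hypothetical helper (not part of Solr): decides whether a queued DOWNNODE message is stale. */
final class DownNodeGuard {
    private DownNodeGuard() {}

    /**
     * @param messageZxid  assumed: creation zxid of the queued DOWNNODE message
     * @param liveNodeZxid assumed: creation zxid of the node's current live_nodes
     *                     entry, or empty if the node has no live_nodes entry
     * @return true if the message should be executed, false if it should be dropped
     */
    static boolean shouldApply(long messageZxid, OptionalLong liveNodeZxid) {
        if (!liveNodeZxid.isPresent()) {
            // Node is not live: marking its replicas DOWN is correct.
            return true;
        }
        // A live_nodes entry created after the message means the node has
        // restarted since the message was enqueued, so the message is stale.
        return liveNodeZxid.getAsLong() < messageZxid;
    }
}
```

The check and the state update are not atomic, so a node that restarts between the two would still be marked DOWN; this would presumably narrow the race window rather than close it.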
Jan

> On 28 Apr 2021, at 10:18, Ilan Ginzburg <ilans...@gmail.com> wrote:
>
> When a SolrCloud node goes down and back up in relatively rapid sequence (not
> unusual in Public Cloud environments), it appears possible that the DOWNNODE
> cluster state change message gets processed (or completes processing) after
> the node has restarted.
>
> This delayed execution will then mark replicas on that node as DOWN, and to
> my knowledge no existing mechanism will bring them back to ACTIVE without
> manual intervention.
>
> There are simple code changes to significantly reduce that race window but
> eliminating it completely seems challenging.
>
> Opinions?
>
> Ilan