That's possible. It does break some tests but, more importantly, it will likely not cover all cases (such as a node coming back up during the massive ZK update).

On Wed, Apr 28, 2021 at 11:08, Jan Høydahl <janhoy-apa...@cominvent.com> wrote:

> Could the Overseer do a simple live_nodes check before executing the
> DOWNNODE message? If the node has a more recent entry in live_nodes than
> the DOWNNODE msg, then drop it? Not sure if this is at all possible?
>
> Jan
>
> On Apr 28, 2021 at 10:18, Ilan Ginzburg <ilans...@gmail.com> wrote:
>
> When a SolrCloud node goes down and back up in relatively rapid sequence
> (not unusual in Public Cloud environments), it appears possible that the
> DOWNNODE cluster state change message gets processed (or completes
> processing) after the node has restarted.
>
> This delayed execution will then mark replicas on that node as DOWN, and
> to my knowledge no existing mechanism will bring them back to ACTIVE
> without manual intervention.
>
> There are simple code changes to significantly reduce that race window but
> eliminating it completely seems challenging.
>
> Opinions?
>
> Ilan
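
For reference, the live_nodes check Jan proposes could be sketched roughly as below. This is only an illustration in Java, not Solr's actual Overseer code: the class DownNodeGuard, its methods, and the idea of a comparable "order id" carried by both the live_nodes entry and the DOWNNODE message are all assumptions made for the example (in a real cluster, something like the ZooKeeper creation zxid of each znode might play that role).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Solr: decides whether a queued DOWNNODE
// message is stale because the node re-registered after the message was sent.
final class DownNodeGuard {
    // Assumed view of live_nodes: node name -> order id of its current entry.
    // A larger order id means the entry was created later.
    private final Map<String, Long> liveNodes = new HashMap<>();

    // Record that a node (re)registered itself in live_nodes.
    void nodeUp(String nodeName, long orderId) {
        liveNodes.put(nodeName, orderId);
    }

    // Record that a node's live_nodes entry disappeared.
    void nodeGone(String nodeName) {
        liveNodes.remove(nodeName);
    }

    // Returns true if the DOWNNODE message should be applied,
    // false if it should be dropped as stale.
    boolean shouldApplyDownNode(String nodeName, long messageOrderId) {
        Long liveOrderId = liveNodes.get(nodeName);
        if (liveOrderId == null) {
            // Node is not live at all: marking its replicas DOWN is correct.
            return true;
        }
        // Drop the message only when the live entry is newer than the message,
        // meaning the node restarted after DOWNNODE was enqueued.
        return liveOrderId <= messageOrderId;
    }
}
```

Note that the check runs once, before the state update starts, so a node that comes back while the update is in progress would still end up marked DOWN. That is the gap mentioned in the reply at the top of this message.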