When a SolrCloud node goes down and comes back up in quick succession
(not unusual in public cloud environments), it appears possible that the
DOWNNODE cluster state change message is processed (or completes
processing) after the node has restarted.


This delayed processing then marks the replicas on that node as DOWN, and
to my knowledge no existing mechanism will bring them back to ACTIVE
without manual intervention.


There are simple code changes that would significantly reduce that race
window, but eliminating it completely seems challenging.
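One way to shrink (though not close) the window is to make the DOWNNODE message stale-detectable. The sketch below is purely illustrative, not actual Solr code: all names (DownNodeGuard, nodeUp, processDownNode, the epoch scheme) are hypothetical. The idea is that each node restart bumps a per-node epoch, the DOWNNODE message carries the epoch observed when the node went down, and a message whose epoch is older than the node's current epoch is dropped instead of marking replicas DOWN.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of an epoch guard for stale DOWNNODE messages.
public class DownNodeGuard {
    // Incremented each time the node (re)joins the cluster.
    private final Map<String, Integer> liveEpoch = new ConcurrentHashMap<>();
    // Simplified stand-in for per-node replica state.
    private final Map<String, String> replicaState = new ConcurrentHashMap<>();

    public void nodeUp(String node) {
        liveEpoch.merge(node, 1, Integer::sum); // restart bumps the epoch
        replicaState.put(node, "ACTIVE");
    }

    public int currentEpoch(String node) {
        return liveEpoch.getOrDefault(node, 0);
    }

    // The DOWNNODE message carries the epoch at which the node was seen down.
    public void processDownNode(String node, int epochWhenDown) {
        if (epochWhenDown < currentEpoch(node)) {
            return; // node restarted since the message was enqueued: ignore it
        }
        replicaState.put(node, "DOWN");
    }

    public String state(String node) {
        return replicaState.get(node);
    }

    public static void main(String[] args) {
        DownNodeGuard g = new DownNodeGuard();
        g.nodeUp("node1");                           // epoch 1
        int epochAtFailure = g.currentEpoch("node1");
        g.nodeUp("node1");                           // node restarted: epoch 2
        g.processDownNode("node1", epochAtFailure);  // stale message, dropped
        System.out.println(g.state("node1"));        // replicas stay ACTIVE
    }
}
```

This only helps when the restart is observed before the stale message is processed; if the two interleave the other way, the race remains, which is why the window shrinks rather than disappears.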


Opinions?


Ilan
