When a SolrCloud node goes down and comes back up in quick succession (not unusual in public cloud environments), it appears possible for the DOWNNODE cluster state change message to be processed (or to complete processing) only after the node has already restarted.
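To make the ordering concrete, here is a minimal sketch of the race, using a toy cluster-state model (the class and method names are hypothetical for illustration, not Solr's actual Overseer code): the DOWNNODE message is enqueued while the node is down, but only gets applied after the node has restarted and re-registered its replicas as ACTIVE.

```python
from enum import Enum

class ReplicaState(Enum):
    ACTIVE = "active"
    DOWN = "down"

class ClusterState:
    """Toy cluster-state model (hypothetical, not Solr's actual classes)."""
    def __init__(self):
        self.replica_state = {}   # (node, replica) -> ReplicaState
        self.message_queue = []   # pending state-change messages

    def enqueue_downnode(self, node):
        # Record that `node` went down; processing happens asynchronously.
        self.message_queue.append(("DOWNNODE", node))

    def node_restarted(self, node, replicas):
        # The restarted node re-registers its replicas as ACTIVE.
        for r in replicas:
            self.replica_state[(node, r)] = ReplicaState.ACTIVE

    def process_messages(self):
        # Delayed processing: in this scenario it runs AFTER the restart.
        while self.message_queue:
            op, node = self.message_queue.pop(0)
            if op == "DOWNNODE":
                for key in self.replica_state:
                    if key[0] == node:
                        self.replica_state[key] = ReplicaState.DOWN

state = ClusterState()
state.node_restarted("node1", ["shard1_replica1"])  # healthy baseline
state.enqueue_downnode("node1")                     # node goes down
state.node_restarted("node1", ["shard1_replica1"])  # quick restart, replica ACTIVE again
state.process_messages()                            # stale DOWNNODE applied last
print(state.replica_state[("node1", "shard1_replica1")])  # ReplicaState.DOWN
```

The replica ends up DOWN even though its node is live, which is exactly the stuck state described above.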
This delayed execution then marks the replicas on that node as DOWN, and to my knowledge no existing mechanism will bring them back to ACTIVE without manual intervention. There are simple code changes that would significantly reduce this race window, but eliminating it completely seems challenging.

Opinions?

Ilan