joshsouza opened a new issue, #471:
URL: https://github.com/apache/solr-operator/issues/471

   We are just starting out with the Solr Operator, and intend to move 
several large Solr clusters over to the operator for their 
management. In our initial tests, we've encountered a situation that seems 
incredibly risky, and we would like to understand whether a reasonable 
solution for it is already in place, or whether there are good suggestions 
for improving reliability around it.
   
   Setting `SolrCloud.Spec.updateStrategy` to `Managed` 
(https://apache.github.io/solr-operator/docs/solr-cloud/solr-cloud-crd.html#update-strategy)
 means that the operator will never take an action that risks cluster stability 
(shutting down a pod that would result in no live replicas, etc.). This is 
fantastic, but it only covers actions that the operator itself takes 
(StatefulSet updates, etc.), and doesn't appear to come into play during 
normal _Kubernetes_ operations, such as node rotations.
   
   On an EKS cluster, when a node group is refreshed, the nodes are marked for 
termination within their autoscaling groups, and their pods are then 
drained from the nodes that are to be shut down and rescheduled onto valid 
nodes. The normal k8s way to prevent service disruptions during this type of 
event is to use Pod Disruption Budgets, which prevent the draining 
nodes from stopping their pods if doing so would cause a disruption. This leverages 
Readiness/Liveness status to determine when a disruption would occur, and is 
generally a reliable way of preventing applications from becoming unavailable.
   
   With Solr, there is another level of abstraction: a Solr pod being 
"ready" doesn't mean that all of the cores on that node are 
available/replicated. A pod disruption budget, which only monitors 
that readiness state, may therefore conclude that it is safe to delete an 
arbitrary pod in the cluster, without the logic (which the Operator has) to check 
whether shutting down that pod would cause a disruption.
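
   To make the gap concrete, here is a minimal sketch of the two views. It is 
an illustration only: the `Replica` model, the function names, and the example 
cluster are all hypothetical, and this is not the operator's actual logic. The 
PDB check is modeled simply as "enough Ready pods remain after the eviction".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """Hypothetical, simplified view of one Solr replica."""
    collection: str
    shard: str
    pod: str
    active: bool  # True if the replica is active (not recovering or down)


def pdb_allows_eviction(ready_pods: int, min_available: int) -> bool:
    """Readiness-only view: allow eviction if enough Ready pods would remain."""
    return ready_pods - 1 >= min_available


def shards_left_without_active_replica(replicas, pod):
    """Replica-aware view: shards hosted on `pod` that would have no
    active replica on any other pod if `pod` were shut down."""
    on_pod = {(r.collection, r.shard) for r in replicas if r.pod == pod}
    surviving = {
        (r.collection, r.shard) for r in replicas if r.active and r.pod != pod
    }
    return sorted(on_pod - surviving)


# Example cluster (made up): three Ready pods, one replica still recovering.
replicas = [
    Replica("books", "shard1", "solr-0", active=True),
    Replica("books", "shard1", "solr-1", active=False),  # recovering
    Replica("books", "shard2", "solr-1", active=True),
    Replica("books", "shard2", "solr-2", active=True),
]

# All three pods are Ready, so a budget of minAvailable=2 permits an eviction.
print(pdb_allows_eviction(ready_pods=3, min_available=2))
# Yet evicting solr-0 would leave books/shard1 with no active replica.
print(shards_left_without_active_replica(replicas, "solr-0"))
```

   In this made-up cluster the readiness-only check permits evicting any one 
pod, while the replica-aware check shows that losing `solr-0` would take 
`books/shard1` offline.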
   
   In a large cluster, nodes/pods that go down may take time to 
recover, and without a PDB, multiple pods may go down 
simultaneously. We therefore perceive a risk that Solr's availability could be 
compromised should a node rotation or other form of pod deletion occur 
outside the Operator's purview.
   
   So, my question is:
   What methodology is recommended for eliminating this risk? Are there 
configurations we've overlooked that would reduce it? Has the community 
simply accepted this limitation and found ways to reduce the odds of being 
impacted? (Are we maybe overreacting, and this isn't actually a risk?)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

