[ 
https://issues.apache.org/jira/browse/SOLR-14210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17073235#comment-17073235
 ] 

Jan Høydahl commented on SOLR-14210:
------------------------------------

This is ready for broader review.

If you just call {{/api/node/health}}, there is no change. But if you call 
{{/api/node/health?requireHealthyCores=true}}, the new checks kick in:
 * Return 200 OK only if all replicas from clusterstate on this node are ACTIVE 
(or RECOVERY_FAILED) and each corresponding core actually exists in CoreContainer
 * If OK, an extra {{"message": "All cores are healthy"}} is added to the 
response JSON
 * If one of the replicas for an *active* shard on the node is DOWN or 
RECOVERING, then 503 is returned with error text {{"error": "Replica(s) [foo, 
bar] are currently initializing or recovering"}}. We do not care about inactive 
shards.
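
To make the rules above concrete, here is a rough sketch of the decision logic in plain Java. It is not the code from the PR: {{NodeHealthSketch}}, {{ReplicaInfo}}, {{findUnhealthy}} and {{statusFor}} are hypothetical stand-ins that do not use any Solr classes, and treating a replica whose core is missing from CoreContainer as unhealthy is an assumption based on the first bullet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical, Solr-free model of the checks described above.
class NodeHealthSketch {
    enum State { ACTIVE, DOWN, RECOVERING, RECOVERY_FAILED }

    // Simplified stand-in for a replica entry in clusterstate (assumed shape).
    static final class ReplicaInfo {
        final String coreName;
        final State state;
        final boolean shardActive;

        ReplicaInfo(String coreName, State state, boolean shardActive) {
            this.coreName = coreName;
            this.state = state;
            this.shardActive = shardActive;
        }
    }

    // Returns the core names on this node that should make the check fail.
    static List<String> findUnhealthy(List<ReplicaInfo> replicasOnNode,
                                      Set<String> loadedCores) {
        List<String> unhealthy = new ArrayList<>();
        for (ReplicaInfo r : replicasOnNode) {
            if (!r.shardActive) {
                continue; // inactive shards are ignored
            }
            boolean notReady = r.state == State.DOWN || r.state == State.RECOVERING;
            // Assumption: a core missing from the container also counts as unhealthy.
            boolean missingCore = !loadedCores.contains(r.coreName);
            if (notReady || missingCore) {
                unhealthy.add(r.coreName);
            }
        }
        return unhealthy;
    }

    // 200 when nothing is unhealthy, 503 otherwise.
    static int statusFor(List<String> unhealthy) {
        return unhealthy.isEmpty() ? 200 : 503;
    }
}
```

In this sketch, ACTIVE and RECOVERY_FAILED replicas pass as long as their core is loaded, which mirrors the first bullet. A caller would pass the replicas that clusterstate assigns to the node together with the names of the cores actually loaded, then map the result to the HTTP status.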

The extra checks add some CPU load for looping through clusterstate 
objects, and this has not been benchmarked. It should only be a theoretical 
issue for very large clusters with thousands of shards, and only if the health 
endpoint is hit very frequently. Reviewers are encouraged to [give feedback on 
the findUnhealthyCores() 
method|https://github.com/apache/lucene-solr/pull/1387/files#diff-9f25e225f70fa66d2b27079a9511c0daR124-R138].

I'm targeting 8.6 with this.

> Introduce Node-level status handler for replicas
> ------------------------------------------------
>
>                 Key: SOLR-14210
>                 URL: https://issues.apache.org/jira/browse/SOLR-14210
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 8.5
>            Reporter: Houston Putman
>            Assignee: Jan Høydahl
>            Priority: Major
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> h2. Background
> As was brought up in SOLR-13055, in order to run Solr in a more cloud-native 
> way, we need some additional features around node-level healthchecks.
> {quote}Like in Kubernetes we need 'liveliness' and 'readiness' probe 
> explained in 
> [https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/n]
>  determine if a node is live and ready to serve live traffic.
> {quote}
>  
> However there are issues around Kubernetes managing its own rolling 
> restarts. With the current healthcheck setup, it's easy to envision a 
> scenario in which Solr reports itself as "healthy" when all of its replicas 
> are actually recovering. Therefore Kubernetes, seeing a healthy pod, would 
> then go and restart the next Solr node. This can happen until all replicas 
> are "recovering" and none are healthy. (Maybe the last one restarted will be 
> "down", but still there are no "active" replicas.)
> h2. Proposal
> I propose we make an additional healthcheck handler that returns whether all 
> replicas hosted by that Solr node are healthy and "active". That way we will 
> be able to use the [default Kubernetes rolling restart 
> logic|https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#update-strategies]
>  with Solr.
> To add on to [Jan's point 
> here|https://issues.apache.org/jira/browse/SOLR-13055?focusedCommentId=16716559&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16716559],
>  this handler should be more friendly for other Content-Types and should use 
> better HTTP response statuses.



