Hello Solr Community,

We recently experienced a load incident on a 10-node SolrCloud cluster and
are trying to understand whether Solr provides a way to stop routing
traffic to replicas or nodes that are already under stress before they
destabilise the cluster.

Environment

   - Solr 9.6.1 / Lucene 9.10.0
   - Java 17, G1GC, MaxGCPauseMillis=250
   - Heap: 12 GB (-Xms12g -Xmx12g)
   - 10 Solr nodes on GCP
   - 3-node ZooKeeper ensemble
   - allowPartialResults=true

Node hardware

   - 16 vCPU
   - 48 GB RAM
   - No swap
   - Linux (RHEL 9)

Collections

search_collection_a

   - 63 shards, 66 replicas
   - ~173 GB index
   - ~96M documents

search_collection_b

   - 63 shards, 63 replicas
   - ~205 GB index
   - ~95M documents

Each node hosts roughly 5-8 shards from each collection.

Incident Summary

During a period of elevated query load, several nodes became unstable,
exhibiting increased latency, GC pressure, and thread pool saturation.

We enabled the Solr CPU Circuit Breaker with a threshold of 85%, expecting
it to shed load from overloaded nodes. Instead:

   - The cluster began returning a large number of HTTP 429 responses.
   - CPU utilisation observed at the OS level remained well below the
     configured threshold.
   - We eventually disabled the Circuit Breaker because it appeared to be
     worsening availability.
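
To compare the CPU figures the JVM reports with what OS tools show, a
minimal Java sketch along these lines could be run on an affected node. It
uses only standard JDK management beans (Java 14+ for getCpuLoad). The
class name CpuView is hypothetical, and whether the CPU Circuit Breaker
evaluates any of these figures is an assumption we have not verified.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

final class CpuView {

    /**
     * Returns {systemCpuPercent, processCpuPercent, loadAverage1m}.
     * A negative entry means the JVM could not provide that value.
     */
    static double[] sample() {
        OperatingSystemMXBean base = ManagementFactory.getOperatingSystemMXBean();
        double loadAvg = base.getSystemLoadAverage(); // 1-minute load average, not a percentage
        double systemCpu = -1.0;
        double processCpu = -1.0;
        if (base instanceof com.sun.management.OperatingSystemMXBean) {
            com.sun.management.OperatingSystemMXBean os =
                    (com.sun.management.OperatingSystemMXBean) base;
            systemCpu = os.getCpuLoad() * 100.0;         // whole-machine CPU utilisation
            processCpu = os.getProcessCpuLoad() * 100.0; // CPU used by this JVM only
        }
        return new double[] {systemCpu, processCpu, loadAvg};
    }

    public static void main(String[] args) throws InterruptedException {
        // Print one sample per second so the values can be compared with top or vmstat.
        for (int i = 0; i < 5; i++) {
            double[] s = sample();
            System.out.printf("systemCpu=%.1f%% processCpu=%.1f%% loadAvg1m=%.2f%n",
                    s[0], s[1], s[2]);
            Thread.sleep(1000);
        }
    }
}
```

On a 16 vCPU node the 1-minute load average can sit far above or below
the CPU utilisation percentage, so a threshold applied to one of these
figures may not match what is seen in the other.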

Our working theory is that GC pressure and request backlog may have been
the primary bottlenecks rather than raw CPU utilisation, but we're trying
to understand whether Solr has built-in mechanisms for handling this
scenario.
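
One way to test the GC-pressure theory is to measure what fraction of wall
clock time the collectors consume. The sketch below is illustrative only:
GcOverheadProbe and its method names are hypothetical, it reads the
standard GarbageCollectorMXBean counters, and it reports on the JVM it runs
in, so it would need to run inside the Solr process (or be adapted to a
remote JMX connection) to say anything about Solr itself.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcOverheadProbe {

    /** Sums the accumulated collection time, in milliseconds, across all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Percentage of the wall-clock interval that was spent in collections. */
    static double overheadPercent(long gcMillisDelta, long wallMillisDelta) {
        if (wallMillisDelta <= 0) {
            return 0.0;
        }
        return 100.0 * gcMillisDelta / wallMillisDelta;
    }

    public static void main(String[] args) throws InterruptedException {
        long gcBefore = totalGcMillis();
        long wallBefore = System.nanoTime();
        Thread.sleep(5000); // sampling window
        long gcDelta = totalGcMillis() - gcBefore;
        long wallDelta = (System.nanoTime() - wallBefore) / 1_000_000;
        System.out.printf("GC overhead over %d ms: %.1f%%%n",
                wallDelta, overheadPercent(gcDelta, wallDelta));
    }
}
```

A sustained double-digit overhead while OS-level CPU stays moderate would
be consistent with GC, rather than raw CPU, being the bottleneck.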

Questions

   1. Replica/node exclusion based on health

      Is there a built-in mechanism in SolrCloud 9.x to temporarily avoid
      routing requests to replicas that are degraded (for example: high
      latency, GC pressure, thread pool saturation, or slow responses)?

   2. Latency-aware or load-aware replica selection

      Can replica selection be influenced by observed response latency or
      node load so that healthy replicas are preferred and slow replicas
      are deprioritised?

   3. Circuit Breaker behaviour

      Has anyone observed CPU Circuit Breakers triggering while node-level
      CPU utilisation appears to remain below the configured threshold? Are
      there other Circuit Breakers in Solr 9.x, or external libraries, that
      are generally more effective than CPU thresholds at detecting
      overload caused by GC pressure or resource contention?

Any guidance, documentation references, JIRA issues, or production
experience would be greatly appreciated.

Thank you,
Harshit Sharma
