HoustonPutman commented on code in PR #4432:
URL: https://github.com/apache/solr/pull/4432#discussion_r3351062300


##########
solr/core/src/java/org/apache/solr/util/circuitbreaker/CircuitBreakerRegistry.java:
##########
@@ -210,6 +226,85 @@ public List<CircuitBreaker> checkTripped(SolrRequestType 
requestType) {
     return triggeredCircuitBreakers;
   }
 
+  /**
+   * Check globally-registered (process-wide) circuit breakers for the given 
request type. Warn-only
+   * breakers are excluded from the result so that callers performing 
admission control do not act
+   * on them.
+   *
+   * <p>Filter-level callers (e.g. {@link 
org.apache.solr.servlet.SolrQoSFilter}) do not have
+   * per-core context and therefore consult only the static global map.
+   *
+   * @return null if no global breakers are registered for {@code requestType} 
or none are tripped;
+   *     otherwise a non-empty list of tripped breakers
+   */
+  public static List<CircuitBreaker> checkTrippedGlobal(SolrRequestType 
requestType) {
+    return selectTripped(globalCircuitBreakerMap.get(requestType));
+  }
+
+  /**
+   * Per-core variant of {@link #checkTrippedGlobal(SolrRequestType)} that 
consults only this
+   * registry's local breakers. Warn-only breakers are excluded.
+   */
+  public List<CircuitBreaker> checkTrippedLocal(SolrRequestType requestType) {
+    List<CircuitBreaker> snapshot;
+    synchronized (circuitBreakerMap) {
+      List<CircuitBreaker> breakersOfType = circuitBreakerMap.get(requestType);

Review Comment:
   Why not just change `circuitBreakerMap` to be a `ConcurrentHashMap`?
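
   As a rough sketch of what that could look like (hypothetical stand-in names such as `LockFreeRegistrySketch`, `Breaker`, and `RequestType`, not the classes in this PR): a `ConcurrentHashMap` alone only makes the map safe, so this assumes the per-type lists also become `CopyOnWriteArrayList`, which lets readers iterate a stable snapshot without the `synchronized` block or the defensive copy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch; names are illustrative and not part of the PR.
class LockFreeRegistrySketch {
  enum RequestType { QUERY, UPDATE }

  interface Breaker {
    boolean isTripped();

    boolean isWarnOnly();

    // Small factory so the sketch can be exercised without real breakers.
    static Breaker of(boolean tripped, boolean warnOnly) {
      return new Breaker() {
        public boolean isTripped() { return tripped; }

        public boolean isWarnOnly() { return warnOnly; }
      };
    }
  }

  // Concurrent map plus copy-on-write lists: reads never need a lock.
  private final Map<RequestType, List<Breaker>> breakers = new ConcurrentHashMap<>();

  void register(RequestType type, Breaker breaker) {
    breakers.computeIfAbsent(type, t -> new CopyOnWriteArrayList<>()).add(breaker);
  }

  // Returns null when nothing is tripped, mirroring the contract in the diff.
  List<Breaker> checkTrippedLocal(RequestType type) {
    List<Breaker> ofType = breakers.get(type);
    if (ofType == null) {
      return null;
    }
    List<Breaker> tripped = null;
    // CopyOnWriteArrayList iterators see a stable snapshot of the list.
    for (Breaker b : ofType) {
      if (b.isTripped() && !b.isWarnOnly()) {
        if (tripped == null) {
          tripped = new ArrayList<>();
        }
        tripped.add(b);
      }
    }
    return tripped;
  }
}
```

   The trade-off is that every registration copies the list, which seems acceptable if breakers are registered rarely and checked on every request.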



##########
solr/core/src/java/org/apache/solr/util/circuitbreaker/CircuitBreakerRegistry.java:
##########
@@ -210,6 +226,85 @@ public List<CircuitBreaker> checkTripped(SolrRequestType 
requestType) {
     return triggeredCircuitBreakers;
   }
 
+  /**
+   * Check globally-registered (process-wide) circuit breakers for the given 
request type. Warn-only
+   * breakers are excluded from the result so that callers performing 
admission control do not act
+   * on them.
+   *
+   * <p>Filter-level callers (e.g. {@link 
org.apache.solr.servlet.SolrQoSFilter}) do not have
+   * per-core context and therefore consult only the static global map.
+   *
+   * @return null if no global breakers are registered for {@code requestType} 
or none are tripped;
+   *     otherwise a non-empty list of tripped breakers
+   */
+  public static List<CircuitBreaker> checkTrippedGlobal(SolrRequestType 
requestType) {
+    return selectTripped(globalCircuitBreakerMap.get(requestType));
+  }
+
+  /**
+   * Per-core variant of {@link #checkTrippedGlobal(SolrRequestType)} that 
consults only this
+   * registry's local breakers. Warn-only breakers are excluded.
+   */
+  public List<CircuitBreaker> checkTrippedLocal(SolrRequestType requestType) {
+    List<CircuitBreaker> snapshot;
+    synchronized (circuitBreakerMap) {
+      List<CircuitBreaker> breakersOfType = circuitBreakerMap.get(requestType);
+      snapshot = (breakersOfType == null) ? null : new 
ArrayList<>(breakersOfType);
+    }
+    return selectTripped(snapshot);
+  }
+
+  private static List<CircuitBreaker> selectTripped(List<CircuitBreaker> 
breakers) {
+    if (breakers == null || breakers.isEmpty()) {
+      return null;
+    }
+    List<CircuitBreaker> tripped = null;
+    for (CircuitBreaker cb : breakers) {
+      if (cb.isTripped() && !cb.isWarnOnly()) {
+        if (tripped == null) {
+          tripped = new ArrayList<>();
+        }
+        tripped.add(cb);
+      }
+    }
+    return tripped;
+  }
+
+  /**
+   * Filter-level admission control helper: combine globally-registered and 
every
+   * per-core-registered circuit breaker for {@code requestType}. Returns null 
if nothing is
+   * tripped, otherwise a non-empty list of tripped breakers.
+   *
+   * <p>Iteration is over all cores in {@code cc}. If any core has a tripped 
breaker for the type,
+   * the request is treated as tripped cluster-wide; this is intentionally 
conservative since the
+   * filter does not yet know which core a request will resolve to.

Review Comment:
   Doesn't this make the core-scoped CBs partially global at this point? If the 
filter doesn't know yet which core a request will resolve to, why doesn't it 
wait to check the CBs then?



##########
solr/core/src/java/org/apache/solr/util/circuitbreaker/GcOverheadCircuitBreaker.java:
##########
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.util.circuitbreaker;
+
+import java.lang.management.GarbageCollectorMXBean;
+import java.lang.management.ManagementFactory;
+import java.util.List;
+import java.util.Locale;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicReference;
+import org.apache.solr.common.util.EnvUtils;
+
+/**
+ * Trips when the JVM is spending more than a configured percentage of 
wall-clock time in garbage
+ * collection over a sliding window.
+ *
+ * <p>Complementary to {@link MemoryCircuitBreaker}: that breaker fires when 
post-GC live data is
+ * exhausting the heap; this one fires when GC is keeping up (live data may 
even be small) but
+ * consuming so much CPU that the application is starving. Both conditions 
usually precede an OOM
+ * but each catches the other's blind spot.
+ *
+ * <p>The percentage is computed as
+ *
+ * <pre>{@code
+ * sum(GarbageCollectorMXBean.getCollectionTime()) over window
+ *   ─────────────────────────────────────────────────────────  × 100
+ *                       wall-clock window
+ * }</pre>
+ *
+ * <p>The window length defaults to {@link #DEFAULT_WINDOW_SECONDS} seconds 
and is configurable via
+ * {@value #SYSPROP_WINDOW_SECONDS}. Within one window the breaker reports the 
GC ratio over the
+ * time elapsed since the anchor sample; once the window length is exceeded, 
the anchor slides
+ * forward to the most recent sample so the ratio reflects only recent 
behavior.
+ *
+ * <p>{@link #isTripped()} is rate-limited by the same {@code 
TtlSampledMetric} used by the
+ * load-average and CPU breakers ({@value 
CircuitBreakerRegistry#SYSPROP_SAMPLE_TTL_MS}), so
+ * concurrent admission-control callers share one ratio computation per cache 
window.
+ */
+public class GcOverheadCircuitBreaker extends CircuitBreaker {
+
+  public static final String SYSPROP_WINDOW_SECONDS =
+      CircuitBreakerRegistry.SYSPROP_PREFIX + "gcoverhead.windowSeconds";
+  public static final long DEFAULT_WINDOW_SECONDS = 30L;
+
+  private static final List<GarbageCollectorMXBean> GC_BEANS =
+      List.copyOf(ManagementFactory.getGarbageCollectorMXBeans());
+
+  private double overheadThresholdPercent;
+  private final long windowNanos;
+  private final TtlSampledMetric ratioCache;
+  private final AtomicReference<Sample> anchor = new AtomicReference<>();
+
+  private static final ThreadLocal<Double> seenOverheadPercent = 
ThreadLocal.withInitial(() -> 0.0);
+  private static final ThreadLocal<Double> allowedOverheadPercent =
+      ThreadLocal.withInitial(() -> 0.0);
+
+  public GcOverheadCircuitBreaker() {
+    super();

Review Comment:
   ```suggestion
   ```



##########
solr/core/src/java/org/apache/solr/util/circuitbreaker/GcOverheadCircuitBreaker.java:
##########


Review Comment:
   So I like where this is going, but I'm afraid it's a little heavy-handed. 
Right now we will basically stop all requests for 30s if GC overhead is high, 
and only see after 30s whether it has gone down. Instead, should we have a poll 
rate and use a `Deque` to keep track of a rolling window of GC times (at every 
poll we subtract the oldest entry from our totals and add the newest)? That way 
we can check, say, every 100ms but still use a rolling window of 30s to get a rate.
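
   A minimal sketch of the rolling-window idea, with hypothetical names (`RollingGcWindowSketch`, `poll`). It keeps cumulative samples in an `ArrayDeque` and diffs the newest against the oldest one still inside the window, which is equivalent to maintaining a running total. In the breaker, `cumulativeGcMillis` would presumably be the sum of `GarbageCollectorMXBean.getCollectionTime()` across all collectors; here it is passed in, together with the clock, so the sketch stays testable.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch; names are illustrative and not part of the PR.
class RollingGcWindowSketch {
  private static final class Sample {
    final long wallNanos;
    final long gcMillis;

    Sample(long wallNanos, long gcMillis) {
      this.wallNanos = wallNanos;
      this.gcMillis = gcMillis;
    }
  }

  private final long windowNanos;
  private final Deque<Sample> samples = new ArrayDeque<>();

  RollingGcWindowSketch(long windowNanos) {
    this.windowNanos = windowNanos;
  }

  // Called once per poll interval (e.g. every 100ms); returns GC overhead in percent
  // over the samples that are still inside the rolling window.
  synchronized double poll(long nowNanos, long cumulativeGcMillis) {
    samples.addLast(new Sample(nowNanos, cumulativeGcMillis));
    // Evict samples older than the window, always keeping at least one.
    while (samples.size() > 1 && nowNanos - samples.peekFirst().wallNanos > windowNanos) {
      samples.pollFirst();
    }
    Sample oldest = samples.peekFirst();
    double elapsedMillis = (nowNanos - oldest.wallNanos) / 1_000_000.0;
    if (elapsedMillis <= 0) {
      return 0.0; // first sample: nothing to compare against yet
    }
    return 100.0 * (cumulativeGcMillis - oldest.gcMillis) / elapsedMillis;
  }
}
```

   With a 100ms poll rate and a 30s window the deque holds roughly 300 samples, and the reported rate can start dropping as soon as GC activity subsides rather than only at the next 30s boundary.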



##########
solr/core/src/java/org/apache/solr/servlet/SolrQoSFilter.java:
##########
@@ -0,0 +1,733 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.servlet;
+
+import com.google.common.annotations.VisibleForTesting;
+import io.opentelemetry.api.metrics.LongCounter;
+import io.opentelemetry.api.metrics.ObservableLongGauge;
+import jakarta.servlet.AsyncContext;
+import jakarta.servlet.AsyncEvent;
+import jakarta.servlet.AsyncListener;
+import jakarta.servlet.FilterChain;
+import jakarta.servlet.FilterConfig;
+import jakarta.servlet.ServletException;
+import jakarta.servlet.UnavailableException;
+import jakarta.servlet.http.HttpServletRequest;
+import jakarta.servlet.http.HttpServletResponse;
+import java.io.IOException;
+import java.lang.invoke.MethodHandles;
+import java.util.List;
+import java.util.Locale;
+import java.util.Queue;
+import java.util.concurrent.ConcurrentLinkedQueue;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.ScheduledFuture;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicInteger;
+import java.util.concurrent.atomic.AtomicLong;
+import org.apache.solr.client.solrj.SolrRequest.SolrRequestType;
+import org.apache.solr.common.params.CommonParams;
+import org.apache.solr.common.util.EnvUtils;
+import org.apache.solr.core.CoreContainer;
+import org.apache.solr.metrics.SolrMetricManager;
+import org.apache.solr.util.circuitbreaker.CircuitBreaker;
+import org.apache.solr.util.circuitbreaker.CircuitBreakerRegistry;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Quality-of-Service filter that asynchronously suspends requests when 
circuit breakers (global or
+ * per-core) are tripped, instead of failing fast with a {@code 429}. 
Suspended requests are queued
+ * by priority (and request type) and resumed by a periodic scheduler that 
re-checks the {@link
+ * CircuitBreakerRegistry}; once no breaker is tripped for a request's type, 
the request is
+ * dispatched to the rest of the chain.
+ *
+ * <p>Designed after Jetty's {@code QoSFilter}/{@code QoSHandler}: instead of 
dropping load during
+ * transient spikes (e.g. GC pauses), the filter buffers requests, prioritizes 
intra-cluster shard /
+ * admin / health-check traffic over user queries over background updates, and 
shed-loads only when
+ * the suspension queue saturates or a request's suspension timeout elapses.
+ *
+ * <p>This filter always synchronously checks circuit breakers on the request 
path. When QoS
+ * queueing is enabled (the default; toggle via {@value 
#SYSPROP_QOS_ENABLED}), a tripped breaker
+ * causes the request to be suspended (via {@link 
HttpServletRequest#startAsync()}) and queued by
+ * priority + type until the breaker clears. When QoS queueing is disabled, a 
tripped breaker causes
+ * a synchronous fail-fast {@code 429} — preserving the legacy {@code 
SearchHandler} / {@code
+ * ContentStreamHandlerBase} behavior that previously enforced the same check 
inside the handlers.
+ *
+ * <p>Internal intra-cluster shard requests (those carrying {@code 
Solr-Request-Context: SERVER})
+ * always bypass both the breaker check and suspension to avoid distributed 
deadlock — sub-shard
+ * requests cannot be queued waiting on a parent that is itself waiting.
+ *
+ * <p>All filters and the SolrServlet on the request path must declare {@code
+ * <async-supported>true</async-supported>}; otherwise {@link 
HttpServletRequest#startAsync()} will
+ * throw at runtime when a circuit breaker first trips.
+ *
+ * @lucene.experimental
+ */
+public class SolrQoSFilter extends CoreContainerAwareHttpFilter {

Review Comment:
   This is a lot. Would it be possible to have Solr's `RequestHandlerBase` 
extend Jetty's `QoSHandler`? And we could have a reference to a shared class 
that keeps track of the limits together? I just don't want to essentially fork 
the Jetty QoS code.
   
   I mean, there's the `QoSFilter` we could possibly use too, but I'm not sure 
it would be as easy to hook up. And it's deprecated.



##########
solr/core/src/java/org/apache/solr/util/circuitbreaker/TtlSampledMetric.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.util.circuitbreaker;
+
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicReference;
+import java.util.function.DoubleSupplier;
+
+/**
+ * Time-bounded cache around an expensive double-valued metric sample. 
Concurrent readers within one
+ * TTL window share a single computed value; on cache miss, multiple threads 
may each compute before
+ * the latest store wins, which is bounded and acceptable.
+ *
+ * <p>Used by {@link LoadAverageCircuitBreaker} and {@link CPUCircuitBreaker} 
so that high-QPS
+ * admission control does not poll OS load-average syscalls or Prometheus 
metric scans more often
+ * than the underlying signals can move.
+ */
+final class TtlSampledMetric {
+
+  private final long ttlNanos;
+  private final AtomicReference<Sample> sample = new AtomicReference<>();
+
+  TtlSampledMetric(long ttlMs) {
+    this.ttlNanos = TimeUnit.MILLISECONDS.toNanos(ttlMs);
+  }
+
+  double get(DoubleSupplier source) {

Review Comment:
   Might make more sense for the `DoubleSupplier` (or generic `Supplier<T>`) to 
be provided in the constructor...
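
   Roughly what that could look like, as a hedged sketch: the class name `TtlSampledMetricSketch`, the no-argument `get()`, and the injectable `clock` parameter are assumptions for illustration, and the caching logic is a guess at the intent described in the Javadoc rather than a copy of the PR's implementation.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.DoubleSupplier;
import java.util.function.LongSupplier;

// Hypothetical sketch; names are illustrative and not part of the PR.
final class TtlSampledMetricSketch {
  private static final class Sample {
    final long atNanos;
    final double value;

    Sample(long atNanos, double value) {
      this.atNanos = atNanos;
      this.value = value;
    }
  }

  private final long ttlNanos;
  private final DoubleSupplier source; // fixed at construction instead of per call
  private final LongSupplier clock; // assumption: injectable so tests can control time
  private final AtomicReference<Sample> sample = new AtomicReference<>();

  TtlSampledMetricSketch(long ttlMs, DoubleSupplier source) {
    this(ttlMs, source, System::nanoTime);
  }

  TtlSampledMetricSketch(long ttlMs, DoubleSupplier source, LongSupplier clock) {
    this.ttlNanos = TimeUnit.MILLISECONDS.toNanos(ttlMs);
    this.source = source;
    this.clock = clock;
  }

  // Returns the cached value, recomputing once the TTL has elapsed. Racing threads
  // may each recompute on a miss; the last store wins.
  double get() {
    long now = clock.getAsLong();
    Sample current = sample.get();
    if (current == null || now - current.atNanos >= ttlNanos) {
      current = new Sample(now, source.getAsDouble());
      sample.set(current);
    }
    return current.value;
  }
}
```

   Binding the supplier once also rules out two callers passing different sources to the same cache.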



##########
solr/core/src/java/org/apache/solr/util/circuitbreaker/MemoryCircuitBreaker.java:
##########
@@ -96,19 +86,65 @@ public MemoryCircuitBreaker setThreshold(double 
thresholdValueInPercentage) {
 
   @Override
   public boolean isTripped() {
-
-    long localAllowedMemory = getCurrentMemoryThreshold();
+    long localAllowedMemory = heapMemoryThreshold;
     long localSeenMemory = getAvgMemoryUsage();
 
     allowedMemory.set(localAllowedMemory);
-
     seenMemory.set(localSeenMemory);
 
-    return (localSeenMemory >= localAllowedMemory);
+    return localSeenMemory >= localAllowedMemory;
   }
 
+  /**
+   * Returns post-GC live bytes in the old-gen pool (or the sum across all 
heap pools when no
+   * dedicated old-gen pool exists). Cached for {@link 
CircuitBreakerRegistry#SYSPROP_SAMPLE_TTL_MS}
+   * so high-QPS callers don't repeatedly walk {@link MemoryPoolMXBean} list.
+   *
+   * <p>The historical name is preserved for source-compatibility with 
subclasses that override the
+   * value source for testing; the implementation no longer averages anything.
+   */
   protected long getAvgMemoryUsage() {
-    return (long) averagingMetricProvider.get().getMetricValue();
+    return (long) 
heapLiveCache.get(MemoryCircuitBreaker::samplePostGcLiveBytes);

Review Comment:
   So the other CB classes read the cached value in `isTripped()` and leave 
the calculation logic in the getter (here, `getAvgMemoryUsage()`). Should 
`getAvgMemoryUsage()` just contain the logic of `samplePostGcLiveBytes()`, so 
that `isTripped()` can call `heapLiveCache.get(this::getAvgMemoryUsage)`?
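
   A sketch of that shape, under stated assumptions: `MemoryBreakerSketch` is hypothetical, the inline cache stands in for `heapLiveCache`, and the `Runtime`-based calculation is only a placeholder for the post-GC live-bytes sampling in the PR.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch; names are illustrative and not part of the PR.
class MemoryBreakerSketch {
  private final long heapMemoryThreshold;
  private final long ttlNanos;
  private final LongSupplier clock; // assumption: injectable so tests can control time
  private boolean hasSample;
  private long sampledAtNanos;
  private long sampledValue;

  MemoryBreakerSketch(long heapMemoryThreshold, long ttlNanos, LongSupplier clock) {
    this.heapMemoryThreshold = heapMemoryThreshold;
    this.ttlNanos = ttlNanos;
    this.clock = clock;
  }

  // Raw, uncached calculation. Subclasses (e.g. tests) override this to supply values.
  // Placeholder logic: used heap, not the post-GC live bytes the PR samples.
  protected long getAvgMemoryUsage() {
    Runtime rt = Runtime.getRuntime();
    return rt.totalMemory() - rt.freeMemory();
  }

  // The caching lives here, so overriding the getter never bypasses or duplicates it.
  public boolean isTripped() {
    return cachedUsage() >= heapMemoryThreshold;
  }

  private synchronized long cachedUsage() {
    long now = clock.getAsLong();
    if (!hasSample || now - sampledAtNanos >= ttlNanos) {
      sampledValue = getAvgMemoryUsage();
      sampledAtNanos = now;
      hasSample = true;
    }
    return sampledValue;
  }
}
```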



##########
solr/core/src/java/org/apache/solr/servlet/InternalRequestUtils.java:
##########
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.servlet;
+
+import jakarta.servlet.http.HttpServletRequest;
+import org.apache.solr.client.solrj.SolrRequest;
+import org.apache.solr.common.params.CommonParams;
+
+/**
+ * Shared detection for internal intra-cluster server-to-server requests. 
Filter-tier admission
+ * control (e.g. {@link SolrQoSFilter}, {@link RateLimitManager}) consults 
this so sub-shard
+ * requests are unconditionally admitted — refusing them would risk 
distributed deadlock, since the
+ * parent request is already blocking on the sub-shard.
+ *
+ * <p>Mirrors {@code RequestHandlerBase.isInternalShardRequest} at the HTTP 
layer so the filter-tier
+ * definition stays consistent with the handler-tier one.
+ */
+final class InternalRequestUtils {
+
+  private InternalRequestUtils() {}
+
+  /**
+   * Whether {@code req} is an internal intra-cluster shard / admin request. 
Identified by any of:
+   *
+   * <ul>
+   *   <li>{@code Solr-Request-Context: SERVER} header — set by SolrJ on 
server-to-server traffic.
+   *   <li>{@code isShard=true} query parameter — set by {@code ShardHandler} 
on sub-shard fan-out.
+   *   <li>{@code distrib.from=…} query parameter — set by {@code 
DistributedUpdateProcessor} when
+   *       forwarding update requests between replicas.
+   * </ul>
+   */
+  static boolean isInternalServerRequest(HttpServletRequest req) {

Review Comment:
   Isn't this something that an attacker could use to bypass the rate limiters?



##########
solr/core/src/java/org/apache/solr/util/circuitbreaker/CircuitBreakerRegistry.java:
##########
@@ -52,16 +52,27 @@
 public class CircuitBreakerRegistry implements Closeable {
   private static final Logger log = 
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
   private static final Pattern SYSPROP_REGEX =
-      
Pattern.compile("solr.circuitbreaker\\.(update|query)\\.(cpu|mem|loadavg)");
+      
Pattern.compile("solr.circuitbreaker\\.(update|query)\\.(cpu|mem|loadavg|gcoverhead)");
   public static final String SYSPROP_PREFIX = "solr.circuitbreaker.";
   public static final String SYSPROP_UPDATE_CPU = SYSPROP_PREFIX + 
"update.cpu";
   public static final String SYSPROP_UPDATE_MEM = SYSPROP_PREFIX + 
"update.mem";
   public static final String SYSPROP_UPDATE_LOADAVG = SYSPROP_PREFIX + 
"update.loadavg";
+  public static final String SYSPROP_UPDATE_GCOVERHEAD = SYSPROP_PREFIX + 
"update.gcoverhead";
   public static final String SYSPROP_QUERY_CPU = SYSPROP_PREFIX + "query.cpu";
   public static final String SYSPROP_QUERY_MEM = SYSPROP_PREFIX + "query.mem";
   public static final String SYSPROP_QUERY_LOADAVG = SYSPROP_PREFIX + 
"query.loadavg";
+  public static final String SYSPROP_QUERY_GCOVERHEAD = SYSPROP_PREFIX + 
"query.gcoverhead";
   public static final String SYSPROP_WARN_ONLY_SUFFIX = ".warnonly";
 
+  /**
+   * Default TTL (ms) of cached load-average / CPU samples consulted by {@link
+   * LoadAverageCircuitBreaker} and {@link CPUCircuitBreaker}. Override 
per-process via the {@value
+   * #SYSPROP_SAMPLE_TTL_MS} system property.
+   */
+  public static final long DEFAULT_SAMPLE_TTL_MS = 1000L;

Review Comment:
   I would probably change this to 100ms?



##########
solr/core/src/java/org/apache/solr/servlet/InternalRequestUtils.java:
##########
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.servlet;
+
+import jakarta.servlet.http.HttpServletRequest;
+import org.apache.solr.client.solrj.SolrRequest;
+import org.apache.solr.common.params.CommonParams;
+
+/**
+ * Shared detection for internal intra-cluster server-to-server requests. 
Filter-tier admission
+ * control (e.g. {@link SolrQoSFilter}, {@link RateLimitManager}) consults 
this so sub-shard
+ * requests are unconditionally admitted — refusing them would risk 
distributed deadlock, since the
+ * parent request is already blocking on the sub-shard.
+ *
+ * <p>Mirrors {@code RequestHandlerBase.isInternalShardRequest} at the HTTP 
layer so the filter-tier
+ * definition stays consistent with the handler-tier one.
+ */
+final class InternalRequestUtils {
+
+  private InternalRequestUtils() {}
+
+  /**
+   * Whether {@code req} is an internal intra-cluster shard / admin request. 
Identified by any of:
+   *
+   * <ul>
+   *   <li>{@code Solr-Request-Context: SERVER} header — set by SolrJ on 
server-to-server traffic.
+   *   <li>{@code isShard=true} query parameter — set by {@code ShardHandler} 
on sub-shard fan-out.
+   *   <li>{@code distrib.from=…} query parameter — set by {@code 
DistributedUpdateProcessor} when
+   *       forwarding update requests between replicas.
+   * </ul>
+   */
+  static boolean isInternalServerRequest(HttpServletRequest req) {

Review Comment:
   OK, I see this isn't something new; you are just restructuring.



##########
solr/core/src/java/org/apache/solr/util/circuitbreaker/GcOverheadCircuitBreaker.java:
##########
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.util.circuitbreaker;
+
+import java.lang.management.GarbageCollectorMXBean;
+import java.lang.management.ManagementFactory;
+import java.util.List;
+import java.util.Locale;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicReference;
+import org.apache.solr.common.util.EnvUtils;
+
+/**
+ * Trips when the JVM is spending more than a configured percentage of 
wall-clock time in garbage
+ * collection over a sliding window.
+ *
+ * <p>Complementary to {@link MemoryCircuitBreaker}: that breaker fires when 
post-GC live data is
+ * exhausting the heap; this one fires when GC is keeping up (live data may 
even be small) but
+ * consuming so much CPU that the application is starving. Both conditions 
usually precede an OOM
+ * but each catches the other's blind spot.
+ *
+ * <p>The percentage is computed as
+ *
+ * <pre>{@code
+ * sum(GarbageCollectorMXBean.getCollectionTime()) over window
+ *   ─────────────────────────────────────────────────────────  × 100
+ *                       wall-clock window
+ * }</pre>
+ *
+ * <p>The window length defaults to {@link #DEFAULT_WINDOW_SECONDS} seconds 
and is configurable via
+ * {@value #SYSPROP_WINDOW_SECONDS}. Within one window the breaker reports the 
GC ratio over the
+ * time elapsed since the anchor sample; once the window length is exceeded, 
the anchor slides
+ * forward to the most recent sample so the ratio reflects only recent 
behavior.
+ *
+ * <p>{@link #isTripped()} is rate-limited by the same {@code 
TtlSampledMetric} used by the
+ * load-average and CPU breakers ({@value 
CircuitBreakerRegistry#SYSPROP_SAMPLE_TTL_MS}), so
+ * concurrent admission-control callers share one ratio computation per cache 
window.
+ */
+public class GcOverheadCircuitBreaker extends CircuitBreaker {
+
+  public static final String SYSPROP_WINDOW_SECONDS =
+      CircuitBreakerRegistry.SYSPROP_PREFIX + "gcoverhead.windowSeconds";
+  public static final long DEFAULT_WINDOW_SECONDS = 30L;
+
+  private static final List<GarbageCollectorMXBean> GC_BEANS =
+      List.copyOf(ManagementFactory.getGarbageCollectorMXBeans());
+
+  private double overheadThresholdPercent;
+  private final long windowNanos;
+  private final TtlSampledMetric ratioCache;
+  private final AtomicReference<Sample> anchor = new AtomicReference<>();
+
+  private static final ThreadLocal<Double> seenOverheadPercent = 
ThreadLocal.withInitial(() -> 0.0);
+  private static final ThreadLocal<Double> allowedOverheadPercent =
+      ThreadLocal.withInitial(() -> 0.0);
+
+  public GcOverheadCircuitBreaker() {
+    super();
+    long windowSeconds = EnvUtils.getPropertyAsLong(SYSPROP_WINDOW_SECONDS, 
DEFAULT_WINDOW_SECONDS);
+    if (windowSeconds <= 0) {
+      throw new IllegalArgumentException(
+          "Invalid GC overhead window: " + windowSeconds + "s (must be 
positive)");
+    }
+    this.windowNanos = TimeUnit.SECONDS.toNanos(windowSeconds);
+    this.ratioCache =
+        new TtlSampledMetric(
+            EnvUtils.getPropertyAsLong(
+                CircuitBreakerRegistry.SYSPROP_SAMPLE_TTL_MS,
+                CircuitBreakerRegistry.DEFAULT_SAMPLE_TTL_MS));
+  }
+
+  public GcOverheadCircuitBreaker setThreshold(double 
thresholdValueInPercentage) {
+    if (thresholdValueInPercentage <= 0 || thresholdValueInPercentage > 100) {
+      throw new IllegalArgumentException(
+          "GC overhead threshold must be in (0, 100]; got " + 
thresholdValueInPercentage);
+    }
+    this.overheadThresholdPercent = thresholdValueInPercentage;
+    return this;
+  }
+
+  public double getOverheadThresholdPercent() {
+    return overheadThresholdPercent;
+  }
+
+  @Override
+  public boolean isTripped() {
+    double localAllowed = overheadThresholdPercent;
+    double localSeen = ratioCache.get(this::computeGcOverheadPercent);
+
+    allowedOverheadPercent.set(localAllowed);
+    seenOverheadPercent.set(localSeen);
+
+    return localSeen >= localAllowed;

Review Comment:
   If it's called `allowed`, then shouldn't we be checking that 
`localSeen > localAllowed`? It's weird that `seen == allowed` would be an error.



##########
solr/core/src/java/org/apache/solr/servlet/RateLimitManager.java:
##########


Review Comment:
   What's the point of this class now that `SolrQoSFilter` exists? Is it for 
back-compat? Is it just a lighter-weight version?



##########
solr/core/src/java/org/apache/solr/servlet/SolrQoSFilter.java:
##########
@@ -0,0 +1,733 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.servlet;
+
+import com.google.common.annotations.VisibleForTesting;
+import io.opentelemetry.api.metrics.LongCounter;
+import io.opentelemetry.api.metrics.ObservableLongGauge;
+import jakarta.servlet.AsyncContext;
+import jakarta.servlet.AsyncEvent;
+import jakarta.servlet.AsyncListener;
+import jakarta.servlet.FilterChain;
+import jakarta.servlet.FilterConfig;
+import jakarta.servlet.ServletException;
+import jakarta.servlet.UnavailableException;
+import jakarta.servlet.http.HttpServletRequest;
+import jakarta.servlet.http.HttpServletResponse;
+import java.io.IOException;
+import java.lang.invoke.MethodHandles;
+import java.util.List;
+import java.util.Locale;
+import java.util.Queue;
+import java.util.concurrent.ConcurrentLinkedQueue;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.ScheduledFuture;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicInteger;
+import java.util.concurrent.atomic.AtomicLong;
+import org.apache.solr.client.solrj.SolrRequest.SolrRequestType;
+import org.apache.solr.common.params.CommonParams;
+import org.apache.solr.common.util.EnvUtils;
+import org.apache.solr.core.CoreContainer;
+import org.apache.solr.metrics.SolrMetricManager;
+import org.apache.solr.util.circuitbreaker.CircuitBreaker;
+import org.apache.solr.util.circuitbreaker.CircuitBreakerRegistry;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Quality-of-Service filter that asynchronously suspends requests when circuit breakers (global or
+ * per-core) are tripped, instead of failing fast with a {@code 429}. Suspended requests are queued
+ * by priority (and request type) and resumed by a periodic scheduler that re-checks the {@link
+ * CircuitBreakerRegistry}; once no breaker is tripped for a request's type, the request is
+ * dispatched to the rest of the chain.
+ *
+ * <p>Designed after Jetty's {@code QoSFilter}/{@code QoSHandler}: instead of dropping load during
+ * transient spikes (e.g. GC pauses), the filter buffers requests, prioritizes intra-cluster shard /
+ * admin / health-check traffic over user queries over background updates, and shed-loads only when
+ * the suspension queue saturates or a request's suspension timeout elapses.
+ *
+ * <p>This filter always synchronously checks circuit breakers on the request path. When QoS
+ * queueing is enabled (the default; toggle via {@value #SYSPROP_QOS_ENABLED}), a tripped breaker
+ * causes the request to be suspended (via {@link HttpServletRequest#startAsync()}) and queued by
+ * priority + type until the breaker clears. When QoS queueing is disabled, a tripped breaker causes
+ * a synchronous fail-fast {@code 429} — preserving the legacy {@code SearchHandler} / {@code
+ * ContentStreamHandlerBase} behavior that previously enforced the same check inside the handlers.
+ *
+ * <p>Internal intra-cluster shard requests (those carrying {@code Solr-Request-Context: SERVER})
+ * always bypass both the breaker check and suspension to avoid distributed deadlock — sub-shard
+ * requests cannot be queued waiting on a parent that is itself waiting.
+ *
+ * <p>All filters and the SolrServlet on the request path must declare {@code
+ * <async-supported>true</async-supported>}; otherwise {@link HttpServletRequest#startAsync()} will
+ * throw at runtime when a circuit breaker first trips.
+ *
+ * @lucene.experimental
+ */
+public class SolrQoSFilter extends CoreContainerAwareHttpFilter {
+  private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
+
+  public static final String SYSPROP_QOS_ENABLED =
+      CircuitBreakerRegistry.SYSPROP_PREFIX + "qos.enabled";
+  public static final String SYSPROP_MAX_SUSPENDED =
+      CircuitBreakerRegistry.SYSPROP_PREFIX + "qos.maxSuspendedRequests";
+  public static final String SYSPROP_SUSPEND_TIMEOUT_MS =
+      CircuitBreakerRegistry.SYSPROP_PREFIX + "qos.suspendTimeoutMs";
+  public static final String SYSPROP_CHECK_INTERVAL_MS =
+      CircuitBreakerRegistry.SYSPROP_PREFIX + "qos.checkIntervalMs";
+  public static final String SYSPROP_EVAL_INTERVAL_MS =
+      CircuitBreakerRegistry.SYSPROP_PREFIX + "qos.evaluationIntervalMs";
+  public static final String SYSPROP_DRAIN_BUDGET =
+      CircuitBreakerRegistry.SYSPROP_PREFIX + "qos.drainBudget";
+  public static final String SYSPROP_HIGH_SHARE =
+      CircuitBreakerRegistry.SYSPROP_PREFIX + "qos.priority.high.share";
+  public static final String SYSPROP_MEDIUM_SHARE =
+      CircuitBreakerRegistry.SYSPROP_PREFIX + "qos.priority.medium.share";
+  public static final String SYSPROP_HIGH_DRAIN_MULTIPLIER =
+      CircuitBreakerRegistry.SYSPROP_PREFIX + "qos.drainBudget.highMultiplier";
+  public static final String SYSPROP_LOW_DRAIN_MULTIPLIER =
+      CircuitBreakerRegistry.SYSPROP_PREFIX + "qos.drainBudget.lowMultiplier";
+
+  /**
+   * Optional client-supplied priority hint. Values are case-insensitive {@code HIGH}, {@code
+   * MEDIUM}, or {@code LOW}. Unknown values are ignored and the filter falls back to its path /
+   * type heuristic. Trusted callers (internal admin tooling, well-behaved SolrJ extensions) may use
+   * this to override the default lane assignment without the filter having to inspect the request
+   * body.
+   */
+  public static final String HEADER_REQUEST_PRIORITY = "Solr-Request-Priority";
+
+  static final int DEFAULT_MAX_SUSPENDED_REQUESTS = 1024;
+  static final long DEFAULT_SUSPEND_TIMEOUT_MS = 5000L;
+  static final long DEFAULT_CHECK_INTERVAL_MS = 100L;
+  static final long DEFAULT_EVAL_INTERVAL_MS = 200L;
+  static final int DEFAULT_DRAIN_BUDGET = 200;
+  static final double DEFAULT_HIGH_SHARE = 0.10;
+  static final double DEFAULT_MEDIUM_SHARE = 0.60;
+  static final double DEFAULT_HIGH_DRAIN_MULTIPLIER = 4.0;
+  static final double DEFAULT_LOW_DRAIN_MULTIPLIER = 0.5;
+
+  static final int PRIORITY_HIGH = 2; // admin/ping/health-check
+  static final int PRIORITY_MEDIUM = 1; // user QUERY
+  static final int PRIORITY_LOW = 0; // UPDATE
+  static final int N_PRIORITIES = 3;
+
+  static final int TYPE_QUERY = 0;
+  static final int TYPE_UPDATE = 1;
+  static final int N_TYPES = 2;
+
+  private static final String ATTR_RESUMED = SolrQoSFilter.class.getName() + ".resumed";
+
+  private static final String METRIC_PREFIX = "qos.";
+
+  private boolean enabled;
+  private int maxSuspendedRequests;
+  private long suspendTimeoutMs;
+  private long checkIntervalMs;
+  private long evaluationIntervalNanos;
+  private int drainBudget;
+  private final int[] priorityCap = new int[N_PRIORITIES];
+  private final int[] priorityDrainBudget = new int[N_PRIORITIES];
+
+  // Cache of the last circuit-breaker scan per request type. Several admission-control checks
+  // (load average, CPU) are expensive and sourced from metrics that don't update faster than
+  // ~once per second anyway, so polling them per-request scales badly under load. We share
+  // one evaluation across all callers within evaluationIntervalNanos.
+  private volatile TrippedScan queryScan;
+  private volatile TrippedScan updateScan;
+
+  private final Queue<AsyncContext>[][] queues;
+  private final AtomicInteger suspendedCount = new AtomicInteger();
+  private final AtomicInteger[] suspendedByPriority = new AtomicInteger[N_PRIORITIES];
+  private final AtomicLong totalSuspended = new AtomicLong();
+  private final AtomicLong totalResumed = new AtomicLong();
+  private final AtomicLong totalExpired = new AtomicLong();
+  private final AtomicLong totalRejected = new AtomicLong();
+
+  private ScheduledExecutorService scheduler;
+  private ScheduledFuture<?> drainTask;
+
+  private LongCounter metricSuspended;
+  private LongCounter metricResumed;
+  private LongCounter metricExpired;
+  private LongCounter metricRejected;
+  private ObservableLongGauge metricCurrentSuspendedGauge;
+
+  // Test seam: when non-null, used in place of {@link #getCores()} so unit tests can run
+  // without spinning up a full CoreContainerProvider/servlet context.
+  private CoreContainer testCoreContainer;
+  private boolean skipContainerLookup;
+
+  @SuppressWarnings({"unchecked", "rawtypes"})
+  public SolrQoSFilter() {
+    queues = new Queue[N_PRIORITIES][N_TYPES];
+    for (int p = 0; p < N_PRIORITIES; p++) {
+      suspendedByPriority[p] = new AtomicInteger();
+      for (int t = 0; t < N_TYPES; t++) {
+        queues[p][t] = new ConcurrentLinkedQueue<>();
+      }
+    }
+  }
+
+  private void registerMetrics() {
+    SolrMetricManager mm;
+    try {
+      CoreContainer cc = resolveCoreContainer();
+      mm = (cc == null) ? null : cc.getMetricManager();
+    } catch (UnavailableException e) {
+      log.warn("CoreContainer unavailable at SolrQoSFilter init; metrics not registered", e);
+      return;
+    }
+    if (mm == null) {
+      return;
+    }
+    String registry = SolrMetricManager.NODE_REGISTRY;
+    metricSuspended =
+        mm.longCounter(
+            registry,
+            METRIC_PREFIX + "suspended.total",
+            "Requests suspended by SolrQoSFilter waiting for a tripped circuit breaker to clear",
+            null);
+    metricResumed =
+        mm.longCounter(
+            registry,
+            METRIC_PREFIX + "resumed.total",
+            "Suspended requests dispatched by SolrQoSFilter once their circuit breaker cleared",
+            null);
+    metricExpired =
+        mm.longCounter(
+            registry,
+            METRIC_PREFIX + "expired.total",
+            "Suspended requests rejected by SolrQoSFilter after the suspension timeout elapsed",
+            null);
+    metricRejected =
+        mm.longCounter(
+            registry,
+            METRIC_PREFIX + "rejected.total",
+            "Requests rejected fast by SolrQoSFilter because the suspension queue was full",
+            null);
+    metricCurrentSuspendedGauge =
+        mm.observableLongGauge(
+            registry,
+            METRIC_PREFIX + "suspended.current",
+            "Requests currently suspended by SolrQoSFilter",
+            measurement -> measurement.record(suspendedCount.get()),
+            null);
+  }
+
+  @Override
+  public void destroy() {
+    if (drainTask != null) {
+      drainTask.cancel(false);
+    }
+    if (scheduler != null) {
+      scheduler.shutdownNow();
+    }
+    if (metricCurrentSuspendedGauge != null) {
+      metricCurrentSuspendedGauge.close();
+    }
+    super.destroy();
+  }
+
+  @Override
+  protected void doFilter(HttpServletRequest req, HttpServletResponse res, FilterChain chain)
+      throws IOException, ServletException {
+    if (Boolean.TRUE.equals(req.getAttribute(ATTR_RESUMED))) {
+      // Resumed dispatch from drainQueue — already passed the breaker check; let it through.
+      chain.doFilter(req, res);
+      return;
+    }
+
+    if (InternalRequestUtils.isInternalServerRequest(req)) {
+      // Sub-shard requests must not be queued or fail-fast — the parent request is blocking on
+      // them and refusing them would cause distributed deadlock.
+      chain.doFilter(req, res);
+      return;
+    }
+
+    SolrRequestType type = inferRequestType(req);
+    CoreContainer cc;
+    try {
+      cc = resolveCoreContainer();
+    } catch (UnavailableException ex) {
+      // Container shutting down or not yet up — pass through.
+      chain.doFilter(req, res);
+      return;
+    }
+    List<CircuitBreaker> tripped = trippedCached(cc, type);
+    if (tripped == null) {
+      chain.doFilter(req, res);
+      return;
+    }
+
+    if (!enabled) {
+      // QoS queueing disabled: preserve the legacy SearchHandler / ContentStreamHandlerBase
+      // behavior of failing fast with a 429 when a breaker is tripped.
+      res.sendError(
+          CircuitBreaker.getExceptionErrorCode().code,
+          CircuitBreakerRegistry.toErrorMessage(tripped));
+      return;
+    }
+
+    if (suspendedCount.get() >= maxSuspendedRequests) {
+      rejectQueueFull(res, "Server overloaded; suspended-request queue full");
+      return;
+    }
+
+    int priority = inferPriority(req, type);
+    int typeIdx = typeIndex(type);
+    if (suspendedByPriority[priority].get() >= priorityCap[priority]) {
+      // Lane saturated — refuse fast, but only at this priority. A flood of LOW-priority work
+      // cannot starve HIGH-priority probes because each lane has its own cap.
+      rejectQueueFull(res, "Server overloaded; priority lane queue full");
+      return;
+    }
+
+    AsyncContext asyncContext = req.startAsync();
+    asyncContext.setTimeout(suspendTimeoutMs);
+    asyncContext.addListener(new TimeoutListener(priority, typeIdx));
+    suspendedCount.incrementAndGet();
+    suspendedByPriority[priority].incrementAndGet();
+    totalSuspended.incrementAndGet();
+    if (metricSuspended != null) {
+      metricSuspended.add(1);
+    }
+    queues[priority][typeIdx].add(asyncContext);
+    if (log.isDebugEnabled()) {
+      log.debug(
+          "SolrQoSFilter suspended request priority={} type={} ({} now suspended)",
+          priority,
+          type,
+          suspendedCount.get());
+    }
+  }
+
+  private void rejectQueueFull(HttpServletResponse res, String message) throws IOException {
+    totalRejected.incrementAndGet();
+    if (metricRejected != null) {
+      metricRejected.add(1);
+    }
+    res.sendError(CircuitBreaker.getExceptionErrorCode().code, message);
+  }
+
+  private void drainAll() {
+    try {
+      CoreContainer cc;
+      try {
+        cc = resolveCoreContainer();
+      } catch (UnavailableException ex) {
+        return;
+      }
+      for (int p = N_PRIORITIES - 1; p >= 0; p--) {
+        for (int t = 0; t < N_TYPES; t++) {
+          drainQueue(queues[p][t], p, typeFromIndex(t), cc, priorityDrainBudget[p]);
+        }
+      }
+    } catch (Throwable t) {
+      log.error("SolrQoSFilter drain failed", t);
+    }
+  }
+
+  private void drainQueue(
+      Queue<AsyncContext> q, int priority, SolrRequestType type, CoreContainer cc, int budgetCap) {
+    if (q.isEmpty()) {
+      return;
+    }
+    List<CircuitBreaker> tripped = trippedCached(cc, type);
+    if (tripped != null) {
+      return;
+    }
+    int budget = budgetCap;
+    while (budget-- > 0) {
+      AsyncContext popped = q.poll();
+      if (popped == null) {
+        return;
+      }
+      try {
+        HttpServletRequest req = (HttpServletRequest) popped.getRequest();
+        req.setAttribute(ATTR_RESUMED, Boolean.TRUE);
+        popped.dispatch();
+        suspendedCount.decrementAndGet();
+        suspendedByPriority[priority].decrementAndGet();
+        totalResumed.incrementAndGet();
+        if (metricResumed != null) {
+          metricResumed.add(1);
+        }
+      } catch (IllegalStateException ise) {
+        // Async context already terminal (timed out or errored). The TimeoutListener already
+        // decremented both counters; just discard the dead entry.
+        if (log.isDebugEnabled()) {
+          log.debug("SolrQoSFilter dispatch raced with completion", ise);
+        }
+      }
+    }
+  }
+
+  /**
+   * Determine request type from the {@code Solr-Request-Type} header, falling back to a crude path
+   * heuristic. Anything not recognized as UPDATE is treated as QUERY for circuit-breaker
+   * accounting.
+   *
+   * <p>The path fallback only matches paths containing {@code /update}; everything else (including
+   * admin endpoints such as {@code /solr/admin/collections}, security APIs, and core admin actions)
+   * is classified as QUERY. The current circuit-breaker registry only differentiates QUERY from
+   * UPDATE breakers, so this coarseness is acceptable, but if a future breaker type is added (e.g.
+   * ADMIN) this heuristic will need refinement.
+   */
+  static SolrRequestType inferRequestType(HttpServletRequest req) {

Review Comment:
   `HttpSolrCall` also does this kind of work, right? Can we combine the logic,
or at least make the two look similar?
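
   One possible shape for a shared helper, as a minimal sketch only. The class
   name `RequestTypeInference` and its `Type` enum are hypothetical stand-ins
   (not existing Solr classes), and the rule that an explicit header wins over
   the path is an assumption read off the Javadoc above, not confirmed against
   what `HttpSolrCall` actually does.

```java
import java.util.Locale;

/** Hypothetical shared helper; not an existing Solr class. */
final class RequestTypeInference {

  /** Stand-in for SolrRequestType so the sketch compiles on its own. */
  enum Type {
    QUERY,
    UPDATE
  }

  private RequestTypeInference() {}

  /**
   * @param headerValue value of the Solr-Request-Type header, or null if absent
   * @param path request path, or null if unknown
   */
  static Type infer(String headerValue, String path) {
    if (headerValue != null && !headerValue.isBlank()) {
      // Assumption: an explicit header wins over the path; only UPDATE is special-cased.
      String normalized = headerValue.trim().toUpperCase(Locale.ROOT);
      return "UPDATE".equals(normalized) ? Type.UPDATE : Type.QUERY;
    }
    // Path fallback: only paths containing /update count as UPDATE.
    return (path != null && path.contains("/update")) ? Type.UPDATE : Type.QUERY;
  }
}
```

   Taking plain strings rather than an `HttpServletRequest` would let the
   filter and the servlet-side caller delegate to the same method without
   sharing servlet plumbing.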



##########
solr/core/src/java/org/apache/solr/util/circuitbreaker/GcOverheadCircuitBreaker.java:
##########
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.util.circuitbreaker;
+
+import java.lang.management.GarbageCollectorMXBean;
+import java.lang.management.ManagementFactory;
+import java.util.List;
+import java.util.Locale;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicReference;
+import org.apache.solr.common.util.EnvUtils;
+
+/**
+ * Trips when the JVM is spending more than a configured percentage of wall-clock time in garbage
+ * collection over a sliding window.
+ *
+ * <p>Complementary to {@link MemoryCircuitBreaker}: that breaker fires when post-GC live data is
+ * exhausting the heap; this one fires when GC is keeping up (live data may even be small) but
+ * consuming so much CPU that the application is starving. Both conditions usually precede an OOM
+ * but each catches the other's blind spot.
+ *
+ * <p>The percentage is computed as
+ *
+ * <pre>{@code
+ * sum(GarbageCollectorMXBean.getCollectionTime()) over window
+ *   ─────────────────────────────────────────────────────────  × 100
+ *                       wall-clock window
+ * }</pre>
+ *
+ * <p>The window length defaults to {@link #DEFAULT_WINDOW_SECONDS} seconds and is configurable via
+ * {@value #SYSPROP_WINDOW_SECONDS}. Within one window the breaker reports the GC ratio over the
+ * time elapsed since the anchor sample; once the window length is exceeded, the anchor slides
+ * forward to the most recent sample so the ratio reflects only recent behavior.
+ *
+ * <p>{@link #isTripped()} is rate-limited by the same {@code TtlSampledMetric} used by the
+ * load-average and CPU breakers ({@value CircuitBreakerRegistry#SYSPROP_SAMPLE_TTL_MS}), so
+ * concurrent admission-control callers share one ratio computation per cache window.
+ */
+public class GcOverheadCircuitBreaker extends CircuitBreaker {
+
+  public static final String SYSPROP_WINDOW_SECONDS =
+      CircuitBreakerRegistry.SYSPROP_PREFIX + "gcoverhead.windowSeconds";
+  public static final long DEFAULT_WINDOW_SECONDS = 30L;
+
+  private static final List<GarbageCollectorMXBean> GC_BEANS =
+      List.copyOf(ManagementFactory.getGarbageCollectorMXBeans());
+
+  private double overheadThresholdPercent;
+  private final long windowNanos;
+  private final TtlSampledMetric ratioCache;
+  private final AtomicReference<Sample> anchor = new AtomicReference<>();
+
+  private static final ThreadLocal<Double> seenOverheadPercent = ThreadLocal.withInitial(() -> 0.0);
+  private static final ThreadLocal<Double> allowedOverheadPercent =
+      ThreadLocal.withInitial(() -> 0.0);
+
+  public GcOverheadCircuitBreaker() {
+    super();
+    long windowSeconds = EnvUtils.getPropertyAsLong(SYSPROP_WINDOW_SECONDS, DEFAULT_WINDOW_SECONDS);
+    if (windowSeconds <= 0) {
+      throw new IllegalArgumentException(
+          "Invalid GC overhead window: " + windowSeconds + "s (must be positive)");
+    }
+    this.windowNanos = TimeUnit.SECONDS.toNanos(windowSeconds);
+    this.ratioCache =
+        new TtlSampledMetric(
+            EnvUtils.getPropertyAsLong(
+                CircuitBreakerRegistry.SYSPROP_SAMPLE_TTL_MS,
+                CircuitBreakerRegistry.DEFAULT_SAMPLE_TTL_MS));
+  }
+
+  public GcOverheadCircuitBreaker setThreshold(double thresholdValueInPercentage) {
+    if (thresholdValueInPercentage <= 0 || thresholdValueInPercentage > 100) {
+      throw new IllegalArgumentException(
+          "GC overhead threshold must be in (0, 100]; got " + thresholdValueInPercentage);
+    }
+    this.overheadThresholdPercent = thresholdValueInPercentage;
+    return this;
+  }
+
+  public double getOverheadThresholdPercent() {
+    return overheadThresholdPercent;
+  }
+
+  @Override
+  public boolean isTripped() {
+    double localAllowed = overheadThresholdPercent;
+    double localSeen = ratioCache.get(this::computeGcOverheadPercent);
+
+    allowedOverheadPercent.set(localAllowed);
+    seenOverheadPercent.set(localSeen);
+
+    return localSeen >= localAllowed;

Review Comment:
   Ok, I see this is also done in `MemoryCircuitBreaker`.



##########
solr/core/src/java/org/apache/solr/util/circuitbreaker/TtlSampledMetric.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.util.circuitbreaker;
+
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicReference;
+import java.util.function.DoubleSupplier;
+
+/**
+ * Time-bounded cache around an expensive double-valued metric sample. Concurrent readers within one
+ * TTL window share a single computed value; on cache miss, multiple threads may each compute before
+ * the latest store wins, which is bounded and acceptable.
+ *
+ * <p>Used by {@link LoadAverageCircuitBreaker} and {@link CPUCircuitBreaker} so that high-QPS
+ * admission control does not poll OS load-average syscalls or Prometheus metric scans more often
+ * than the underlying signals can move.
+ */
+final class TtlSampledMetric {
+
+  private final long ttlNanos;
+  private final AtomicReference<Sample> sample = new AtomicReference<>();
+
+  TtlSampledMetric(long ttlMs) {
+    this.ttlNanos = TimeUnit.MILLISECONDS.toNanos(ttlMs);
+  }
+
+  double get(DoubleSupplier source) {

Review Comment:
   And yeah, the class should be typed so that `MemoryCircuitBreaker` can just
use a long instead of casting the double.
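
   A typed variant might look roughly like the sketch below. Everything in it
   is an assumption for illustration: the name `TtlSampledValue`, the nested
   `Sample` holder, and the injectable nano clock (added only to make the
   sketch testable) are hypothetical. The body of the PR's `get` is not shown
   in this hunk, so the caching logic is inferred from the class Javadoc.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical generic variant of TtlSampledMetric; names are illustrative only. */
final class TtlSampledValue<T> {

  /** Immutable (value, timestamp) pair published atomically. */
  private static final class Sample<T> {
    final T value;
    final long takenAtNanos;

    Sample(T value, long takenAtNanos) {
      this.value = value;
      this.takenAtNanos = takenAtNanos;
    }
  }

  private final long ttlNanos;
  private final LongSupplier nanoClock; // assumption: injectable clock, added for testability
  private final AtomicReference<Sample<T>> sample = new AtomicReference<>();

  TtlSampledValue(long ttlMs) {
    this(ttlMs, System::nanoTime);
  }

  TtlSampledValue(long ttlMs, LongSupplier nanoClock) {
    this.ttlNanos = TimeUnit.MILLISECONDS.toNanos(ttlMs);
    this.nanoClock = nanoClock;
  }

  T get(Supplier<T> source) {
    long now = nanoClock.getAsLong();
    Sample<T> current = sample.get();
    if (current != null && now - current.takenAtNanos < ttlNanos) {
      return current.value; // still fresh: share the cached value
    }
    // Cache miss: racing threads may each recompute; the latest store wins.
    T fresh = source.get();
    sample.set(new Sample<>(fresh, now));
    return fresh;
  }
}
```

   With this shape a memory breaker could hold a `TtlSampledValue<Long>` and
   the load-average breaker a `TtlSampledValue<Double>`. The generic form
   boxes each sample; separate `long` and `double` variants would avoid that
   at the cost of some duplication.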



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
