HoustonPutman commented on code in PR #4432:
URL: https://github.com/apache/solr/pull/4432#discussion_r3351062300
##########
solr/core/src/java/org/apache/solr/util/circuitbreaker/CircuitBreakerRegistry.java:
##########
@@ -210,6 +226,85 @@ public List<CircuitBreaker> checkTripped(SolrRequestType
requestType) {
return triggeredCircuitBreakers;
}
+ /**
+ * Check globally-registered (process-wide) circuit breakers for the given
request type. Warn-only
+ * breakers are excluded from the result so that callers performing
admission control do not act
+ * on them.
+ *
+ * <p>Filter-level callers (e.g. {@link
org.apache.solr.servlet.SolrQoSFilter}) do not have
+ * per-core context and therefore consult only the static global map.
+ *
+ * @return null if no global breakers are registered for {@code requestType}
or none are tripped;
+ * otherwise a non-empty list of tripped breakers
+ */
+ public static List<CircuitBreaker> checkTrippedGlobal(SolrRequestType
requestType) {
+ return selectTripped(globalCircuitBreakerMap.get(requestType));
+ }
+
+ /**
+ * Per-core variant of {@link #checkTrippedGlobal(SolrRequestType)} that
consults only this
+ * registry's local breakers. Warn-only breakers are excluded.
+ */
+ public List<CircuitBreaker> checkTrippedLocal(SolrRequestType requestType) {
+ List<CircuitBreaker> snapshot;
+ synchronized (circuitBreakerMap) {
+ List<CircuitBreaker> breakersOfType = circuitBreakerMap.get(requestType);
Review Comment:
Why not just change `circuitBreakerMap` to be a `ConcurrentHashMap`?
##########
solr/core/src/java/org/apache/solr/util/circuitbreaker/CircuitBreakerRegistry.java:
##########
@@ -210,6 +226,85 @@ public List<CircuitBreaker> checkTripped(SolrRequestType
requestType) {
return triggeredCircuitBreakers;
}
+ /**
+ * Check globally-registered (process-wide) circuit breakers for the given
request type. Warn-only
+ * breakers are excluded from the result so that callers performing
admission control do not act
+ * on them.
+ *
+ * <p>Filter-level callers (e.g. {@link
org.apache.solr.servlet.SolrQoSFilter}) do not have
+ * per-core context and therefore consult only the static global map.
+ *
+ * @return null if no global breakers are registered for {@code requestType}
or none are tripped;
+ * otherwise a non-empty list of tripped breakers
+ */
+ public static List<CircuitBreaker> checkTrippedGlobal(SolrRequestType
requestType) {
+ return selectTripped(globalCircuitBreakerMap.get(requestType));
+ }
+
+ /**
+ * Per-core variant of {@link #checkTrippedGlobal(SolrRequestType)} that
consults only this
+ * registry's local breakers. Warn-only breakers are excluded.
+ */
+ public List<CircuitBreaker> checkTrippedLocal(SolrRequestType requestType) {
+ List<CircuitBreaker> snapshot;
+ synchronized (circuitBreakerMap) {
+ List<CircuitBreaker> breakersOfType = circuitBreakerMap.get(requestType);
+ snapshot = (breakersOfType == null) ? null : new
ArrayList<>(breakersOfType);
+ }
+ return selectTripped(snapshot);
+ }
+
+ private static List<CircuitBreaker> selectTripped(List<CircuitBreaker>
breakers) {
+ if (breakers == null || breakers.isEmpty()) {
+ return null;
+ }
+ List<CircuitBreaker> tripped = null;
+ for (CircuitBreaker cb : breakers) {
+ if (cb.isTripped() && !cb.isWarnOnly()) {
+ if (tripped == null) {
+ tripped = new ArrayList<>();
+ }
+ tripped.add(cb);
+ }
+ }
+ return tripped;
+ }
+
+ /**
+ * Filter-level admission control helper: combine globally-registered and
every
+ * per-core-registered circuit breaker for {@code requestType}. Returns null
if nothing is
+ * tripped, otherwise a non-empty list of tripped breakers.
+ *
+ * <p>Iteration is over all cores in {@code cc}. If any core has a tripped
breaker for the type,
+ * the request is treated as tripped cluster-wide; this is intentionally
conservative since the
+ * filter does not yet know which core a request will resolve to.
Review Comment:
Doesn't this make the core-scoped CBs partially global at this point? If the
filter doesn't know yet which core a request will resolve to, why doesn't it
wait to check the CBs then?
##########
solr/core/src/java/org/apache/solr/util/circuitbreaker/GcOverheadCircuitBreaker.java:
##########
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.util.circuitbreaker;
+
+import java.lang.management.GarbageCollectorMXBean;
+import java.lang.management.ManagementFactory;
+import java.util.List;
+import java.util.Locale;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicReference;
+import org.apache.solr.common.util.EnvUtils;
+
+/**
+ * Trips when the JVM is spending more than a configured percentage of
wall-clock time in garbage
+ * collection over a sliding window.
+ *
+ * <p>Complementary to {@link MemoryCircuitBreaker}: that breaker fires when
post-GC live data is
+ * exhausting the heap; this one fires when GC is keeping up (live data may
even be small) but
+ * consuming so much CPU that the application is starving. Both conditions
usually precede an OOM
+ * but each catches the other's blind spot.
+ *
+ * <p>The percentage is computed as
+ *
+ * <pre>{@code
+ * sum(GarbageCollectorMXBean.getCollectionTime()) over window
+ * ───────────────────────────────────────────────────────── × 100
+ * wall-clock window
+ * }</pre>
+ *
+ * <p>The window length defaults to {@link #DEFAULT_WINDOW_SECONDS} seconds
and is configurable via
+ * {@value #SYSPROP_WINDOW_SECONDS}. Within one window the breaker reports the
GC ratio over the
+ * time elapsed since the anchor sample; once the window length is exceeded,
the anchor slides
+ * forward to the most recent sample so the ratio reflects only recent
behavior.
+ *
+ * <p>{@link #isTripped()} is rate-limited by the same {@code
TtlSampledMetric} used by the
+ * load-average and CPU breakers ({@value
CircuitBreakerRegistry#SYSPROP_SAMPLE_TTL_MS}), so
+ * concurrent admission-control callers share one ratio computation per cache
window.
+ */
+public class GcOverheadCircuitBreaker extends CircuitBreaker {
+
+ public static final String SYSPROP_WINDOW_SECONDS =
+ CircuitBreakerRegistry.SYSPROP_PREFIX + "gcoverhead.windowSeconds";
+ public static final long DEFAULT_WINDOW_SECONDS = 30L;
+
+ private static final List<GarbageCollectorMXBean> GC_BEANS =
+ List.copyOf(ManagementFactory.getGarbageCollectorMXBeans());
+
+ private double overheadThresholdPercent;
+ private final long windowNanos;
+ private final TtlSampledMetric ratioCache;
+ private final AtomicReference<Sample> anchor = new AtomicReference<>();
+
+ private static final ThreadLocal<Double> seenOverheadPercent =
ThreadLocal.withInitial(() -> 0.0);
+ private static final ThreadLocal<Double> allowedOverheadPercent =
+ ThreadLocal.withInitial(() -> 0.0);
+
+ public GcOverheadCircuitBreaker() {
+ super();
Review Comment:
```suggestion
```
##########
solr/core/src/java/org/apache/solr/util/circuitbreaker/GcOverheadCircuitBreaker.java:
##########
Review Comment:
So I like where this is going, but I'm afraid it's a little heavy-handed.
Right now we will basically stop all requests for 30s if GC is a lot. and then
see after 30s if it has gone down. Instead should we have a poll rate and use a
Dequeue to keep track of a rolling window of GC times (and every poll time we
subtract the first entry and add the last entry to our numbers)? That way we
can say check every 100ms, but still use a rolling window of 30s to get a rate.
##########
solr/core/src/java/org/apache/solr/servlet/SolrQoSFilter.java:
##########
@@ -0,0 +1,733 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.servlet;
+
+import com.google.common.annotations.VisibleForTesting;
+import io.opentelemetry.api.metrics.LongCounter;
+import io.opentelemetry.api.metrics.ObservableLongGauge;
+import jakarta.servlet.AsyncContext;
+import jakarta.servlet.AsyncEvent;
+import jakarta.servlet.AsyncListener;
+import jakarta.servlet.FilterChain;
+import jakarta.servlet.FilterConfig;
+import jakarta.servlet.ServletException;
+import jakarta.servlet.UnavailableException;
+import jakarta.servlet.http.HttpServletRequest;
+import jakarta.servlet.http.HttpServletResponse;
+import java.io.IOException;
+import java.lang.invoke.MethodHandles;
+import java.util.List;
+import java.util.Locale;
+import java.util.Queue;
+import java.util.concurrent.ConcurrentLinkedQueue;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.ScheduledFuture;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicInteger;
+import java.util.concurrent.atomic.AtomicLong;
+import org.apache.solr.client.solrj.SolrRequest.SolrRequestType;
+import org.apache.solr.common.params.CommonParams;
+import org.apache.solr.common.util.EnvUtils;
+import org.apache.solr.core.CoreContainer;
+import org.apache.solr.metrics.SolrMetricManager;
+import org.apache.solr.util.circuitbreaker.CircuitBreaker;
+import org.apache.solr.util.circuitbreaker.CircuitBreakerRegistry;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Quality-of-Service filter that asynchronously suspends requests when
circuit breakers (global or
+ * per-core) are tripped, instead of failing fast with a {@code 429}.
Suspended requests are queued
+ * by priority (and request type) and resumed by a periodic scheduler that
re-checks the {@link
+ * CircuitBreakerRegistry}; once no breaker is tripped for a request's type,
the request is
+ * dispatched to the rest of the chain.
+ *
+ * <p>Designed after Jetty's {@code QoSFilter}/{@code QoSHandler}: instead of
dropping load during
+ * transient spikes (e.g. GC pauses), the filter buffers requests, prioritizes
intra-cluster shard /
+ * admin / health-check traffic over user queries over background updates, and
shed-loads only when
+ * the suspension queue saturates or a request's suspension timeout elapses.
+ *
+ * <p>This filter always synchronously checks circuit breakers on the request
path. When QoS
+ * queueing is enabled (the default; toggle via {@value
#SYSPROP_QOS_ENABLED}), a tripped breaker
+ * causes the request to be suspended (via {@link
HttpServletRequest#startAsync()}) and queued by
+ * priority + type until the breaker clears. When QoS queueing is disabled, a
tripped breaker causes
+ * a synchronous fail-fast {@code 429} — preserving the legacy {@code
SearchHandler} / {@code
+ * ContentStreamHandlerBase} behavior that previously enforced the same check
inside the handlers.
+ *
+ * <p>Internal intra-cluster shard requests (those carrying {@code
Solr-Request-Context: SERVER})
+ * always bypass both the breaker check and suspension to avoid distributed
deadlock — sub-shard
+ * requests cannot be queued waiting on a parent that is itself waiting.
+ *
+ * <p>All filters and the SolrServlet on the request path must declare {@code
+ * <async-supported>true</async-supported>}; otherwise {@link
HttpServletRequest#startAsync()} will
+ * throw at runtime when a circuit breaker first trips.
+ *
+ * @lucene.experimental
+ */
+public class SolrQoSFilter extends CoreContainerAwareHttpFilter {
Review Comment:
This is a lot. Would it be possible to have The Solr RequestHandlerBase
extend Jetty's `QoSHandler`? And we could have a reference to a shared class
that keeps track of the limits together? Just don't want to essentially fork
the Jetty QoS code.
I mean there's the `QoSFilter` we could possibly use too, but I'm not sure
it would be as easy to hook up. And its deprecated.
##########
solr/core/src/java/org/apache/solr/util/circuitbreaker/TtlSampledMetric.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.util.circuitbreaker;
+
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicReference;
+import java.util.function.DoubleSupplier;
+
+/**
+ * Time-bounded cache around an expensive double-valued metric sample.
Concurrent readers within one
+ * TTL window share a single computed value; on cache miss, multiple threads
may each compute before
+ * the latest store wins, which is bounded and acceptable.
+ *
+ * <p>Used by {@link LoadAverageCircuitBreaker} and {@link CPUCircuitBreaker}
so that high-QPS
+ * admission control does not poll OS load-average syscalls or Prometheus
metric scans more often
+ * than the underlying signals can move.
+ */
+final class TtlSampledMetric {
+
+ private final long ttlNanos;
+ private final AtomicReference<Sample> sample = new AtomicReference<>();
+
+ TtlSampledMetric(long ttlMs) {
+ this.ttlNanos = TimeUnit.MILLISECONDS.toNanos(ttlMs);
+ }
+
+ double get(DoubleSupplier source) {
Review Comment:
Might make more sense for the `DoubleSupplier` (or generic `Supplier<T>`) to
be provided in the constructor...
##########
solr/core/src/java/org/apache/solr/util/circuitbreaker/MemoryCircuitBreaker.java:
##########
@@ -96,19 +86,65 @@ public MemoryCircuitBreaker setThreshold(double
thresholdValueInPercentage) {
@Override
public boolean isTripped() {
-
- long localAllowedMemory = getCurrentMemoryThreshold();
+ long localAllowedMemory = heapMemoryThreshold;
long localSeenMemory = getAvgMemoryUsage();
allowedMemory.set(localAllowedMemory);
-
seenMemory.set(localSeenMemory);
- return (localSeenMemory >= localAllowedMemory);
+ return localSeenMemory >= localAllowedMemory;
}
+ /**
+ * Returns post-GC live bytes in the old-gen pool (or the sum across all
heap pools when no
+ * dedicated old-gen pool exists). Cached for {@link
CircuitBreakerRegistry#SYSPROP_SAMPLE_TTL_MS}
+ * so high-QPS callers don't repeatedly walk {@link MemoryPoolMXBean} list.
+ *
+ * <p>The historical name is preserved for source-compatibility with
subclasses that override the
+ * value source for testing; the implementation no longer averages anything.
+ */
protected long getAvgMemoryUsage() {
- return (long) averagingMetricProvider.get().getMetricValue();
+ return (long)
heapLiveCache.get(MemoryCircuitBreaker::samplePostGcLiveBytes);
Review Comment:
So the other CB classes will use the cached value read in `isTripped()` and
leave `getAvgMemoryUsage()` to be where the logic lives to calculate. Should
`getAvgMemoryUsage()` just have the logic of `samplePostGcLiveBytes()` and then
`isTripped()` can call `heapLiveCache.get(this::getAvgMemoryUsage)`?
##########
solr/core/src/java/org/apache/solr/servlet/InternalRequestUtils.java:
##########
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.servlet;
+
+import jakarta.servlet.http.HttpServletRequest;
+import org.apache.solr.client.solrj.SolrRequest;
+import org.apache.solr.common.params.CommonParams;
+
+/**
+ * Shared detection for internal intra-cluster server-to-server requests.
Filter-tier admission
+ * control (e.g. {@link SolrQoSFilter}, {@link RateLimitManager}) consults
this so sub-shard
+ * requests are unconditionally admitted — refusing them would risk
distributed deadlock, since the
+ * parent request is already blocking on the sub-shard.
+ *
+ * <p>Mirrors {@code RequestHandlerBase.isInternalShardRequest} at the HTTP
layer so the filter-tier
+ * definition stays consistent with the handler-tier one.
+ */
+final class InternalRequestUtils {
+
+ private InternalRequestUtils() {}
+
+ /**
+ * Whether {@code req} is an internal intra-cluster shard / admin request.
Identified by any of:
+ *
+ * <ul>
+ * <li>{@code Solr-Request-Context: SERVER} header — set by SolrJ on
server-to-server traffic.
+ * <li>{@code isShard=true} query parameter — set by {@code ShardHandler}
on sub-shard fan-out.
+ * <li>{@code distrib.from=…} query parameter — set by {@code
DistributedUpdateProcessor} when
+ * forwarding update requests between replicas.
+ * </ul>
+ */
+ static boolean isInternalServerRequest(HttpServletRequest req) {
Review Comment:
Isn't this something that an attacker could use to bypass the rate limiters?
##########
solr/core/src/java/org/apache/solr/util/circuitbreaker/CircuitBreakerRegistry.java:
##########
@@ -52,16 +52,27 @@
public class CircuitBreakerRegistry implements Closeable {
private static final Logger log =
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
private static final Pattern SYSPROP_REGEX =
-
Pattern.compile("solr.circuitbreaker\\.(update|query)\\.(cpu|mem|loadavg)");
+
Pattern.compile("solr.circuitbreaker\\.(update|query)\\.(cpu|mem|loadavg|gcoverhead)");
public static final String SYSPROP_PREFIX = "solr.circuitbreaker.";
public static final String SYSPROP_UPDATE_CPU = SYSPROP_PREFIX +
"update.cpu";
public static final String SYSPROP_UPDATE_MEM = SYSPROP_PREFIX +
"update.mem";
public static final String SYSPROP_UPDATE_LOADAVG = SYSPROP_PREFIX +
"update.loadavg";
+ public static final String SYSPROP_UPDATE_GCOVERHEAD = SYSPROP_PREFIX +
"update.gcoverhead";
public static final String SYSPROP_QUERY_CPU = SYSPROP_PREFIX + "query.cpu";
public static final String SYSPROP_QUERY_MEM = SYSPROP_PREFIX + "query.mem";
public static final String SYSPROP_QUERY_LOADAVG = SYSPROP_PREFIX +
"query.loadavg";
+ public static final String SYSPROP_QUERY_GCOVERHEAD = SYSPROP_PREFIX +
"query.gcoverhead";
public static final String SYSPROP_WARN_ONLY_SUFFIX = ".warnonly";
+ /**
+ * Default TTL (ms) of cached load-average / CPU samples consulted by {@link
+ * LoadAverageCircuitBreaker} and {@link CPUCircuitBreaker}. Override
per-process via the {@value
+ * #SYSPROP_SAMPLE_TTL_MS} system property.
+ */
+ public static final long DEFAULT_SAMPLE_TTL_MS = 1000L;
Review Comment:
I would probably change to 100ms?
##########
solr/core/src/java/org/apache/solr/servlet/InternalRequestUtils.java:
##########
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.servlet;
+
+import jakarta.servlet.http.HttpServletRequest;
+import org.apache.solr.client.solrj.SolrRequest;
+import org.apache.solr.common.params.CommonParams;
+
+/**
+ * Shared detection for internal intra-cluster server-to-server requests.
Filter-tier admission
+ * control (e.g. {@link SolrQoSFilter}, {@link RateLimitManager}) consults
this so sub-shard
+ * requests are unconditionally admitted — refusing them would risk
distributed deadlock, since the
+ * parent request is already blocking on the sub-shard.
+ *
+ * <p>Mirrors {@code RequestHandlerBase.isInternalShardRequest} at the HTTP
layer so the filter-tier
+ * definition stays consistent with the handler-tier one.
+ */
+final class InternalRequestUtils {
+
+ private InternalRequestUtils() {}
+
+ /**
+ * Whether {@code req} is an internal intra-cluster shard / admin request.
Identified by any of:
+ *
+ * <ul>
+ * <li>{@code Solr-Request-Context: SERVER} header — set by SolrJ on
server-to-server traffic.
+ * <li>{@code isShard=true} query parameter — set by {@code ShardHandler}
on sub-shard fan-out.
+ * <li>{@code distrib.from=…} query parameter — set by {@code
DistributedUpdateProcessor} when
+ * forwarding update requests between replicas.
+ * </ul>
+ */
+ static boolean isInternalServerRequest(HttpServletRequest req) {
Review Comment:
Ok I see this isn't something new. You are just restructuring.
##########
solr/core/src/java/org/apache/solr/util/circuitbreaker/GcOverheadCircuitBreaker.java:
##########
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.util.circuitbreaker;
+
+import java.lang.management.GarbageCollectorMXBean;
+import java.lang.management.ManagementFactory;
+import java.util.List;
+import java.util.Locale;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicReference;
+import org.apache.solr.common.util.EnvUtils;
+
+/**
+ * Trips when the JVM is spending more than a configured percentage of
wall-clock time in garbage
+ * collection over a sliding window.
+ *
+ * <p>Complementary to {@link MemoryCircuitBreaker}: that breaker fires when
post-GC live data is
+ * exhausting the heap; this one fires when GC is keeping up (live data may
even be small) but
+ * consuming so much CPU that the application is starving. Both conditions
usually precede an OOM
+ * but each catches the other's blind spot.
+ *
+ * <p>The percentage is computed as
+ *
+ * <pre>{@code
+ * sum(GarbageCollectorMXBean.getCollectionTime()) over window
+ * ───────────────────────────────────────────────────────── × 100
+ * wall-clock window
+ * }</pre>
+ *
+ * <p>The window length defaults to {@link #DEFAULT_WINDOW_SECONDS} seconds
and is configurable via
+ * {@value #SYSPROP_WINDOW_SECONDS}. Within one window the breaker reports the
GC ratio over the
+ * time elapsed since the anchor sample; once the window length is exceeded,
the anchor slides
+ * forward to the most recent sample so the ratio reflects only recent
behavior.
+ *
+ * <p>{@link #isTripped()} is rate-limited by the same {@code
TtlSampledMetric} used by the
+ * load-average and CPU breakers ({@value
CircuitBreakerRegistry#SYSPROP_SAMPLE_TTL_MS}), so
+ * concurrent admission-control callers share one ratio computation per cache
window.
+ */
+public class GcOverheadCircuitBreaker extends CircuitBreaker {
+
+ public static final String SYSPROP_WINDOW_SECONDS =
+ CircuitBreakerRegistry.SYSPROP_PREFIX + "gcoverhead.windowSeconds";
+ public static final long DEFAULT_WINDOW_SECONDS = 30L;
+
+ private static final List<GarbageCollectorMXBean> GC_BEANS =
+ List.copyOf(ManagementFactory.getGarbageCollectorMXBeans());
+
+ private double overheadThresholdPercent;
+ private final long windowNanos;
+ private final TtlSampledMetric ratioCache;
+ private final AtomicReference<Sample> anchor = new AtomicReference<>();
+
+ private static final ThreadLocal<Double> seenOverheadPercent =
ThreadLocal.withInitial(() -> 0.0);
+ private static final ThreadLocal<Double> allowedOverheadPercent =
+ ThreadLocal.withInitial(() -> 0.0);
+
+ public GcOverheadCircuitBreaker() {
+ super();
+ long windowSeconds = EnvUtils.getPropertyAsLong(SYSPROP_WINDOW_SECONDS,
DEFAULT_WINDOW_SECONDS);
+ if (windowSeconds <= 0) {
+ throw new IllegalArgumentException(
+ "Invalid GC overhead window: " + windowSeconds + "s (must be
positive)");
+ }
+ this.windowNanos = TimeUnit.SECONDS.toNanos(windowSeconds);
+ this.ratioCache =
+ new TtlSampledMetric(
+ EnvUtils.getPropertyAsLong(
+ CircuitBreakerRegistry.SYSPROP_SAMPLE_TTL_MS,
+ CircuitBreakerRegistry.DEFAULT_SAMPLE_TTL_MS));
+ }
+
+ public GcOverheadCircuitBreaker setThreshold(double
thresholdValueInPercentage) {
+ if (thresholdValueInPercentage <= 0 || thresholdValueInPercentage > 100) {
+ throw new IllegalArgumentException(
+ "GC overhead threshold must be in (0, 100]; got " +
thresholdValueInPercentage);
+ }
+ this.overheadThresholdPercent = thresholdValueInPercentage;
+ return this;
+ }
+
+ public double getOverheadThresholdPercent() {
+ return overheadThresholdPercent;
+ }
+
+ @Override
+ public boolean isTripped() {
+ double localAllowed = overheadThresholdPercent;
+ double localSeen = ratioCache.get(this::computeGcOverheadPercent);
+
+ allowedOverheadPercent.set(localAllowed);
+ seenOverheadPercent.set(localSeen);
+
+ return localSeen >= localAllowed;
Review Comment:
If it's called `allowed` then shouldn't we be checking that `localSeen >
localAllowed`? Its weird that `seen == allowed` would be an error.
##########
solr/core/src/java/org/apache/solr/servlet/RateLimitManager.java:
##########
Review Comment:
What's the point of this class now that `SolrQoSFilter` exists? Is it
back-compat? Is it just a lighter weight version?
##########
solr/core/src/java/org/apache/solr/servlet/SolrQoSFilter.java:
##########
@@ -0,0 +1,733 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.servlet;
+
+import com.google.common.annotations.VisibleForTesting;
+import io.opentelemetry.api.metrics.LongCounter;
+import io.opentelemetry.api.metrics.ObservableLongGauge;
+import jakarta.servlet.AsyncContext;
+import jakarta.servlet.AsyncEvent;
+import jakarta.servlet.AsyncListener;
+import jakarta.servlet.FilterChain;
+import jakarta.servlet.FilterConfig;
+import jakarta.servlet.ServletException;
+import jakarta.servlet.UnavailableException;
+import jakarta.servlet.http.HttpServletRequest;
+import jakarta.servlet.http.HttpServletResponse;
+import java.io.IOException;
+import java.lang.invoke.MethodHandles;
+import java.util.List;
+import java.util.Locale;
+import java.util.Queue;
+import java.util.concurrent.ConcurrentLinkedQueue;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.ScheduledFuture;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicInteger;
+import java.util.concurrent.atomic.AtomicLong;
+import org.apache.solr.client.solrj.SolrRequest.SolrRequestType;
+import org.apache.solr.common.params.CommonParams;
+import org.apache.solr.common.util.EnvUtils;
+import org.apache.solr.core.CoreContainer;
+import org.apache.solr.metrics.SolrMetricManager;
+import org.apache.solr.util.circuitbreaker.CircuitBreaker;
+import org.apache.solr.util.circuitbreaker.CircuitBreakerRegistry;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Quality-of-Service filter that asynchronously suspends requests when
circuit breakers (global or
+ * per-core) are tripped, instead of failing fast with a {@code 429}.
Suspended requests are queued
+ * by priority (and request type) and resumed by a periodic scheduler that
re-checks the {@link
+ * CircuitBreakerRegistry}; once no breaker is tripped for a request's type,
the request is
+ * dispatched to the rest of the chain.
+ *
+ * <p>Designed after Jetty's {@code QoSFilter}/{@code QoSHandler}: instead of
dropping load during
+ * transient spikes (e.g. GC pauses), the filter buffers requests, prioritizes
intra-cluster shard /
+ * admin / health-check traffic over user queries over background updates, and
shed-loads only when
+ * the suspension queue saturates or a request's suspension timeout elapses.
+ *
+ * <p>This filter always synchronously checks circuit breakers on the request
path. When QoS
+ * queueing is enabled (the default; toggle via {@value
#SYSPROP_QOS_ENABLED}), a tripped breaker
+ * causes the request to be suspended (via {@link
HttpServletRequest#startAsync()}) and queued by
+ * priority + type until the breaker clears. When QoS queueing is disabled, a
tripped breaker causes
+ * a synchronous fail-fast {@code 429} — preserving the legacy {@code
SearchHandler} / {@code
+ * ContentStreamHandlerBase} behavior that previously enforced the same check
inside the handlers.
+ *
+ * <p>Internal intra-cluster shard requests (those carrying {@code
Solr-Request-Context: SERVER})
+ * always bypass both the breaker check and suspension to avoid distributed
deadlock — sub-shard
+ * requests cannot be queued waiting on a parent that is itself waiting.
+ *
+ * <p>All filters and the SolrServlet on the request path must declare {@code
+ * <async-supported>true</async-supported>}; otherwise {@link
HttpServletRequest#startAsync()} will
+ * throw at runtime when a circuit breaker first trips.
+ *
+ * @lucene.experimental
+ */
+public class SolrQoSFilter extends CoreContainerAwareHttpFilter {
+ private static final Logger log =
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
+
+ public static final String SYSPROP_QOS_ENABLED =
+ CircuitBreakerRegistry.SYSPROP_PREFIX + "qos.enabled";
+ public static final String SYSPROP_MAX_SUSPENDED =
+ CircuitBreakerRegistry.SYSPROP_PREFIX + "qos.maxSuspendedRequests";
+ public static final String SYSPROP_SUSPEND_TIMEOUT_MS =
+ CircuitBreakerRegistry.SYSPROP_PREFIX + "qos.suspendTimeoutMs";
+ public static final String SYSPROP_CHECK_INTERVAL_MS =
+ CircuitBreakerRegistry.SYSPROP_PREFIX + "qos.checkIntervalMs";
+ public static final String SYSPROP_EVAL_INTERVAL_MS =
+ CircuitBreakerRegistry.SYSPROP_PREFIX + "qos.evaluationIntervalMs";
+ public static final String SYSPROP_DRAIN_BUDGET =
+ CircuitBreakerRegistry.SYSPROP_PREFIX + "qos.drainBudget";
+ public static final String SYSPROP_HIGH_SHARE =
+ CircuitBreakerRegistry.SYSPROP_PREFIX + "qos.priority.high.share";
+ public static final String SYSPROP_MEDIUM_SHARE =
+ CircuitBreakerRegistry.SYSPROP_PREFIX + "qos.priority.medium.share";
+ public static final String SYSPROP_HIGH_DRAIN_MULTIPLIER =
+ CircuitBreakerRegistry.SYSPROP_PREFIX + "qos.drainBudget.highMultiplier";
+ public static final String SYSPROP_LOW_DRAIN_MULTIPLIER =
+ CircuitBreakerRegistry.SYSPROP_PREFIX + "qos.drainBudget.lowMultiplier";
+
+ /**
+ * Optional client-supplied priority hint. Values are case-insensitive
{@code HIGH}, {@code
+ * MEDIUM}, or {@code LOW}. Unknown values are ignored and the filter falls
back to its path /
+ * type heuristic. Trusted callers (internal admin tooling, well-behaved
SolrJ extensions) may use
+ * this to override the default lane assignment without the filter having to
inspect the request
+ * body.
+ */
+ public static final String HEADER_REQUEST_PRIORITY = "Solr-Request-Priority";
+
+ static final int DEFAULT_MAX_SUSPENDED_REQUESTS = 1024;
+ static final long DEFAULT_SUSPEND_TIMEOUT_MS = 5000L;
+ static final long DEFAULT_CHECK_INTERVAL_MS = 100L;
+ static final long DEFAULT_EVAL_INTERVAL_MS = 200L;
+ static final int DEFAULT_DRAIN_BUDGET = 200;
+ static final double DEFAULT_HIGH_SHARE = 0.10;
+ static final double DEFAULT_MEDIUM_SHARE = 0.60;
+ static final double DEFAULT_HIGH_DRAIN_MULTIPLIER = 4.0;
+ static final double DEFAULT_LOW_DRAIN_MULTIPLIER = 0.5;
+
+ static final int PRIORITY_HIGH = 2; // admin/ping/health-check
+ static final int PRIORITY_MEDIUM = 1; // user QUERY
+ static final int PRIORITY_LOW = 0; // UPDATE
+ static final int N_PRIORITIES = 3;
+
+ static final int TYPE_QUERY = 0;
+ static final int TYPE_UPDATE = 1;
+ static final int N_TYPES = 2;
+
+ private static final String ATTR_RESUMED = SolrQoSFilter.class.getName() +
".resumed";
+
+ private static final String METRIC_PREFIX = "qos.";
+
+ private boolean enabled;
+ private int maxSuspendedRequests;
+ private long suspendTimeoutMs;
+ private long checkIntervalMs;
+ private long evaluationIntervalNanos;
+ private int drainBudget;
+ private final int[] priorityCap = new int[N_PRIORITIES];
+ private final int[] priorityDrainBudget = new int[N_PRIORITIES];
+
+ // Cache of the last circuit-breaker scan per request type. Several
admission-control checks
+ // (load average, CPU) are expensive and sourced from metrics that don't
update faster than
+ // ~once per second anyway, so polling them per-request scales badly under
load. We share
+ // one evaluation across all callers within evaluationIntervalNanos.
+ private volatile TrippedScan queryScan;
+ private volatile TrippedScan updateScan;
+
+ private final Queue<AsyncContext>[][] queues;
+ private final AtomicInteger suspendedCount = new AtomicInteger();
+ private final AtomicInteger[] suspendedByPriority = new
AtomicInteger[N_PRIORITIES];
+ private final AtomicLong totalSuspended = new AtomicLong();
+ private final AtomicLong totalResumed = new AtomicLong();
+ private final AtomicLong totalExpired = new AtomicLong();
+ private final AtomicLong totalRejected = new AtomicLong();
+
+ private ScheduledExecutorService scheduler;
+ private ScheduledFuture<?> drainTask;
+
+ private LongCounter metricSuspended;
+ private LongCounter metricResumed;
+ private LongCounter metricExpired;
+ private LongCounter metricRejected;
+ private ObservableLongGauge metricCurrentSuspendedGauge;
+
+ // Test seam: when non-null, used in place of {@link #getCores()} so unit
tests can run
+ // without spinning up a full CoreContainerProvider/servlet context.
+ private CoreContainer testCoreContainer;
+ private boolean skipContainerLookup;
+
+ @SuppressWarnings({"unchecked", "rawtypes"})
+ public SolrQoSFilter() {
+ queues = new Queue[N_PRIORITIES][N_TYPES];
+ for (int p = 0; p < N_PRIORITIES; p++) {
+ suspendedByPriority[p] = new AtomicInteger();
+ for (int t = 0; t < N_TYPES; t++) {
+ queues[p][t] = new ConcurrentLinkedQueue<>();
+ }
+ }
+ }
+
+ private void registerMetrics() {
+ SolrMetricManager mm;
+ try {
+ CoreContainer cc = resolveCoreContainer();
+ mm = (cc == null) ? null : cc.getMetricManager();
+ } catch (UnavailableException e) {
+ log.warn("CoreContainer unavailable at SolrQoSFilter init; metrics not
registered", e);
+ return;
+ }
+ if (mm == null) {
+ return;
+ }
+ String registry = SolrMetricManager.NODE_REGISTRY;
+ metricSuspended =
+ mm.longCounter(
+ registry,
+ METRIC_PREFIX + "suspended.total",
+ "Requests suspended by SolrQoSFilter waiting for a tripped circuit
breaker to clear",
+ null);
+ metricResumed =
+ mm.longCounter(
+ registry,
+ METRIC_PREFIX + "resumed.total",
+ "Suspended requests dispatched by SolrQoSFilter once their circuit
breaker cleared",
+ null);
+ metricExpired =
+ mm.longCounter(
+ registry,
+ METRIC_PREFIX + "expired.total",
+ "Suspended requests rejected by SolrQoSFilter after the suspension
timeout elapsed",
+ null);
+ metricRejected =
+ mm.longCounter(
+ registry,
+ METRIC_PREFIX + "rejected.total",
+ "Requests rejected fast by SolrQoSFilter because the suspension
queue was full",
+ null);
+ metricCurrentSuspendedGauge =
+ mm.observableLongGauge(
+ registry,
+ METRIC_PREFIX + "suspended.current",
+ "Requests currently suspended by SolrQoSFilter",
+ measurement -> measurement.record(suspendedCount.get()),
+ null);
+ }
+
+ @Override
+ public void destroy() {
+ if (drainTask != null) {
+ drainTask.cancel(false);
+ }
+ if (scheduler != null) {
+ scheduler.shutdownNow();
+ }
+ if (metricCurrentSuspendedGauge != null) {
+ metricCurrentSuspendedGauge.close();
+ }
+ super.destroy();
+ }
+
+ @Override
+ protected void doFilter(HttpServletRequest req, HttpServletResponse res,
FilterChain chain)
+ throws IOException, ServletException {
+ if (Boolean.TRUE.equals(req.getAttribute(ATTR_RESUMED))) {
+ // Resumed dispatch from drainQueue — already passed the breaker check;
let it through.
+ chain.doFilter(req, res);
+ return;
+ }
+
+ if (InternalRequestUtils.isInternalServerRequest(req)) {
+ // Sub-shard requests must not be queued or fail-fast — the parent
request is blocking on
+ // them and refusing them would cause distributed deadlock.
+ chain.doFilter(req, res);
+ return;
+ }
+
+ SolrRequestType type = inferRequestType(req);
+ CoreContainer cc;
+ try {
+ cc = resolveCoreContainer();
+ } catch (UnavailableException ex) {
+ // Container shutting down or not yet up — pass through.
+ chain.doFilter(req, res);
+ return;
+ }
+ List<CircuitBreaker> tripped = trippedCached(cc, type);
+ if (tripped == null) {
+ chain.doFilter(req, res);
+ return;
+ }
+
+ if (!enabled) {
+ // QoS queueing disabled: preserve the legacy SearchHandler /
ContentStreamHandlerBase
+ // behavior of failing fast with a 429 when a breaker is tripped.
+ res.sendError(
+ CircuitBreaker.getExceptionErrorCode().code,
+ CircuitBreakerRegistry.toErrorMessage(tripped));
+ return;
+ }
+
+ if (suspendedCount.get() >= maxSuspendedRequests) {
+ rejectQueueFull(res, "Server overloaded; suspended-request queue full");
+ return;
+ }
+
+ int priority = inferPriority(req, type);
+ int typeIdx = typeIndex(type);
+ if (suspendedByPriority[priority].get() >= priorityCap[priority]) {
+ // Lane saturated — refuse fast, but only at this priority. A flood of
LOW-priority work
+ // cannot starve HIGH-priority probes because each lane has its own cap.
+ rejectQueueFull(res, "Server overloaded; priority lane queue full");
+ return;
+ }
+
+ AsyncContext asyncContext = req.startAsync();
+ asyncContext.setTimeout(suspendTimeoutMs);
+ asyncContext.addListener(new TimeoutListener(priority, typeIdx));
+ suspendedCount.incrementAndGet();
+ suspendedByPriority[priority].incrementAndGet();
+ totalSuspended.incrementAndGet();
+ if (metricSuspended != null) {
+ metricSuspended.add(1);
+ }
+ queues[priority][typeIdx].add(asyncContext);
+ if (log.isDebugEnabled()) {
+ log.debug(
+ "SolrQoSFilter suspended request priority={} type={} ({} now
suspended)",
+ priority,
+ type,
+ suspendedCount.get());
+ }
+ }
+
+ private void rejectQueueFull(HttpServletResponse res, String message) throws
IOException {
+ totalRejected.incrementAndGet();
+ if (metricRejected != null) {
+ metricRejected.add(1);
+ }
+ res.sendError(CircuitBreaker.getExceptionErrorCode().code, message);
+ }
+
+ private void drainAll() {
+ try {
+ CoreContainer cc;
+ try {
+ cc = resolveCoreContainer();
+ } catch (UnavailableException ex) {
+ return;
+ }
+ for (int p = N_PRIORITIES - 1; p >= 0; p--) {
+ for (int t = 0; t < N_TYPES; t++) {
+ drainQueue(queues[p][t], p, typeFromIndex(t), cc,
priorityDrainBudget[p]);
+ }
+ }
+ } catch (Throwable t) {
+ log.error("SolrQoSFilter drain failed", t);
+ }
+ }
+
+ private void drainQueue(
+ Queue<AsyncContext> q, int priority, SolrRequestType type, CoreContainer
cc, int budgetCap) {
+ if (q.isEmpty()) {
+ return;
+ }
+ List<CircuitBreaker> tripped = trippedCached(cc, type);
+ if (tripped != null) {
+ return;
+ }
+ int budget = budgetCap;
+ while (budget-- > 0) {
+ AsyncContext popped = q.poll();
+ if (popped == null) {
+ return;
+ }
+ try {
+ HttpServletRequest req = (HttpServletRequest) popped.getRequest();
+ req.setAttribute(ATTR_RESUMED, Boolean.TRUE);
+ popped.dispatch();
+ suspendedCount.decrementAndGet();
+ suspendedByPriority[priority].decrementAndGet();
+ totalResumed.incrementAndGet();
+ if (metricResumed != null) {
+ metricResumed.add(1);
+ }
+ } catch (IllegalStateException ise) {
+ // Async context already terminal (timed out or errored). The
TimeoutListener already
+ // decremented both counters; just discard the dead entry.
+ if (log.isDebugEnabled()) {
+ log.debug("SolrQoSFilter dispatch raced with completion", ise);
+ }
+ }
+ }
+ }
+
+ /**
+ * Determine request type from the {@code Solr-Request-Type} header, falling
back to a crude path
+ * heuristic. Anything not recognized as UPDATE is treated as QUERY for
circuit-breaker
+ * accounting.
+ *
+ * <p>The path fallback only matches paths containing {@code /update};
everything else (including
+ * admin endpoints such as {@code /solr/admin/collections}, security APIs,
and core admin actions)
+ * is classified as QUERY. The current circuit-breaker registry only
differentiates QUERY from
+ * UPDATE breakers, so this coarseness is acceptable, but if a future
breaker type is added (e.g.
+ * ADMIN) this heuristic will need refinement.
+ */
+ static SolrRequestType inferRequestType(HttpServletRequest req) {
Review Comment:
`HttpSolrCall` also does this kind of work right? Can we have the logic
combined or look similar?
##########
solr/core/src/java/org/apache/solr/util/circuitbreaker/GcOverheadCircuitBreaker.java:
##########
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.util.circuitbreaker;
+
+import java.lang.management.GarbageCollectorMXBean;
+import java.lang.management.ManagementFactory;
+import java.util.List;
+import java.util.Locale;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicReference;
+import org.apache.solr.common.util.EnvUtils;
+
+/**
+ * Trips when the JVM is spending more than a configured percentage of
wall-clock time in garbage
+ * collection over a sliding window.
+ *
+ * <p>Complementary to {@link MemoryCircuitBreaker}: that breaker fires when
post-GC live data is
+ * exhausting the heap; this one fires when GC is keeping up (live data may
even be small) but
+ * consuming so much CPU that the application is starving. Both conditions
usually precede an OOM
+ * but each catches the other's blind spot.
+ *
+ * <p>The percentage is computed as
+ *
+ * <pre>{@code
+ * sum(GarbageCollectorMXBean.getCollectionTime()) over window
+ * ───────────────────────────────────────────────────────── × 100
+ * wall-clock window
+ * }</pre>
+ *
+ * <p>The window length defaults to {@link #DEFAULT_WINDOW_SECONDS} seconds
and is configurable via
+ * {@value #SYSPROP_WINDOW_SECONDS}. Within one window the breaker reports the
GC ratio over the
+ * time elapsed since the anchor sample; once the window length is exceeded,
the anchor slides
+ * forward to the most recent sample so the ratio reflects only recent
behavior.
+ *
+ * <p>{@link #isTripped()} is rate-limited by the same {@code
TtlSampledMetric} used by the
+ * load-average and CPU breakers ({@value
CircuitBreakerRegistry#SYSPROP_SAMPLE_TTL_MS}), so
+ * concurrent admission-control callers share one ratio computation per cache
window.
+ */
+public class GcOverheadCircuitBreaker extends CircuitBreaker {
+
+ public static final String SYSPROP_WINDOW_SECONDS =
+ CircuitBreakerRegistry.SYSPROP_PREFIX + "gcoverhead.windowSeconds";
+ public static final long DEFAULT_WINDOW_SECONDS = 30L;
+
+ private static final List<GarbageCollectorMXBean> GC_BEANS =
+ List.copyOf(ManagementFactory.getGarbageCollectorMXBeans());
+
+ private double overheadThresholdPercent;
+ private final long windowNanos;
+ private final TtlSampledMetric ratioCache;
+ private final AtomicReference<Sample> anchor = new AtomicReference<>();
+
+ private static final ThreadLocal<Double> seenOverheadPercent =
ThreadLocal.withInitial(() -> 0.0);
+ private static final ThreadLocal<Double> allowedOverheadPercent =
+ ThreadLocal.withInitial(() -> 0.0);
+
+ public GcOverheadCircuitBreaker() {
+ super();
+ long windowSeconds = EnvUtils.getPropertyAsLong(SYSPROP_WINDOW_SECONDS,
DEFAULT_WINDOW_SECONDS);
+ if (windowSeconds <= 0) {
+ throw new IllegalArgumentException(
+ "Invalid GC overhead window: " + windowSeconds + "s (must be
positive)");
+ }
+ this.windowNanos = TimeUnit.SECONDS.toNanos(windowSeconds);
+ this.ratioCache =
+ new TtlSampledMetric(
+ EnvUtils.getPropertyAsLong(
+ CircuitBreakerRegistry.SYSPROP_SAMPLE_TTL_MS,
+ CircuitBreakerRegistry.DEFAULT_SAMPLE_TTL_MS));
+ }
+
+ public GcOverheadCircuitBreaker setThreshold(double
thresholdValueInPercentage) {
+ if (thresholdValueInPercentage <= 0 || thresholdValueInPercentage > 100) {
+ throw new IllegalArgumentException(
+ "GC overhead threshold must be in (0, 100]; got " +
thresholdValueInPercentage);
+ }
+ this.overheadThresholdPercent = thresholdValueInPercentage;
+ return this;
+ }
+
+ public double getOverheadThresholdPercent() {
+ return overheadThresholdPercent;
+ }
+
+ @Override
+ public boolean isTripped() {
+ double localAllowed = overheadThresholdPercent;
+ double localSeen = ratioCache.get(this::computeGcOverheadPercent);
+
+ allowedOverheadPercent.set(localAllowed);
+ seenOverheadPercent.set(localSeen);
+
+ return localSeen >= localAllowed;
Review Comment:
Ok, I see this is also done in the memoryCircuitBreaker.
##########
solr/core/src/java/org/apache/solr/util/circuitbreaker/TtlSampledMetric.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.util.circuitbreaker;
+
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicReference;
+import java.util.function.DoubleSupplier;
+
+/**
+ * Time-bounded cache around an expensive double-valued metric sample.
Concurrent readers within one
+ * TTL window share a single computed value; on cache miss, multiple threads
may each compute before
+ * the latest store wins, which is bounded and acceptable.
+ *
+ * <p>Used by {@link LoadAverageCircuitBreaker} and {@link CPUCircuitBreaker}
so that high-QPS
+ * admission control does not poll OS load-average syscalls or Prometheus
metric scans more often
+ * than the underlying signals can move.
+ */
+final class TtlSampledMetric {
+
+ private final long ttlNanos;
+ private final AtomicReference<Sample> sample = new AtomicReference<>();
+
+ TtlSampledMetric(long ttlMs) {
+ this.ttlNanos = TimeUnit.MILLISECONDS.toNanos(ttlMs);
+ }
+
+ double get(DoubleSupplier source) {
Review Comment:
And yeah the class should be typed so that the MemoryCircuitBreaker can just
use a long instead of casting the double.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]