sigram commented on a change in pull request #1737: URL: https://github.com/apache/lucene-solr/pull/1737#discussion_r471407405
########## File path: solr/core/src/java/org/apache/solr/util/circuitbreaker/CPUCircuitBreaker.java ########## @@ -0,0 +1,108 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.solr.util.circuitbreaker; + +import java.lang.invoke.MethodHandles; +import java.lang.management.ManagementFactory; +import java.lang.management.OperatingSystemMXBean; + +import org.apache.solr.core.SolrConfig; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * <p> + * Tracks current CPU usage and triggers if the specified threshold is breached. + * + * This circuit breaker gets the average CPU load over the last minute and uses + * that data to take a decision. Ideally, we should be able to cache the value + * locally and only query once the minute has elapsed. However, that will introduce + * more complexity than the current structure and might not get us major performance + * wins. If this ever becomes a performance bottleneck, that can be considered. + * </p> + * + * <p> + * The configuration to define which mode to use and the trigger threshold are defined in + * solrconfig.xml + * </p> + */ +public class CPUCircuitBreaker extends CircuitBreaker { + private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass()); + private static final OperatingSystemMXBean operatingSystemMXBean = ManagementFactory.getOperatingSystemMXBean(); + + private final boolean isCpuCircuitBreakerEnabled; + private final double cpuUsageThreshold; + + // Assumption -- the value of these parameters will be set correctly before invoking getDebugInfo() + private static final ThreadLocal<Double> seenCPUUsage = new ThreadLocal<>(); + private static final ThreadLocal<Double> allowedCPUUsage = new ThreadLocal<>(); + + public CPUCircuitBreaker(SolrConfig solrConfig) { + super(solrConfig); + + this.isCpuCircuitBreakerEnabled = solrConfig.isCpuCircuitBreakerEnabled; + this.cpuUsageThreshold = solrConfig.cpuCircuitBreakerThreshold; + } + + @Override + public boolean isTripped() { + if (!isEnabled()) { + return false; + } + + if (!isCpuCircuitBreakerEnabled) { + return false; + } + + double localAllowedCPUUsage = getCpuUsageThreshold(); + double localSeenCPUUsage = calculateLiveCPUUsage(); + + if (localSeenCPUUsage < 0) { + if (log.isWarnEnabled()) { + String msg = "Unable to get CPU usage"; + + log.warn(msg); + } + + return false; + } + + allowedCPUUsage.set(localAllowedCPUUsage); + + seenCPUUsage.set(localSeenCPUUsage); + + return (localSeenCPUUsage >= localAllowedCPUUsage); + } + + @Override + public String getDebugInfo() { + if (seenCPUUsage.get() == 0L || allowedCPUUsage.get() == 0L) { Review comment: I think that on `ThreadLocal.get()` you would get a null value and consequently an NPE. You would have to use `ThreadLocal.withInitial(supplier)` to get a `Double(0)` here. ########## File path: solr/core/src/java/org/apache/solr/util/circuitbreaker/CPUCircuitBreaker.java ########## @@ -0,0 +1,108 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.solr.util.circuitbreaker; + +import java.lang.invoke.MethodHandles; +import java.lang.management.ManagementFactory; +import java.lang.management.OperatingSystemMXBean; + +import org.apache.solr.core.SolrConfig; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * <p> + * Tracks current CPU usage and triggers if the specified threshold is breached. + * + * This circuit breaker gets the average CPU load over the last minute and uses + * that data to take a decision. Ideally, we should be able to cache the value + * locally and only query once the minute has elapsed. However, that will introduce + * more complexity than the current structure and might not get us major performance + * wins. If this ever becomes a performance bottleneck, that can be considered. Review comment: Javadoc for OSMxBean says "This method is designed to provide a hint about the system load and may be queried frequently." Not sure what "frequently" means here, though... It will be interesting to see the dynamic behavior of this breaker - I'm somewhat worried that the 1-min average may be too unstable and we end up flip-flopping between the states too often (eg. every dozen requests or so). Depending on the volume of momentary load (eg. an ongoing large merge or split shard) the 1-min average may exceed a threshold even though it doesn't properly reflect a sustained longer-term overload that we should worry about. Average load is a convenient lie ;) OTOH that's the only method we have in the OS MXBean, so we would have to compute that longer-term average ourselves, which is messy... so I guess we'll see . ########## File path: solr/core/src/java/org/apache/solr/util/circuitbreaker/CPUCircuitBreaker.java ########## @@ -0,0 +1,108 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.solr.util.circuitbreaker; + +import java.lang.invoke.MethodHandles; +import java.lang.management.ManagementFactory; +import java.lang.management.OperatingSystemMXBean; + +import org.apache.solr.core.SolrConfig; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * <p> + * Tracks current CPU usage and triggers if the specified threshold is breached. + * + * This circuit breaker gets the average CPU load over the last minute and uses + * that data to take a decision. Ideally, we should be able to cache the value + * locally and only query once the minute has elapsed. However, that will introduce + * more complexity than the current structure and might not get us major performance + * wins. If this ever becomes a performance bottleneck, that can be considered. Review comment: Also, caching doesn't have to be complicated .. you simply check the elapsed time since last request and if it's longer than timeout you refresh the value first, it's probably less than 10 lines of code. ########## File path: solr/core/src/test/org/apache/solr/util/TestCircuitBreaker.java ########## @@ -157,6 +163,57 @@ public void testBuildingMemoryPressure() { } } + public void testFakeCPUCircuitBreaker() { + ExecutorService executor = ExecutorUtil.newMDCAwareCachedThreadPool( + new SolrNamedThreadFactory("TestCircuitBreaker")); + HashMap<String, String> args = new HashMap<String, String>(); + + args.put(QueryParsing.DEFTYPE, CircuitBreaker.NAME); + args.put(CommonParams.FL, "id"); + + AtomicInteger failureCount = new AtomicInteger(); + + try { + removeAllExistingCircuitBreakers(); + + CircuitBreaker circuitBreaker = new FakeCPUCircuitBreaker(h.getCore().getSolrConfig()); + + h.getCore().getCircuitBreakerManager().register(circuitBreaker); + + for (int i = 0; i < 5; i++) { + executor.submit(() -> { + try { + h.query(req("name:\"john smith\"")); + } catch (SolrException e) { Review comment: Currently the `CircuitBreakerManager.toErrorMessage` uses individual breaker's `getDebugInfo` to construct the error message. I think we should improve these individual messages to clearly say why a break was tripped. Currently it's not obvious - eg. when a mem breaker trips it just reports the current and total memory but not the thresholdPct. Maybe we need a different method like `CircuitBreaker.getErrorMessage` ? ########## File path: solr/core/src/java/org/apache/solr/core/SolrConfig.java ########## @@ -229,9 +229,13 @@ private SolrConfig(SolrResourceLoader loader, String name, boolean isConfigsetTr enableLazyFieldLoading = getBool("query/enableLazyFieldLoading", false); useCircuitBreakers = getBool("circuitBreaker/useCircuitBreakers", false); + isCpuCircuitBreakerEnabled = getBool("circuitBreaker/isCpuCircuitBreakerEnabled", false); Review comment: I'm not sure if the boolean flags should always contain `is`, also I generally hate too long names... ;) we already know this is a section for circuit breakers, so the name doesn't have to repeat all of it. How about `cpuBreakerEnabled`, `memoryBreakerEnabled` etc? ########## File path: solr/core/src/java/org/apache/solr/util/circuitbreaker/MemoryCircuitBreaker.java ########## @@ -43,6 +43,7 @@ private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass()); private static final MemoryMXBean MEMORY_MX_BEAN = ManagementFactory.getMemoryMXBean(); + private boolean isMemoryCircuitBreakerEnabled; Review comment: We know it's a boolean and it's in the MemoryCircuitBreaker, why not simply call it `enabled` (like many other Solr plugins do)? ########## File path: solr/solr-ref-guide/src/circuit-breakers.adoc ########## @@ -35,33 +35,67 @@ will be disabled globally. Per circuit breaker configurations are specified in t <useCircuitBreakers>false</useCircuitBreakers> ---- +This flag acts as the highest authority and global controller of circuit breakers. For using specific circuit breakers, each one +needs to be individually enabled in addition to this flag being enabled. + == Currently Supported Circuit Breakers === JVM Heap Usage Based Circuit Breaker This circuit breaker tracks JVM heap memory usage and rejects incoming search requests with a 503 error code if the heap usage exceeds a configured percentage of maximum heap allocated to the JVM (-Xmx). The main configuration for this circuit breaker is controlling the threshold percentage at which the breaker will trip. -It does not logically make sense to have a threshold below 50% and above 95% of the max heap allocated to the JVM. Hence, the range -of valid values for this parameter is [50, 95], both inclusive. +To enable/disable JVM heap usage based circuit breaker, use the following configuration: + +[source,xml] +---- +<isMemoryCircuitBreakerEnabled>true</isMemoryCircuitBreakerEnabled> +---- + +Note that this configuration will be overridden by the global circuit breaker flag -- if circuit breakers are disabled, this flag +will not help you. + +To set the triggering threshold as a percentage of the max heap allocated to the JVM, use the following parameter. [source,xml] ---- <memoryCircuitBreakerThresholdPct>75</memoryCircuitBreakerThresholdPct> ---- +It does not logically make sense to have a threshold below 50% and above 95% of the max heap allocated to the JVM. Hence, the range +of valid values for this parameter is [50, 95], both inclusive. Consider the following example: JVM has been allocated a maximum heap of 5GB (-Xmx) and memoryCircuitBreakerThresholdPct is set to 75. In this scenario, the heap usage at which the circuit breaker will trip is 3.75GB. -Note that this circuit breaker is checked for each incoming search request and considers the current heap usage of the node, i.e every search -request will get the live heap usage and compare it against the set memory threshold. The check does not impact performance, -but any performance regressions that are suspected to be caused by this feature should be reported to the dev list. +=== CPU Utilization Based Circuit Breaker +This circuit breaker tracks CPU utilization and triggers if the average CPU utilization over the last minute +exceeds a configurable threshold. + +To enable/disable CPU utilization based circuit breaker, use the following configuration: + +[source,xml] +---- +<isCpuCircuitBreakerEnabled>true</isCpuCircuitBreakerEnabled> +---- + +Note that this configuration will be overridden by the global circuit breaker flag -- if circuit breakers are disabled, this flag +will not help you. + +To set the triggering threshold in units of CPU utilization, use the following parameter. +[source,xml] +---- +<cpuCircuitBreakerThreshold>75</cpuCircuitBreakerThreshold> +---- +The range of valid values for this parameter is [40 to infinity]. Review comment: This is still not removed in the latest rev? Also, we can be more helpful here: maybe add something like "typically 1-min average CPU load may often reach values of 10s for a busy machine, but values consistently above 20 and higher indicate progressively serious contention." - or something like that. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org