sigram commented on a change in pull request #1737:
URL: https://github.com/apache/lucene-solr/pull/1737#discussion_r471407405



##########
File path: 
solr/core/src/java/org/apache/solr/util/circuitbreaker/CPUCircuitBreaker.java
##########
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.util.circuitbreaker;
+
+import java.lang.invoke.MethodHandles;
+import java.lang.management.ManagementFactory;
+import java.lang.management.OperatingSystemMXBean;
+
+import org.apache.solr.core.SolrConfig;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * <p>
+ * Tracks current CPU usage and triggers if the specified threshold is 
breached.
+ *
+ * This circuit breaker gets the average CPU load over the last minute and uses
+ * that data to take a decision. Ideally, we should be able to cache the value
+ * locally and only query once the minute has elapsed. However, that will 
introduce
+ * more complexity than the current structure and might not get us major 
performance
+ * wins. If this ever becomes a performance bottleneck, that can be considered.
+ * </p>
+ *
+ * <p>
+ * The configuration to define which mode to use and the trigger threshold are 
defined in
+ * solrconfig.xml
+ * </p>
+ */
+public class CPUCircuitBreaker extends CircuitBreaker {
+  private static final Logger log = 
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
+  private static final OperatingSystemMXBean operatingSystemMXBean = 
ManagementFactory.getOperatingSystemMXBean();
+
+  private final boolean isCpuCircuitBreakerEnabled;
+  private final double cpuUsageThreshold;
+
+  // Assumption -- the value of these parameters will be set correctly before 
invoking getDebugInfo()
+  private static final ThreadLocal<Double> seenCPUUsage = new ThreadLocal<>();
+  private static final ThreadLocal<Double> allowedCPUUsage = new 
ThreadLocal<>();
+
+  public CPUCircuitBreaker(SolrConfig solrConfig) {
+    super(solrConfig);
+
+    this.isCpuCircuitBreakerEnabled = solrConfig.isCpuCircuitBreakerEnabled;
+    this.cpuUsageThreshold = solrConfig.cpuCircuitBreakerThreshold;
+  }
+
+  @Override
+  public boolean isTripped() {
+    if (!isEnabled()) {
+      return false;
+    }
+
+    if (!isCpuCircuitBreakerEnabled) {
+      return false;
+    }
+
+    double localAllowedCPUUsage = getCpuUsageThreshold();
+    double localSeenCPUUsage = calculateLiveCPUUsage();
+
+    if (localSeenCPUUsage < 0) {
+      if (log.isWarnEnabled()) {
+        String msg = "Unable to get CPU usage";
+
+        log.warn(msg);
+      }
+
+      return false;
+    }
+
+    allowedCPUUsage.set(localAllowedCPUUsage);
+
+    seenCPUUsage.set(localSeenCPUUsage);
+
+    return (localSeenCPUUsage >= localAllowedCPUUsage);
+  }
+
+  @Override
+  public String getDebugInfo() {
+    if (seenCPUUsage.get() == 0L || allowedCPUUsage.get() == 0L) {

Review comment:
       I think that on `ThreadLocal.get()` you would get a null value and 
consequently an NPE. You would have to use `ThreadLocal.withInitial(supplier)` 
to get a `Double(0)` here.

##########
File path: 
solr/core/src/java/org/apache/solr/util/circuitbreaker/CPUCircuitBreaker.java
##########
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.util.circuitbreaker;
+
+import java.lang.invoke.MethodHandles;
+import java.lang.management.ManagementFactory;
+import java.lang.management.OperatingSystemMXBean;
+
+import org.apache.solr.core.SolrConfig;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * <p>
+ * Tracks current CPU usage and triggers if the specified threshold is 
breached.
+ *
+ * This circuit breaker gets the average CPU load over the last minute and uses
+ * that data to take a decision. Ideally, we should be able to cache the value
+ * locally and only query once the minute has elapsed. However, that will 
introduce
+ * more complexity than the current structure and might not get us major 
performance
+ * wins. If this ever becomes a performance bottleneck, that can be considered.

Review comment:
       Javadoc for OSMxBean says "This method is designed to provide a hint 
about the system load and may be queried frequently." Not sure what 
"frequently" means here, though...
   
   It will be interesting to see the dynamic behavior of this breaker - I'm 
somewhat worried that the 1-min average may be too unstable and we end up 
flip-flopping between the states too often (eg. every dozen requests or so). 
Depending on the volume of momentary load (eg. an ongoing large merge or split 
shard) the 1-min average may exceed a threshold even though it doesn't properly 
reflect a sustained longer-term overload that we should worry about. Average 
load is a convenient lie ;)
   
   OTOH that's the only method we have in the OS MXBean, so we would have to 
compute that longer-term average ourselves, which is messy... so I guess we'll 
see .

##########
File path: 
solr/core/src/java/org/apache/solr/util/circuitbreaker/CPUCircuitBreaker.java
##########
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.util.circuitbreaker;
+
+import java.lang.invoke.MethodHandles;
+import java.lang.management.ManagementFactory;
+import java.lang.management.OperatingSystemMXBean;
+
+import org.apache.solr.core.SolrConfig;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * <p>
+ * Tracks current CPU usage and triggers if the specified threshold is 
breached.
+ *
+ * This circuit breaker gets the average CPU load over the last minute and uses
+ * that data to take a decision. Ideally, we should be able to cache the value
+ * locally and only query once the minute has elapsed. However, that will 
introduce
+ * more complexity than the current structure and might not get us major 
performance
+ * wins. If this ever becomes a performance bottleneck, that can be considered.

Review comment:
       Also, caching doesn't have to be complicated .. you simply check the 
elapsed time since last request and if it's longer than timeout you refresh the 
value first, it's probably less than 10 lines of code.

##########
File path: solr/core/src/test/org/apache/solr/util/TestCircuitBreaker.java
##########
@@ -157,6 +163,57 @@ public void testBuildingMemoryPressure() {
     }
   }
 
+  public void testFakeCPUCircuitBreaker() {
+    ExecutorService executor = ExecutorUtil.newMDCAwareCachedThreadPool(
+        new SolrNamedThreadFactory("TestCircuitBreaker"));
+    HashMap<String, String> args = new HashMap<String, String>();
+
+    args.put(QueryParsing.DEFTYPE, CircuitBreaker.NAME);
+    args.put(CommonParams.FL, "id");
+
+    AtomicInteger failureCount = new AtomicInteger();
+
+    try {
+      removeAllExistingCircuitBreakers();
+
+      CircuitBreaker circuitBreaker = new 
FakeCPUCircuitBreaker(h.getCore().getSolrConfig());
+
+      h.getCore().getCircuitBreakerManager().register(circuitBreaker);
+
+      for (int i = 0; i < 5; i++) {
+        executor.submit(() -> {
+          try {
+            h.query(req("name:\"john smith\""));
+          } catch (SolrException e) {

Review comment:
       Currently the `CircuitBreakerManager.toErrorMessage` uses individual 
breaker's `getDebugInfo` to construct the error message. I think we should 
improve these individual messages to clearly say why a break was tripped. 
Currently it's not obvious - eg. when a mem breaker trips it just reports the 
current and total memory but not the thresholdPct.
   
   Maybe we need a different method like `CircuitBreaker.getErrorMessage` ?

##########
File path: solr/core/src/java/org/apache/solr/core/SolrConfig.java
##########
@@ -229,9 +229,13 @@ private SolrConfig(SolrResourceLoader loader, String name, 
boolean isConfigsetTr
     enableLazyFieldLoading = getBool("query/enableLazyFieldLoading", false);
 
     useCircuitBreakers = getBool("circuitBreaker/useCircuitBreakers", false);
+    isCpuCircuitBreakerEnabled = 
getBool("circuitBreaker/isCpuCircuitBreakerEnabled", false);

Review comment:
       I'm not sure if the boolean flags should always contain `is`, also I 
generally hate too long names... ;) we already know this is a section for 
circuit breakers, so the name doesn't have to repeat all of it. How about 
`cpuBreakerEnabled`, `memoryBreakerEnabled` etc? 

##########
File path: 
solr/core/src/java/org/apache/solr/util/circuitbreaker/MemoryCircuitBreaker.java
##########
@@ -43,6 +43,7 @@
   private static final Logger log = 
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
   private static final MemoryMXBean MEMORY_MX_BEAN = 
ManagementFactory.getMemoryMXBean();
 
+  private boolean isMemoryCircuitBreakerEnabled;

Review comment:
       We know it's a boolean and it's in the MemoryCircuitBreaker, why not 
simply call it `enabled` (like many other Solr plugins do)?

##########
File path: solr/solr-ref-guide/src/circuit-breakers.adoc
##########
@@ -35,33 +35,67 @@ will be disabled globally. Per circuit breaker 
configurations are specified in t
 <useCircuitBreakers>false</useCircuitBreakers>
 ----
 
+This flag acts as the highest authority and global controller of circuit 
breakers. For using specific circuit breakers, each one
+needs to be individually enabled in addition to this flag being enabled.
+
 == Currently Supported Circuit Breakers
 
 === JVM Heap Usage Based Circuit Breaker
 This circuit breaker tracks JVM heap memory usage and rejects incoming search 
requests with a 503 error code if the heap usage
 exceeds a configured percentage of maximum heap allocated to the JVM (-Xmx). 
The main configuration for this circuit breaker is
 controlling the threshold percentage at which the breaker will trip.
 
-It does not logically make sense to have a threshold below 50% and above 95% 
of the max heap allocated to the JVM. Hence, the range
-of valid values for this parameter is [50, 95], both inclusive.
+To enable/disable JVM heap usage based circuit breaker, use the following 
configuration:
+
+[source,xml]
+----
+<isMemoryCircuitBreakerEnabled>true</isMemoryCircuitBreakerEnabled>
+----
+
+Note that this configuration will be overridden by the global circuit breaker 
flag -- if circuit breakers are disabled, this flag
+will not help you.
+
+To set the triggering threshold as a percentage of the max heap allocated to 
the JVM, use the following parameter.
 
 [source,xml]
 ----
 <memoryCircuitBreakerThresholdPct>75</memoryCircuitBreakerThresholdPct>
 ----
+It does not logically make sense to have a threshold below 50% and above 95% 
of the max heap allocated to the JVM. Hence, the range
+of valid values for this parameter is [50, 95], both inclusive.
 
 Consider the following example:
 
 JVM has been allocated a maximum heap of 5GB (-Xmx) and 
memoryCircuitBreakerThresholdPct is set to 75. In this scenario, the heap usage
 at which the circuit breaker will trip is 3.75GB.
 
-Note that this circuit breaker is checked for each incoming search request and 
considers the current heap usage of the node, i.e every search
-request will get the live heap usage and compare it against the set memory 
threshold. The check does not impact performance,
-but any performance regressions that are suspected to be caused by this 
feature should be reported to the dev list.
 
+=== CPU Utilization Based Circuit Breaker
+This circuit breaker tracks CPU utilization and triggers if the average CPU 
utilization over the last minute
+exceeds a configurable threshold.
+
+To enable/disable CPU utilization based circuit breaker, use the following 
configuration:
+
+[source,xml]
+----
+<isCpuCircuitBreakerEnabled>true</isCpuCircuitBreakerEnabled>
+----
+
+Note that this configuration will be overridden by the global circuit breaker 
flag -- if circuit breakers are disabled, this flag
+will not help you.
+
+To set the triggering threshold in units of CPU utilization, use the following 
parameter.
+[source,xml]
+----
+<cpuCircuitBreakerThreshold>75</cpuCircuitBreakerThreshold>
+----
+The range of valid values for this parameter is [40 to infinity].

Review comment:
       This is still not removed in the latest rev?
   
   Also, we can be more helpful here: maybe add something like "typically 1-min 
average CPU load may often reach values of 10s for a busy machine, but values 
consistently above 20 and higher indicate progressively serious contention." - 
or something like that.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to