Re: Circuit Breakers interaction with Shards

Walter Underwood Tue, 16 Feb 2021 11:56:26 -0800

Limiting open connections is not the same as rate limiting. The open connection 
count is the number of requests being processed by a node. When the load 
balancer gets a new request and all current connections are waiting for a 
response, it opens a new connection.


If the requests are all the same query and returned from the query cache, the 
rate can be very high with a few connections. If the requests are very slow, 
like deep paging, it only takes a few hundred requests to max out the 
connections. 100 queries/sec could be 5% CPU or 100% CPU. 
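
The relationship between request rate and open connections follows Little's 
law: average requests in flight = arrival rate x average time per request. A 
minimal sketch, where the function name and both latency figures are 
assumptions chosen only for illustration:

```python
def connections_in_use(requests_per_sec: float, avg_latency_sec: float) -> float:
    """Little's law: average requests in flight = rate * average latency."""
    return requests_per_sec * avg_latency_sec


# Hypothetical latencies, not measurements from any real cluster.
cached = connections_in_use(100, 0.005)     # query cache hits, about 5 ms each
deep_paging = connections_in_use(100, 3.0)  # deep paging, about 3 s each

print(f"cached:      {cached:.1f} connections in use")
print(f"deep paging: {deep_paging:.1f} connections in use")
```

The same 100 queries/sec keeps well under one connection busy in the first 
case and hundreds in the second, which is why a fixed rate says little about 
load.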

Think of the count of requests waiting to be handled (number of active 
connections) as a cluster-wide load average. One connection per request 
being processed, plus one connection per request waiting.
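
Limiting by connection count can be sketched as a counter of in-flight 
requests with a fixed ceiling. This is only an illustration of the idea, not 
the Netflix setup and not any load balancer's API; the class name, method 
names, and the limit of 2 are all made up for the example:

```python
import threading


class ConnectionLimiter:
    """Admit a request only while fewer than max_in_flight are being handled."""

    def __init__(self, max_in_flight: int) -> None:
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Return True and count the request, or False if the limit is reached."""
        with self._lock:
            if self._in_flight >= self._max:
                return False  # caller sheds the request, e.g. with a 503
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Call when a request finishes, successfully or not."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1

    @property
    def in_flight(self) -> int:
        with self._lock:
            return self._in_flight


limiter = ConnectionLimiter(max_in_flight=2)
admitted = [limiter.try_acquire() for _ in range(3)]  # third one is rejected
limiter.release()                                     # one request completes
admitted_after_release = limiter.try_acquire()        # a slot is free again
print(admitted, admitted_after_release, limiter.in_flight)
```

The ceiling does not depend on how expensive each query is: slow queries hold 
their slots longer, so fewer new requests get in.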

wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/  (my blog)

> On Feb 16, 2021, at 8:53 AM, David Smiley <[email protected]> wrote:
> 
> Walter, it sounds like you were doing rate limiting, just in a different way 
> that is more dynamic than a simple (yet fiddly) constant?
> 
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
> 
> On Sun, Feb 14, 2021 at 2:54 PM Walter Underwood <[email protected]> wrote:
> Rate limiting is a good idea. It requires a lot of ongoing engineering to 
> adjust the rates to the current cluster behavior. It doesn’t help with some 
> kinds of overload. The ROI just doesn’t work out. It is too much work for not 
> enough benefit.
> 
> Rate limiting works if the collection size doesn’t change and the queries 
> don’t change.
> 
> At Netflix, we limited traffic based on number of connections to each server. 
> This is basically the length of the queue of requests for that server. This 
> is similar to limiting by load average, which is also the work waiting to be 
> done. It has the same weaknesses as the load average circuit breaker, but it 
> did not need to be changed when average CPU usage per query increased. It was 
> “set and forget”. Rate limiters require constant adjustment.
> 
> wunder
> Walter Underwood
> [email protected]
> http://observer.wunderwood.org/  (my blog)
> 
>> On Feb 14, 2021, at 11:44 AM, Atri Sharma <[email protected]> wrote:
>> 
>> This is a debate better suited for a different forum -- but I would 
>> disagree with your assertion that rate limiting is a bad idea.
>> 
>> Solr allows you to specify node level request quotas which also follow the 
>> principle of not limiting internal requests. I find that to be pretty useful 
>> in two forms: 1. Use it in conjunction with a global request limit which is 
>> typically 0.75 of my total load capacity given my average query resource 
>> consumption. 2. Allow per node request limits to ensure fairness and 
>> dedicated capacity for different types of requests. 3. Allow circuit 
>> breakers to handle cases where a couple of rogue queries can take down nodes.
>> 
>> We digress -- as I said, it should be fairly simple to have a circuit 
>> breaker which rejects only external requests, but should be clearly 
>> documented with its downsides.
>> 
>> On Mon, 15 Feb 2021, 00:33 Walter Underwood, <[email protected]> wrote:
>> We’ve looked at and rejected rate limiters as high-maintenance and not 
>> sufficient protection.
>> 
>> We would have run nginx on each node, sent external traffic to nginx on a 
>> different port and let internal traffic stay on the default Solr port. This 
>> has other advantages (monitoring), but the rate limiting part is way too 
>> fiddly.
>> 
>> Rates depend on how much CPU is used per query and on the size of the 
>> cluster (if the rate limiters are not on each node). Here are some examples 
>> from our largest cluster, each of which would need a change in rate limits. 
>> Some of these could be set by doing offline load benchmarks, some not.
>> 
>> * Experiment cell that uses 2.5X more CPU for each query (running now in 
>> prod)
>> * Increasing traffic allocated to that cell (did this last week)
>> * Increase in index size (number of docs and CPU requirements increase about 
>> 5% every month)
>> * Website slowdown that shifts most traffic to mobile, where queries use 2X 
>> as much CPU
>> * Horizontal scaling from 24 to 48 nodes
>> * Vertical scaling from c5.8xlarge to c5.18xlarge
>> 
>> And so on. Rate limiting would require almost weekly load benchmarks and it 
>> still wouldn’t catch the outage-causing problems.
>> 
>> wunder
>> Walter Underwood
>> [email protected]
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Feb 14, 2021, at 10:25 AM, Atri Sharma <[email protected]> wrote:
>>> 
>>> The way I look at it is that for cluster level stability, rate limiters 
>>> should be used which allow rate limiting of only external requests. They 
>>> are "circuit breakers" in the sense of defending against cluster level 
>>> instability, which is what you describe.
>>> 
>>> Circuit breakers, in the Solr world, are targeted to be the last resort 
>>> defense of a node.
>>> 
>>> As I said earlier, it is possible to write a circuit breaker which rejects 
>>> only external requests, but I personally do not see the benefit in the 
>>> presence of rate limiters.
>>> 
>>> On Sun, 14 Feb 2021, 23:50 Walter Underwood, <[email protected]> wrote:
>>> Ideally, it would only affect a few queries. In reality, with a sharded 
>>> system, the impact will be large.
>>> 
>>> I disagree that the goal is to protect a node. The goal is to make the 
>>> entire cluster avoid congestion failure when overloaded, while providing 
>>> good service for the load that it can handle.
>>> 
>>> I have had Solr clusters take down entire websites when overloaded, both at 
>>> Netflix and Chegg, and I’ve built defenses for this at both places. I’m a 
>>> huge fan of circuit breakers.
>>> 
>>> wunder
>>> Walter Underwood
>>> [email protected]
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>>> On Feb 14, 2021, at 9:50 AM, Atri Sharma <[email protected]> wrote:
>>>> 
>>>> This has an issue of still leading to node outages if the fanout for a 
>>>> query is high.
>>>> 
>>>> Circuit breakers follow a simple rule -- defend the node at the cost of 
>>>> degraded responses.
>>>> 
>>>> Ideally, only a few requests will be completely rejected -- some will see 
>>>> partial results. Due to this non-discriminating nature of circuit 
>>>> breakers, the typical blip on service quality due to high resource usage 
>>>> is short lived.
>>>> 
>>>> However, it is possible to write a circuit breaker which rejects only 
>>>> external requests in master branch (we have the ability to identify 
>>>> requests as internal or external there).
>>>> 
>>>> Regards,
>>>> 
>>>> Atri
>>>> 
>>>> On Sun, 14 Feb 2021, 23:07 Walter Underwood, <[email protected]> wrote:
>>>> This got zero responses on the solr-user list, so I’ll raise the issue 
>>>> here.
>>>> 
>>>> Should circuit breakers only kill external search requests and not 
>>>> cluster-internal requests to shards?
>>>> 
>>>> Circuit breakers can kill any request, whether it is a client request from 
>>>> outside the cluster or an internal distributed request to a shard. Killing 
>>>> a portion of a distributed request will affect the main request. Not sure 
>>>> whether a 503 from a shard will kill the whole request or cause partial 
>>>> results, but it isn’t good.
>>>> 
>>>> We run with 8 shards. If a circuit breaker is killing 10% of requests on 
>>>> each host, that will hit 57% of all external requests (0.9^8 = 0.43). That 
>>>> seems like “overkill” to me. If it only kills external requests, then 10% 
>>>> means 10%.
>>>> 
>>>> Killing only external requests requires that external requests go roughly 
>>>> equally to all hosts in the cluster, or at least all NRT or PULL replicas.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> [email protected]
>>>> http://observer.wunderwood.org/  (my blog)
>>> 
>> 
>
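
The fan-out arithmetic in the first message of this thread (8 shards, 10% of 
requests killed on each host, 0.9^8 = 0.43) can be reproduced with a short 
sketch. It assumes that every external request fans out to all shards, that 
kills on different hosts are independent, and that losing any one shard 
request counts as hitting the external request. The function name is made up 
for illustration:

```python
def fraction_of_external_requests_hit(per_host_kill_rate: float, shards: int) -> float:
    """Chance that at least one of `shards` independent sub-requests is killed."""
    survive_all = (1.0 - per_host_kill_rate) ** shards
    return 1.0 - survive_all


hit = fraction_of_external_requests_hit(0.10, 8)
print(f"all 8 shard requests survive: {1.0 - hit:.2f}")
print(f"external requests hit:        {hit:.0%}")
```

With a single shard the result is just the per-host kill rate, which matches 
the case where only external requests are killed.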
