Limiting open connections is not the same as rate limiting. Open connections is a count of the requests being processed by a node. When the load balancer gets a new request and all current connections are waiting for a response, a new connection is opened.
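
To make the distinction concrete, here is a minimal Python sketch (not from the thread) of a limiter that caps requests in flight rather than requests per second. The names `InFlightLimiter` and `simulate`, the limit of 4, and the tick-based replay are all assumptions made for illustration; a real load balancer enforces this on connections, not in application code.

```python
import threading


class InFlightLimiter:
    """Admit a request only while fewer than max_in_flight are being handled."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        # Count the request and return True if there is room, else return False.
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self):
        # Called when a request finishes, freeing one slot.
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def simulate(limiter, arrivals, hold):
    """Replay `arrivals` requests, one per tick, each holding a slot for
    `hold` ticks. Returns (admitted, rejected)."""
    admitted = rejected = 0
    finishing = {}  # tick -> number of requests that finish at that tick
    for tick in range(arrivals):
        # Free the slots of requests that complete at this tick.
        for _ in range(finishing.pop(tick, 0)):
            limiter.release()
        if limiter.try_acquire():
            admitted += 1
            finishing[tick + hold] = finishing.get(tick + hold, 0) + 1
        else:
            rejected += 1
    return admitted, rejected
```

Under these assumptions, `simulate(InFlightLimiter(4), 100, 1)` admits all 100 requests, while `simulate(InFlightLimiter(4), 100, 10)` admits 40 and rejects 60. The arrival rate is the same in both runs; only the time each request holds a slot differs, and the limit itself never needs retuning.
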
If the requests are all the same query and are served from the query cache, the rate can be very high with only a few connections. If the requests are very slow, like deep paging, it only takes a few hundred requests to max out the connections. 100 queries/sec could be 5% CPU or 100% CPU.

Think of the count of requests waiting to be handled (the number of active connections) as something like a cluster-wide load average: one connection per request being processed, plus one connection per request waiting.

wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/ (my blog)

> On Feb 16, 2021, at 8:53 AM, David Smiley <[email protected]> wrote:
>
> Walter, it sounds like you were doing rate limiting, just in a different way
> that is more dynamic than a simple (yet fiddly) constant?
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
> On Sun, Feb 14, 2021 at 2:54 PM Walter Underwood <[email protected]> wrote:
> Rate limiting is a good idea. It requires a lot of ongoing engineering to
> adjust the rates to the current cluster behavior. It doesn’t help with some
> kinds of overload. The ROI just doesn’t work out. It is too much work for not
> enough benefit.
>
> Rate limiting works if the collection size doesn’t change and the queries
> don’t change.
>
> At Netflix, we limited traffic based on the number of connections to each
> server. This is basically the length of the queue of requests for that
> server. This is similar to limiting by load average, which is also the work
> waiting to be done. It has the same weaknesses as the load average circuit
> breaker, but it did not need to be changed when average CPU usage per query
> increased. It was “set and forget”. Rate limiters require constant adjustment.
>
> wunder
> Walter Underwood
> [email protected]
> http://observer.wunderwood.org/ (my blog)
>
>> On Feb 14, 2021, at 11:44 AM, Atri Sharma <[email protected]> wrote:
>>
>> This is a debate better suited for a different forum -- but I would
>> disagree with your assertion that rate limiting is a bad idea.
>>
>> Solr allows you to specify node level request quotas which also follow the
>> principle of not limiting internal requests. I find that to be pretty useful
>> in two forms:
>> 1. Use it in conjunction with a global request limit which is typically 0.75
>>    of my total load capacity given my average query resource consumption.
>> 2. Allow per node request limits to ensure fairness and dedicated capacity
>>    for different types of requests.
>> 3. Allow circuit breakers to handle cases where a couple of rogue queries
>>    can take down nodes.
>>
>> We digress -- as I said, it should be fairly simple to have a circuit
>> breaker which rejects only external requests, but should be clearly
>> documented with its downsides.
>>
>> On Mon, 15 Feb 2021, 00:33 Walter Underwood, <[email protected]> wrote:
>> We’ve looked at and rejected rate limiters as high-maintenance and not
>> sufficient protection.
>>
>> We would have run nginx on each node, sent external traffic to nginx on a
>> different port and let internal traffic stay on the default Solr port. This
>> has other advantages (monitoring), but the rate limiting part is way too
>> fiddly.
>>
>> Rates depend on how much CPU is used per query and on the size of the
>> cluster (if they are not on each node). Some examples from our largest
>> cluster which would need a change in rate limits. Some of these could be set
>> by doing offline load benchmarks, some not.
>>
>> * Experiment cell that uses 2.5X more CPU for each query (running now in
>>   prod)
>> * Increasing traffic allocated to that cell (did this last week)
>> * Increase in index size (number of docs and CPU requirements increase about
>>   5% every month)
>> * Website slowdown that shifts most traffic to mobile, where queries use 2X
>>   as much CPU
>> * Horizontal scaling from 24 to 48 nodes
>> * Vertical scaling from c5.8xlarge to c5.18xlarge
>>
>> And so on. Rate limiting would require almost weekly load benchmarks and it
>> still wouldn’t catch the outage-causing problems.
>>
>> wunder
>> Walter Underwood
>> [email protected]
>> http://observer.wunderwood.org/ (my blog)
>>
>>> On Feb 14, 2021, at 10:25 AM, Atri Sharma <[email protected]> wrote:
>>>
>>> The way I look at it is that for cluster level stability, rate limiters
>>> should be used which allow rate limiting of only external requests. They
>>> are "circuit breakers" in the sense of defending against cluster level
>>> instability, which is what you describe.
>>>
>>> Circuit breakers, in Solr world, are targeted to be the last resort defense
>>> of a node.
>>>
>>> As I said earlier, it is possible to write a circuit breaker which rejects
>>> only external requests, but I personally do not see the benefit in presence
>>> of rate limiters.
>>>
>>> On Sun, 14 Feb 2021, 23:50 Walter Underwood, <[email protected]> wrote:
>>> Ideally, it would only affect a few queries. In reality, with a sharded
>>> system, the impact will be large.
>>>
>>> I disagree that the goal is to protect a node. The goal is to make the
>>> entire cluster avoid congestion failure when overloaded, while providing
>>> good service for the load that it can handle.
>>>
>>> I have had Solr clusters take down entire websites when overloaded, both at
>>> Netflix and Chegg, and I’ve built defenses for this at both places.
>>> I’m a huge fan of circuit breakers.
>>>
>>> wunder
>>> Walter Underwood
>>> [email protected]
>>> http://observer.wunderwood.org/ (my blog)
>>>
>>>> On Feb 14, 2021, at 9:50 AM, Atri Sharma <[email protected]> wrote:
>>>>
>>>> This has an issue of still leading to node outages if the fanout for a
>>>> query is high.
>>>>
>>>> Circuit breakers follow a simple rule -- defend the node at the cost of
>>>> degraded responses.
>>>>
>>>> Ideally, only a few requests will be completely rejected -- some will see
>>>> partial results. Due to this non-discriminating nature of circuit
>>>> breakers, the typical blip on service quality due to high resource usage
>>>> is short-lived.
>>>>
>>>> However, it is possible to write a circuit breaker which rejects only
>>>> external requests in the master branch (we have the ability to identify
>>>> requests as internal or external there).
>>>>
>>>> Regards,
>>>>
>>>> Atri
>>>>
>>>> On Sun, 14 Feb 2021, 23:07 Walter Underwood, <[email protected]> wrote:
>>>> This got zero responses on the solr-user list, so I’ll raise the issue
>>>> here.
>>>>
>>>> Should circuit breakers only kill external search requests and not
>>>> cluster-internal requests to shards?
>>>>
>>>> Circuit breakers can kill any request, whether it is a client request from
>>>> outside the cluster or an internal distributed request to a shard. Killing
>>>> a portion of a distributed request will affect the main request. Not sure
>>>> whether a 503 from a shard will kill the whole request or cause partial
>>>> results, but it isn’t good.
>>>>
>>>> We run with 8 shards. If a circuit breaker is killing 10% of requests on
>>>> each host, that will hit 57% of all external requests (0.9^8 = 0.43
>>>> survive untouched, so 1 - 0.43 = 0.57 are hit). That seems like
>>>> “overkill” to me. If it only kills external requests, then 10% means 10%.
>>>>
>>>> Killing only external requests requires that external requests go roughly
>>>> equally to all hosts in the cluster, or at least all NRT or PULL replicas.
>>>>
>>>> wunder
>>>> Walter Underwood
>>>> [email protected]
>>>> http://observer.wunderwood.org/ (my blog)
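
The 57% figure in the original question can be reproduced with a short calculation. This is only a sketch: it assumes each of the 8 shards rejects its sub-request independently at the same rate, and that any rejected sub-request counts as hitting the external request (the thread leaves open whether a shard 503 fails the whole request or yields partial results). The function name is hypothetical.

```python
def external_hit_fraction(per_shard_kill_rate, shards):
    """Fraction of external requests with at least one rejected shard
    sub-request, assuming independent rejections at the same rate."""
    survive_all = (1.0 - per_shard_kill_rate) ** shards
    return 1.0 - survive_all


# 8 shards, 10% of requests killed on each host, as in the original message.
print(round(external_hit_fraction(0.10, 8), 2))  # 0.57
```

With a single shard the function returns the per-host rate itself, which matches the point that killing only external requests makes 10% mean 10%.
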
