Ideally, it would only affect a few queries. In reality, with a sharded system, the impact will be large.
I disagree that the goal is to protect a node. The goal is to make the entire cluster avoid congestion failure when overloaded, while providing good service for the load that it can handle. I have had Solr clusters take down entire websites when overloaded, both at Netflix and Chegg, and I’ve built defenses for this at both places. I’m a huge fan of circuit breakers.

wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/ (my blog)

> On Feb 14, 2021, at 9:50 AM, Atri Sharma <[email protected]> wrote:
>
> This has an issue of still leading to node outages if the fanout for a query
> is high.
>
> Circuit breakers follow a simple rule -- defend the node at the cost of
> degraded responses.
>
> Ideally, only few requests will be completely rejected -- some will see
> partial results. Due to this non discriminating nature of circuit breakers,
> the typical blip on service quality due to high resource usage is short lived.
>
> However, it is possible to write a circuit breaker which rejects only
> external requests in master branch (we have the ability to identify requests
> as internal or external there).
>
> Regards,
>
> Atri
>
> On Sun, 14 Feb 2021, 23:07 Walter Underwood, <[email protected]> wrote:
> This got zero responses on the solr-user list, so I’ll raise the issue here.
>
> Should circuit breakers only kill external search requests and not
> cluster-internal requests to shards?
>
> Circuit breakers can kill any request, whether it is a client request from
> outside the cluster or an internal distributed request to a shard. Killing a
> portion of distributed request will affect the main request. Not sure whether
> a 503 from a shard will kill the whole request or cause partial results, but
> it isn’t good.
>
> We run with 8 shards. If a circuit breaker is killing 10% of requests on each
> host, that will hit 57% of all external requests (0.9^8 = 0.43). That seems
> like “overkill” to me. If it only kills external requests, then 10% means 10%.
>
> Killing only external requests requires that external requests go roughly
> equally to all hosts in the cluster, or at least all NRT or PULL replicas.
>
> wunder
> Walter Underwood
> [email protected]
> http://observer.wunderwood.org/ (my blog)
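
For anyone who wants to check the fan-out arithmetic in the quoted message, here is a minimal sketch. It assumes each host's circuit breaker rejects requests independently and at the same rate, and that a distributed query touches every shard exactly once. The function name and parameters are made up for illustration and are not part of Solr.

```python
def fraction_of_requests_hit(shards: int, per_host_kill_rate: float) -> float:
    """Fraction of external requests that lose at least one shard sub-request.

    Assumption: every shard rejects independently with the same probability,
    and each external request fans out to all shards.
    """
    # Probability that every shard sub-request survives its breaker.
    survive_all = (1.0 - per_host_kill_rate) ** shards
    return 1.0 - survive_all


if __name__ == "__main__":
    # 8 shards, breaker killing 10% of requests on each host.
    print(f"sharded fan-out: {fraction_of_requests_hit(8, 0.10):.2f}")
    # Breaker applied only once, to the external request.
    print(f"external only:   {fraction_of_requests_hit(1, 0.10):.2f}")
```

Under those assumptions the 8-shard case comes out to about 0.57, while rejecting only the external request keeps the impact at the configured 10%.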
