Ideally, it would only affect a few queries. In reality, with a sharded system, 
the impact will be large.
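
To put a number on that, here is a minimal sketch of the arithmetic from my 
original message quoted below (8 shards, 10% of requests rejected per host). 
It assumes every external request fans out to all shards, that rejections on 
different hosts are independent, and that losing any one shard sub-request 
degrades the whole request. The function name is made up for illustration.

```python
def external_failure_rate(per_host_reject_rate: float, shards: int) -> float:
    """Fraction of external requests that lose at least one shard sub-request.

    Assumes independent rejections and a fan-out to every shard.
    """
    # Probability that all shard sub-requests survive, then take the complement.
    all_survive = (1.0 - per_host_reject_rate) ** shards
    return 1.0 - all_survive


if __name__ == "__main__":
    for shards in (1, 2, 4, 8, 16):
        rate = external_failure_rate(0.10, shards)
        print(f"{shards:2d} shards: {rate:.1%} of external requests affected")
```

With one shard, 10% means 10%. The affected fraction grows quickly as the 
shard count goes up.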

I disagree that the goal is to protect a node. The goal is to make the entire 
cluster avoid congestion failure when overloaded, while providing good service 
for the load that it can handle.

I have had Solr clusters take down entire websites when overloaded, both at 
Netflix and Chegg, and I’ve built defenses for this at both places. I’m a huge 
fan of circuit breakers.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 14, 2021, at 9:50 AM, Atri Sharma <a...@apache.org> wrote:
> 
> This can still lead to node outages if the fanout for a query is high.
> 
> Circuit breakers follow a simple rule -- defend the node at the cost of 
> degraded responses.
> 
> Ideally, only a few requests will be completely rejected -- some will see 
> partial results. Due to this non-discriminating nature of circuit breakers, 
> the typical blip in service quality due to high resource usage is short-lived.
> 
> However, it is possible to write a circuit breaker which rejects only 
> external requests on the master branch (we have the ability to identify 
> requests as internal or external there).
> 
> Regards,
> 
> Atri
> 
> On Sun, 14 Feb 2021, 23:07 Walter Underwood, <wun...@wunderwood.org> wrote:
> This got zero responses on the solr-user list, so I’ll raise the issue here.
> 
> Should circuit breakers only kill external search requests and not 
> cluster-internal requests to shards?
> 
> Circuit breakers can kill any request, whether it is a client request from 
> outside the cluster or an internal distributed request to a shard. Killing a 
> portion of a distributed request will affect the main request. I’m not sure 
> whether a 503 from a shard will kill the whole request or cause partial 
> results, but either way it isn’t good.
> 
> We run with 8 shards. If a circuit breaker is killing 10% of requests on each 
> host, that will hit 57% of all external requests (only 0.9^8 = 0.43 of them 
> get through every shard untouched). That seems like “overkill” to me. If it 
> only kills external requests, then 10% means 10%.
> 
> Killing only external requests requires that external requests go roughly 
> equally to all hosts in the cluster, or at least all NRT or PULL replicas.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
