Thanks a lot for the comments, Jun! Indeed, this is a practical solution that 
originated in the field, and we really appreciate the guidance on making it more 
general. Please see the inline responses to the specific questions below.


On 2021/04/07 17:59:59, Jun Rao <j...@confluent.io.INVALID> wrote: 
> Hi, George,
> 
> A few more comments on the KIP.
> 
> 1. It would be useful to motivate the problem a bit more. For example, is
> the KIP trying to solve a transient broker problem (if so, for how long) or
> a permanent broker problem? It would also be useful to list some common
> causes that can slow the broker down.

[GS] The most common scenario that directly led to this change is disk failure, 
a permanent problem in the sense that it requires intervention from the admin. 
However, we have also seen it be effective in situations where the broker can 
self-heal, such as a temporary network connection issue. In the end, as long as 
the short-circuit mechanism itself can adapt, it is agnostic to the duration of 
the failure. 

> 
> 2. It would be useful to discuss a bit more on the high level approach
> (e.g. in the rejected section). This KIP proposes to fix the issue on the
> client side by having a pluggable component to redirect the traffic to
> other brokers. One potential issue with this is that it requires all
> clients to opt in (assuming this is not the default) for the plugin to see
> the benefit. In some environments with a large number of clients,
> coordinating all those clients may not be easy. Another potential solution
> is to fix the issue on the server side. For example, if a broker is slow
> because it has noisy neighbors in a virtual environment, we could
> proactively bring down the broker and restart it somewhere else. This has
> the benefit that it requires less client side coordination.
> 

[GS] We agree with the assessment of the client coordination complexity. We 
added a section at the end of the KIP to elaborate a bit more. Basically, we 
think client-side circuit breaking and server-side broker high availability are 
complementary rather than conflicting. On one hand, implementing broker HA in 
the control plane is unlikely to be practical (or is extremely expensive); on 
the other hand, we have often seen client-side mechanisms used to mitigate 
network problems between client and broker. An analogy is that most RPC 
frameworks implement filtering of problematic nodes on both the server side and 
the client side.

> 3. Regarding how to detect broker slowness in the client. The proposal is
> based on the error in the produce response. Typically, if the broker is
> just slow, the only type of error the client gets is the timeout exception.
> Since the default timeout is 30 seconds, it may not be triggered all the
> time and it may be too late to reflect a broker side issue. I am wondering
> if there are other better indicators. For example, another potential option
> is to use the number of pending batches per partition (or broker) in the
> Accumulator. Intuitively, if a broker is slow, all partitions with the
> leader on it will gradually accumulate more batches.
> 

[GS] It is indeed naive to rely only on timeouts. We have iterated on this draft 
twice, and the current version supports passing state context to a custom 
implementation of the circuit breaker class, including the write result, 
in-flight requests, and pending batches.
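
To make the idea concrete, here is a rough sketch (in Python for brevity, and 
not the KIP's actual interface) of a breaker that consumes such a context. The 
names BrokerContext and SimpleCircuitBreaker and the two thresholds are 
hypothetical and chosen only for illustration; the 10-minute mute interval 
mirrors the default mentioned later in this thread.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class BrokerContext:
    """Hypothetical snapshot of producer-side state for one broker."""
    last_write_failed: bool  # result of the most recent produce request
    inflight_requests: int   # requests sent but not yet acknowledged
    pending_batches: int     # batches still queued for this broker


@dataclass
class SimpleCircuitBreaker:
    """Mutes a broker when any signal crosses an (assumed) threshold."""
    max_inflight: int = 5                 # illustrative threshold
    max_pending: int = 100                # illustrative threshold
    mute_retry_interval_s: float = 600.0  # 10 minutes
    _muted_at: Dict[int, float] = field(default_factory=dict)

    def record(self, broker_id: int, ctx: BrokerContext, now: float) -> None:
        """Update the mute state of a broker from the latest context."""
        unhealthy = (
            ctx.last_write_failed
            or ctx.inflight_requests > self.max_inflight
            or ctx.pending_batches > self.max_pending
        )
        if unhealthy:
            self._muted_at[broker_id] = now
        else:
            self._muted_at.pop(broker_id, None)

    def is_muted(self, broker_id: int, now: float) -> bool:
        """Return True while the broker should be skipped."""
        muted_at = self._muted_at.get(broker_id)
        if muted_at is None:
            return False
        if now - muted_at >= self.mute_retry_interval_s:
            # Interval elapsed: unmute so the broker gets retried.
            del self._muted_at[broker_id]
            return False
        return True
```

Because pending batches and in-flight requests are checked alongside the write 
result, a slow broker can be muted before any request actually times out.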

> 4. It would be useful to have a solution that works with keyed messages so
> that they can still be distributed to the partition based on the hash of
> the key.

[GS] Agreed. We have not found a straightforward way to support keyed messages. 
On the other hand, when we discuss the trade-off between message ordering 
(within a controllable time window) and high availability, almost all our 
customers prefer the latter. Hence we have rolled this out without support for 
keyed messages. 
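
A small sketch of why the two cases differ, under simplifying assumptions: the 
function choose_partition is hypothetical, and CRC32 stands in for the 
producer's real key hash. A keyed record is pinned to one partition by its 
hash, so it cannot be steered away from a muted partition without changing 
where the key lands; an unkeyed record can simply skip muted partitions.

```python
import zlib
from typing import Optional, Set


def choose_partition(key: Optional[bytes], num_partitions: int,
                     muted: Set[int], counter: int) -> int:
    """Pick a partition; only unkeyed records can avoid muted partitions."""
    if key is not None:
        # Keyed: the partition is a pure function of the key, so the
        # muted set is ignored to preserve per-key ordering.
        return zlib.crc32(key) % num_partitions
    # Unkeyed: round-robin over partitions that are not muted.
    healthy = [p for p in range(num_partitions) if p not in muted]
    # If everything is muted, fall back to all partitions.
    candidates = healthy or list(range(num_partitions))
    return candidates[counter % len(candidates)]
```

Redirecting a keyed record would mean sending the same key to two different 
partitions over time, which is the ordering trade-off described above.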

> 
> Thanks,
> 
> Jun
> 
> 
> On Wed, Mar 24, 2021 at 4:05 AM Guoqiang Shu <shuguoqi...@gmail.com> wrote:
> 
> >
> > In our current proposal it can be configured via
> > producer.circuit.breaker.mute.retry.interval (defaulted to 10 mins), but
> > perhaps 'interval' is a confusing name.
> >
> > On 2021/03/23 00:45:23, Guozhang Wang <wangg...@gmail.com> wrote:
> > > Thanks for the updated KIP! Some more comments inlined.
> > > >
> > > > I'm still not sure if, in your proposal, the muting length is a
> > > customizable value (and if yes, through which config) or it is always
> > hard
> > > coded as 10 minutes?
> > >
> > >
> > > > > Guozhang
> >
> >
> 
