Hello Guoqiang,

Here is another interesting ticket that may also be related to the issues
you observed and fixed in production, if you use the sticky partitioner
in your producer clients:

https://issues.apache.org/jira/browse/KAFKA-10888


Guozhang


On Wed, Apr 7, 2021 at 11:00 AM Jun Rao <j...@confluent.io.invalid> wrote:

> Hi, George,
>
> A few more comments on the KIP.
>
> 1. It would be useful to motivate the problem a bit more. For example, is
> the KIP trying to solve a transient broker problem (if so, for how long) or
> a permanent broker problem? It would also be useful to list some common
> causes that can slow the broker down.
>
> 2. It would be useful to discuss the high-level approach a bit more
> (e.g. in the rejected section). This KIP proposes to fix the issue on the
> client side by having a pluggable component to redirect the traffic to
> other brokers. One potential issue with this is that all clients must
> opt in to the plugin (assuming it is not the default) to see
> the benefit. In some environments with a large number of clients,
> coordinating all those clients may not be easy. Another potential solution
> is to fix the issue on the server side. For example, if a broker is slow
> because it has noisy neighbors in a virtual environment, we could
> proactively bring down the broker and restart it somewhere else. This has
> the benefit that it requires less client side coordination.
>
> 3. Regarding how to detect broker slowness in the client: the proposal is
> based on the error in the produce response. Typically, if the broker is
> just slow, the only type of error the client gets is the timeout exception.
> Since the default timeout is 30 seconds, it may not be triggered all the
> time and it may come too late to reflect a broker-side issue. I am wondering
> if there are better indicators. For example, another potential option
> is to use the number of pending batches per partition (or broker) in the
> Accumulator. Intuitively, if a broker is slow, all partitions with the
> leader on it will gradually accumulate more batches.
>
> 4. It would be useful to have a solution that works with keyed messages so
> that they can still be distributed to the partition based on the hash of
> the key.
>
> Thanks,
>
> Jun
>
>
> On Wed, Mar 24, 2021 at 4:05 AM Guoqiang Shu <shuguoqi...@gmail.com>
> wrote:
>
> >
> > In our current proposal it can be configured via
> > producer.circuit.breaker.mute.retry.interval (default: 10 minutes), but
> > perhaps 'interval' is a confusing name.
> >
> > On 2021/03/23 00:45:23, Guozhang Wang <wangg...@gmail.com> wrote:
> > > Thanks for the updated KIP! Some more comments inlined.
> > > >
> > > > I'm still not sure if, in your proposal, the muting length is a
> > > > customizable value (and if yes, through which config) or it is always
> > > > hard-coded as 10 minutes?
> > >
> > >
> > > > > Guozhang
> >
> >
>


-- 
-- Guozhang
