Hi Sanjana and thanks for the KIP!

Sorry for the late response, but I still have a few questions that you
might find useful.

The KIP currently does not mention Kafka Connect at all. I have read
the discussion above where it'd been decided to leave Connect and Streams
out of the proposed changes, but I feel this should be called out
explicitly. At the same time, Kafka Connect is also a Kafka client that
uses ConsumerNetworkClient and Metadata for its rebalancing protocol. It's
not clear to me whether changes in those classes will affect Connect
workers. Do you think it's worth clarifying that?

Additionally, you might also want to add a section specifically to mention
how this new config affects the places where the current config
retry.backoff.ms is used today to back off during rebalancing. Is
exponential backoff going to replace the old config in those places as
well? And if it does, should we add a mention that a very high value of the
new retry.backoff.max.ms might affect how quickly a consumer or worker
rejoins its group after it experiences a temporary network partition
from the broker coordinator?

Places that explicitly use retry.backoff.ms at the moment include the
AbstractCoordinator, the ConsumerCoordinator and the Heartbeat thread. By
reading the previous discussion, I understand that these classes might keep
using the old static backoff. Even if that's the case, I think it's worth
mentioning that in the KIP for reference.

In the rejected alternatives section, you mention that "existing behavior
is always maintained: for reasons explained in the compatibility section.".
However, the Compatibility section says that there are no compatibility
concerns. I'd suggest extending the compatibility section to help a bit
more in explaining why the alternatives were rejected. Also, in the
compatibility section you mention that the new config (retry.backoff.max.ms)
will replace the old one (retry.backoff.ms), but from reading the
beginning of the KIP, I understand that in order to have exponential
increments, you actually need both configs, with
retry.backoff.ms < retry.backoff.max.ms. Should the mention of replacement
be removed?

Finally, I have a minor suggestion that might help explain the following
sentence better:

"If retry.backoff.ms is set to be greater than retry.backoff.max.ms, then
retry.backoff.max.ms will be used as a **constant backoff from the
beginning without exponential increase**." (highlighting the difference
only for reference here). That is, unless I misunderstood how the new
backoff will be used when it's smaller than the value of the old config, in
which case it might help to clarify that a bit more as well.


Thanks for the KIP!
Really looking forward to more robust retries in Kafka clients.

Konstantine


On Tue, Mar 24, 2020 at 9:56 AM Guozhang Wang <wangg...@gmail.com> wrote:

> In Kafka clients, there are cases where we log a warning when overriding
> some conflicting configs, and in some other cases we throw and let the
> brokers die during startup --- you can check the
> `postProcessParsedConfig` function in Producer/ConsumerConfig for such
> logic.
>
> I think for this case, it is sufficient to log a warning if we find the
> `max` < `backoff`.
>
>
> Guozhang
>
> On Mon, Mar 23, 2020 at 9:18 PM Boyang Chen <reluctanthero...@gmail.com>
> wrote:
>
> > Got it, although I would still like to be aware of the actual backoff I
> > will be using in production, having the app crash seems like an
> > over-reaction. I don't think I have further questions :)
> >
> > On Mon, Mar 23, 2020 at 7:36 PM Sanjana Kaundinya <skaundi...@gmail.com>
> > wrote:
> >
> > > Hey Sanjana,
> > >
> > > Hey Boyang,
> > >
> > > If a user provides no config at all then, as you mentioned, they will
> > > by default be able to make use of the exponential backoff feature
> > > introduced by the KIP. If backoff.ms is overridden to 2000 ms, the
> > > lesser of the max and the computed backoff will be chosen, so in this
> > > case the max will be chosen, as it is 1000 ms. As Guozhang mentioned,
> > > if the user configures something like this then they would notice that
> > > the behavior is not in line with what they expect, and would see the
> > > KIP and release notes and know to configure backoff.ms < max.backoff.ms.
> > > I’m not sure it’s a big enough error to reject the configuration when
> > > it’s configured like this, as the clients could still run in either
> > > case.
> > >
> > > To answer your second question, we are making the dynamic backoff the
> > > default and not allowing for static backoff (unless they set
> > > backoff.ms > max.backoff.ms, in which case it would in a sense be
> > > static). We will include this information in the release notes to make
> > > sure users are aware of this behavior change.
> > >
> > > Thanks,
> > > Sanjana
> > >
> > > On Mon, Mar 23, 2020 at 6:37 PM Boyang Chen <
> reluctanthero...@gmail.com>
> > > wrote:
> > >
> > > > Hey Sanjana,
> > > >
> > > > my understanding with the update is that if a user provides no
> > > > config at all, a Producer/Consumer/Admin client user would by default
> > > > get a starting backoff.ms of 100 ms and a max.backoff.ms of 1000 ms?
> > > > If I already override backoff.ms to 2000 ms, for instance, will I be
> > > > choosing the default max.backoff here?
> > > >
> > > > I guess my question would be whether we should just reject a config
> > > > with backoff.ms > max.backoff.ms in the first place, as this looks
> > > > like a misconfiguration to me.
> > > >
> > > > The second question is whether we allow fallback to static backoff if
> > > > the user wants it, or whether we should just ship this as an opt-in
> > > > feature?
> > > >
> > > > Let me know your thoughts.
> > > >
> > > > Boyang
> > > >
> > > > On Mon, Mar 23, 2020 at 11:38 AM Cheng Tan <c...@confluent.io>
> wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > > On Mar 19, 2020, at 7:27 PM, Sanjana Kaundinya <
> > skaundi...@gmail.com
> > > >
> > > > > wrote:
> > > > > >
> > > > > > Ah yes that makes sense. I’ll update the KIP to reflect this.
> > > > > >
> > > > > > Thanks,
> > > > > > Sanjana
> > > > > >
> > > > > > On Thu, Mar 19, 2020 at 5:48 PM Guozhang Wang <
> wangg...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > >> Following the formula you have in the KIP, if it is simply:
> > > > > > >>
> > > > > > >> MIN(retry.backoff.max.ms, (retry.backoff.ms * 2**(failures - 1)) * random(0.8, 1.2))
> > > > > > >>
> > > > > > >> then the behavior would stay consistent at retry.backoff.max.ms.
> > > > > >>
> > > > > >>
> > > > > >> Guozhang
> > > > > >>
> > > > > >> On Thu, Mar 19, 2020 at 5:46 PM Sanjana Kaundinya <
> > > > skaundi...@gmail.com
> > > > > >
> > > > > >> wrote:
> > > > > >>
> > > > > > >>> If that’s the case then what should we use as the starting
> > > > > > >>> point? Currently in the KIP the starting point is
> > > > > > >>> retry.backoff.ms and it goes up exponentially to
> > > > > > >>> retry.backoff.max.ms. If retry.backoff.max.ms is smaller than
> > > > > > >>> retry.backoff.ms then that could pose a bit of a problem,
> > > > > > >>> right?
> > > > > >>>
> > > > > >>> On Mar 19, 2020, 5:44 PM -0700, Guozhang Wang <
> > wangg...@gmail.com
> > > >,
> > > > > >> wrote:
> > > > > > >>>> Thanks Sanjana, I did not capture the part that Jason referred
> > > > > > >>>> to, so that's my bad :P
> > > > > > >>>>
> > > > > > >>>> Regarding your last statement, I actually feel that instead of
> > > > > > >>>> taking the larger of the two, we should respect
> > > > > > >>>> "retry.backoff.max.ms" even if it is smaller than
> > > > > > >>>> "retry.backoff.ms". I do not have a very strong rationale
> > > > > > >>>> except that it is logically more aligned with the config names.
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> Guozhang
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> On Thu, Mar 19, 2020 at 5:39 PM Sanjana Kaundinya <
> > > > > >> skaundi...@gmail.com>
> > > > > >>>> wrote:
> > > > > >>>>
> > > > > > >>>>> Hey Jason and Guozhang,
> > > > > > >>>>>
> > > > > > >>>>> Jason is right, I took this inspiration from KIP-144
> > > > > > >>>>> (https://cwiki.apache.org/confluence/display/KAFKA/KIP-144%3A+Exponential+backoff+for+broker+reconnect+attempts),
> > > > > > >>>>> which had the same logic in order to preserve the existing
> > > > > > >>>>> behavior. In this case, however, if we are thinking of
> > > > > > >>>>> completely eliminating the static backoff behavior, we can do
> > > > > > >>>>> that and, as Jason mentioned, put it in the release notes and
> > > > > > >>>>> not add any special logic. In addition, I agree that we should
> > > > > > >>>>> take the larger of `retry.backoff.ms` and
> > > > > > >>>>> `retry.backoff.max.ms`. I'll update the KIP to reflect this and
> > > > > > >>>>> make it clear that the old static retry backoff is being
> > > > > > >>>>> replaced by the new dynamic retry backoff.
> > > > > >>>>>
> > > > > >>>>> Thanks,
> > > > > >>>>> Sanjana
> > > > > >>>>> On Thu, Mar 19, 2020 at 4:23 PM Jason Gustafson <
> > > > ja...@confluent.io>
> > > > > >>>>> wrote:
> > > > > >>>>>
> > > > > >>>>>> Hey Guozhang,
> > > > > >>>>>>
> > > > > > >>>>>> I was referring to this:
> > > > > > >>>>>>
> > > > > > >>>>>>> For users who have not set retry.backoff.ms explicitly, the
> > > > > > >>>>>>> default behavior will change so that the backoff will grow up
> > > > > > >>>>>>> to 1000 ms. For users who have set retry.backoff.ms
> > > > > > >>>>>>> explicitly, the behavior will remain the same as they could
> > > > > > >>>>>>> have specific requirements.
> > > > > > >>>>>>
> > > > > > >>>>>> I took this to mean that for users who have overridden
> > > > > > >>>>>> `retry.backoff.ms` to 50ms (say), we will change the default
> > > > > > >>>>>> `retry.backoff.max.ms` to 50ms as well in order to preserve
> > > > > > >>>>>> existing backoff behavior. Is that not right? In any case, I
> > > > > > >>>>>> agree that we can use the maximum of the two values as the
> > > > > > >>>>>> effective `retry.backoff.max.ms` to handle the case when the
> > > > > > >>>>>> configured value of `retry.backoff.ms` is larger than the
> > > > > > >>>>>> default of 1s.
> > > > > >>>>>> -Jason
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>> On Thu, Mar 19, 2020 at 3:29 PM Guozhang Wang <
> > > wangg...@gmail.com
> > > > >
> > > > > >>>>> wrote:
> > > > > >>>>>>
> > > > > >>>>>>> Hey Jason,
> > > > > >>>>>>>
> > > > > > >>>>>>> My understanding is a bit different here: even if the user has
> > > > > > >>>>>>> explicitly overridden "retry.backoff.ms", the exponential
> > > > > > >>>>>>> mechanism still triggers and the backoff would be increased
> > > > > > >>>>>>> till "retry.backoff.max.ms"; and if the specified
> > > > > > >>>>>>> "retry.backoff.ms" is already larger than
> > > > > > >>>>>>> "retry.backoff.max.ms", we would still take
> > > > > > >>>>>>> "retry.backoff.max.ms".
> > > > > > >>>>>>>
> > > > > > >>>>>>> So if the user does override "retry.backoff.ms" to a value
> > > > > > >>>>>>> larger than 1s and is not aware of the new config, she would
> > > > > > >>>>>>> be surprised to see the specified value seemingly not being
> > > > > > >>>>>>> respected, but she could still learn that afterwards by
> > > > > > >>>>>>> reading the release notes introducing this KIP anyways.
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> Guozhang
> > > > > >>>>>>>
> > > > > >>>>>>> On Thu, Mar 19, 2020 at 3:10 PM Jason Gustafson <
> > > > > >>> ja...@confluent.io>
> > > > > >>>>>>> wrote:
> > > > > >>>>>>>
> > > > > >>>>>>>> Hi Sanjana,
> > > > > >>>>>>>>
> > > > > > >>>>>>>> The KIP looks good to me. I had just one question about the
> > > > > > >>>>>>>> default behavior. As I understand, if the user has specified
> > > > > > >>>>>>>> `retry.backoff.ms` explicitly, then we will not apply the
> > > > > > >>>>>>>> default max backoff. As such, there's no way to get the
> > > > > > >>>>>>>> benefit of this feature if you are providing a
> > > > > > >>>>>>>> `retry.backoff.ms` unless you also provide
> > > > > > >>>>>>>> `retry.backoff.max.ms`. That makes sense if you assume the
> > > > > > >>>>>>>> user is unaware of the new configuration, but it is
> > > > > > >>>>>>>> surprising otherwise. Since it's not a semantic change and
> > > > > > >>>>>>>> since the default you're proposing of 1s is fairly low
> > > > > > >>>>>>>> already, I wonder if it's good enough to mention the new
> > > > > > >>>>>>>> configuration in the release notes and not add any special
> > > > > > >>>>>>>> logic. What do you think?
> > > > > >>>>>>>>
> > > > > >>>>>>>> -Jason
> > > > > >>>>>>>>
> > > > > >>>>>>>> On Thu, Mar 19, 2020 at 1:56 PM Sanjana Kaundinya <
> > > > > >>>>>> skaundi...@gmail.com>
> > > > > >>>>>>>> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>>> Thank you for the comments Guozhang.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I’ll leave this KIP out for discussion till the end of
> the
> > > > > >>> week and
> > > > > >>>>>>> then
> > > > > >>>>>>>>> start a vote for this early next week.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Sanjana
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> On Mar 18, 2020, 3:38 PM -0700, Guozhang Wang <
> > > > > >>> wangg...@gmail.com
> > > > > >>>>>> ,
> > > > > >>>>>>>> wrote:
> > > > > >>>>>>>>>> Hello Sanjana,
> > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> Thanks for the proposed KIP, I think that makes a lot of
> > > > > > >>>>>>>>>> sense -- as you mentioned in the motivation, we've indeed
> > > > > > >>>>>>>>>> seen many issues with regard to the frequent retries. With
> > > > > > >>>>>>>>>> bounded exponential backoff, in the scenario where there's a
> > > > > > >>>>>>>>>> long connectivity issue, we would effectively reduce the
> > > > > > >>>>>>>>>> request load by a factor of 10 given the default configs.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> For the higher-level Streams client and Connect frameworks,
> > > > > > >>>>>>>>>> today we also have retry logic, but it's used in a slightly
> > > > > > >>>>>>>>>> different way. For example, in Streams we tend to handle the
> > > > > > >>>>>>>>>> retry logic at the thread level and hence very likely we'd
> > > > > > >>>>>>>>>> like to change that mechanism in KIP-572 anyways. For
> > > > > > >>>>>>>>>> producer / consumer / admin clients, I think just applying
> > > > > > >>>>>>>>>> this behavioral change across these clients makes a lot of
> > > > > > >>>>>>>>>> sense. So I think we can just leave Streams / Connect out of
> > > > > > >>>>>>>>>> the scope of this KIP, to be addressed in separate
> > > > > > >>>>>>>>>> discussions.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I do not have further comments about this KIP :) LGTM.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Guozhang
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> On Wed, Mar 18, 2020 at 12:09 AM Sanjana Kaundinya <
> > > > > >>>>>>>> skaundi...@gmail.com
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> wrote:
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>> Thanks for the feedback Boyang.
> > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>> If there’s anyone else who has feedback regarding this KIP,
> > > > > > >>>>>>>>>>> I would really appreciate hearing it!
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>> Sanjana
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <
> > > > > >>>>>> bche...@outlook.com>
> > > > > >>>>>>>>> wrote:
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>> Sounds great!
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> ________________________________
> > > > > >>>>>>>>>>>> From: Sanjana Kaundinya <skaundi...@gmail.com>
> > > > > >>>>>>>>>>>> Sent: Tuesday, March 17, 2020 5:54:35 PM
> > > > > >>>>>>>>>>>> To: dev@kafka.apache.org <dev@kafka.apache.org>
> > > > > >>>>>>>>>>>> Subject: Re: [DISCUSS] KIP-580: Exponential Backoff
> for
> > > > > >>> Kafka
> > > > > >>>>>>>> Clients
> > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>> Thanks for the explanation Boyang. One of the most common
> > > > > > >>>>>>>>>>>> problems that we have in Kafka is with respect to metadata
> > > > > > >>>>>>>>>>>> fetches. For example, if there is a broker failure, all
> > > > > > >>>>>>>>>>>> clients start to fetch metadata at the same time and it
> > > > > > >>>>>>>>>>>> often takes a while for the metadata to converge. In a
> > > > > > >>>>>>>>>>>> high load cluster, there are also issues where the volume
> > > > > > >>>>>>>>>>>> of metadata has made convergence of metadata slower.
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>> For this case, exponential backoff helps as it reduces the
> > > > > > >>>>>>>>>>>> retry rate and spaces out how often clients will retry,
> > > > > > >>>>>>>>>>>> thereby bringing down the time for convergence. Something
> > > > > > >>>>>>>>>>>> that Jason mentioned that would be a great addition here
> > > > > > >>>>>>>>>>>> would be if the backoff were “jittered” as it was in
> > > > > > >>>>>>>>>>>> KIP-144 with respect to exponential reconnect backoff.
> > > > > > >>>>>>>>>>>> This would help prevent the clients from being
> > > > > > >>>>>>>>>>>> synchronized on when they retry, thereby spacing out the
> > > > > > >>>>>>>>>>>> number of requests being sent to the broker at the same
> > > > > > >>>>>>>>>>>> time.
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>> I’ll add this example to the KIP and flesh out more of the
> > > > > > >>>>>>>>>>>> details so it’s clearer.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <
> > > > > >>>>>>>>> reluctanthero...@gmail.com
> > > > > >>>>>>>>>>>> ,
> > > > > >>>>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>>>> Thanks for the reply Sanjana. I guess I would like to
> > > > > > >>>>>>>>>>>>> rephrase my questions 2 and 3, as my previous response
> > > > > > >>>>>>>>>>>>> was a little bit unactionable.
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> My specific point is that exponential backoff is not a
> > > > > > >>>>>>>>>>>>> silver bullet and we should consider using it to solve
> > > > > > >>>>>>>>>>>>> known problems, instead of making holistic changes to
> > > > > > >>>>>>>>>>>>> all clients in the Kafka ecosystem. I do like the
> > > > > > >>>>>>>>>>>>> exponential backoff idea and believe this would be of
> > > > > > >>>>>>>>>>>>> great value, but maybe we should focus on proposing some
> > > > > > >>>>>>>>>>>>> existing modules that are suffering from static retry,
> > > > > > >>>>>>>>>>>>> and only change them in this first KIP. If in the future
> > > > > > >>>>>>>>>>>>> some other component users believe they are also
> > > > > > >>>>>>>>>>>>> suffering, we could get more minor KIPs to change the
> > > > > > >>>>>>>>>>>>> behavior as well.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Boyang
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <
> > > > > >>>>>>>>>>> skaundi...@gmail.com
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>> Thanks for the feedback Boyang, I will revise the KIP
> > > > > > >>>>>>>>>>>>>> with the mathematical relations as per your suggestion.
> > > > > > >>>>>>>>>>>>>> To address your feedback:
> > > > > > >>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>> 1. Currently, with the default of 100 ms per retry
> > > > > > >>>>>>>>>>>>>> backoff, in 1 second we would have 10 retries. In the
> > > > > > >>>>>>>>>>>>>> case of using an exponential backoff, we would have a
> > > > > > >>>>>>>>>>>>>> total of 4 retries in 1 second. Thus we have less than
> > > > > > >>>>>>>>>>>>>> half the number of retries in the same timeframe and
> > > > > > >>>>>>>>>>>>>> can lessen broker pressure. This calculation is done as
> > > > > > >>>>>>>>>>>>>> follows (using the formula laid out in the KIP):
> > > > > >>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>> Try 1 at time 0 ms, failures = 0, next retry in 100 ms (default retry ms is initially 100 ms)
> > > > > > >>>>>>>>>>>>>> Try 2 at time 100 ms, failures = 1, next retry in 200 ms
> > > > > > >>>>>>>>>>>>>> Try 3 at time 300 ms, failures = 2, next retry in 400 ms
> > > > > > >>>>>>>>>>>>>> Try 4 at time 700 ms, failures = 3, next retry in 800 ms
> > > > > > >>>>>>>>>>>>>> Try 5 at time 1500 ms, failures = 4, next retry in 1000 ms (default max retry ms is 1000 ms)
> > > > > >>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>> For 2 and 3, could you elaborate more about what you
> > > > > > >>>>>>>>>>>>>> mean with respect to client timeouts? I’m not very
> > > > > > >>>>>>>>>>>>>> familiar with the Streams framework, so I would love to
> > > > > > >>>>>>>>>>>>>> get more insight into how that currently works with
> > > > > > >>>>>>>>>>>>>> respect to producer transactions, so I can appropriately
> > > > > > >>>>>>>>>>>>>> update the KIP to address these scenarios.
> > > > > >>>>>>>>>>>>>> On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <
> > > > > >>>>>>>>>>>> reluctanthero...@gmail.com>,
> > > > > >>>>>>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>>>>>> Thanks for the KIP Sanjana. I think the motivation is
> > > > > > >>>>>>>>>>>>>>> good, but it lacks more quantitative analysis. For
> > > > > > >>>>>>>>>>>>>>> instance:
> > > > > > >>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>> 1. How many retries are we saving by applying the
> > > > > > >>>>>>>>>>>>>>> exponential retry vs static retry? There should be
> > > > > > >>>>>>>>>>>>>>> some mathematical relation between the static retry
> > > > > > >>>>>>>>>>>>>>> ms, the initial exponential retry ms, and the max
> > > > > > >>>>>>>>>>>>>>> exponential retry ms in a given time interval.
> > > > > > >>>>>>>>>>>>>>> 2. How does this affect the client timeout? With
> > > > > > >>>>>>>>>>>>>>> exponential retry, the client will more easily time
> > > > > > >>>>>>>>>>>>>>> out at a parent-level caller; for instance, Streams
> > > > > > >>>>>>>>>>>>>>> attempts to retry initializing producer transactions
> > > > > > >>>>>>>>>>>>>>> within a given 5 minute interval. With exponential
> > > > > > >>>>>>>>>>>>>>> retry this mechanism could experience more frequent
> > > > > > >>>>>>>>>>>>>>> timeouts, which we should be careful with.
> > > > > > >>>>>>>>>>>>>>> 3. With regard to #2, we should have a more detailed
> > > > > > >>>>>>>>>>>>>>> checklist of all the existing static retry scenarios,
> > > > > > >>>>>>>>>>>>>>> and adjust the initial exponential retry ms to make
> > > > > > >>>>>>>>>>>>>>> sure we won't easily time out at a high level due to
> > > > > > >>>>>>>>>>>>>>> too few attempts.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> Boyang
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> On Fri, Mar 13, 2020 at 4:38 PM Sanjana
> > > > > >> Kaundinya <
> > > > > >>>>>>>>>>>> skaundi...@gmail.com>
> > > > > >>>>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Hi Everyone,
> > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> I’ve written a KIP about introducing exponential
> > > > > > >>>>>>>>>>>>>>>> backoff for Kafka clients. Would appreciate any
> > > > > > >>>>>>>>>>>>>>>> feedback on this.
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>>>>>>> Sanjana
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> --
> > > > > >>>>>>>>>> -- Guozhang
> > > > > >>>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> --
> > > > > >>>>>>> -- Guozhang
> > > > > >>>>>>>
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> --
> > > > > >>>> -- Guozhang
> > > > > >>>
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >> -- Guozhang
> > > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
> -- Guozhang
>
