Hey Philip, that sounds good to me.

-Jason



On Thu, Apr 20, 2023 at 1:20 PM Philip Nee <philip...@gmail.com> wrote:

> Hey Jason,
>
> Thanks for the response. I agree with your suggestions. Just to clarify, we
> want the timeout, bootstrap.resolve.timeout.ms to only bound the DNS
> resolution time, and we can throw a fatal, BootstrapResolutionException (so
> not connection exception anymore) afterward.
>
> I think that aligns with the goal of this KIP.
>
> P
>
> On Thu, Apr 20, 2023 at 9:23 AM Jason Gustafson <ja...@confluent.io.invalid
> >
> wrote:
>
> > Hey Philip,
> >
> > Yeah, I see your point. Here is the challenge I'm considering. Today, the
> > client does not forward connection-related errors to the application. It
> is
> > debatable whether it should, but that is the behavior that applications
> > expect today. The only case of a fatal error is DNS resolution and the
> > application must handle it because they cannot even construct the client.
> > Once we have this KIP, we will introduce a separate fatal failure
> mechanism
> > through the `BootstrapConnectionException`. And we will use it not only
> for
> > DNS failures, but general bootstrap connection failures . The problem is
> > that existing applications will not know that this should be treated as a
> > fatal error. So they may continue to retry. Does it help in this
> situation
> > to poison the client so that it cannot be used? I'm not sure. Perhaps it
> > would reduce the risk a bit if we change `
> bootstrap.connection.timeout.ms`
> > to `bootstrap.resolve.timeout.ms` or something like that? Then we retain
> > the current behavior if DNS succeeds, but connections fail. I know I'm
> the
> > one that suggested generalizing the configuration, but I feel some
> > hesitation after thinking about it more.
> >
> > Thanks,
> > Jason
> >
> > On Wed, Apr 19, 2023 at 8:10 PM Philip Nee <philip...@gmail.com> wrote:
> >
> > > Hey Jason,
> > >
> > > Thanks for your review.  I think if we make it a retriable error, does
> it
> > > make sense to have a configurable timeout still? as we expect the user
> to
> > > continue to retry anyway.
> > >
> > > I'm considering the case of bad configuration. If the user retries the
> > > error, then we rely on the error/warning to alert the user.  In this
> > case,
> > > maybe we continue using the proposed behavior, i.e. warn on each poll
> > after
> > > the timeout period.
> > >
> > > If you agree that a configuration is needed, maybe we can call this
> > > *bootstrap.auto.retry.ms
> > > <http://bootstrap.auto.retry.ms> *instead, to indicate a configurable
> > > period of automatic retry. What do you think?
> > >
> > > Cheers,
> > > P
> > >
> > > On Wed, Apr 19, 2023 at 7:17 PM Jason Gustafson
> > <ja...@confluent.io.invalid
> > > >
> > > wrote:
> > >
> > > > Hey Phillip,
> > > >
> > > > The KIP looks good. 5 minutes seems like a reasonable tradeoff. I do
> > > wonder
> > > > if it is necessary to treat bootstrap timeout as a fatal error
> though.
> > It
> > > > seems possible that the exception might be caught by handlers in
> > existing
> > > > applications which may not expect that the client needs to be
> > restarted.
> > > > Perhaps it would be safer to make it retriable? As long as the
> > > application
> > > > continues trying to use the client, we could continue trying to reach
> > the
> > > > bootstrap servers perhaps? That would be closer to behavior today
> which
> > > > only treats DNS resolution failures as fatal. What do you think?
> > > >
> > > > Best,
> > > > Jason
> > > >
> > > > On Mon, Apr 10, 2023 at 1:53 PM Philip Nee <philip...@gmail.com>
> > wrote:
> > > >
> > > > > Thanks, everyone: I'm starting a vote today.  Here's the recap for
> > some
> > > > of
> > > > > the questions:
> > > > >
> > > > > John: I changed the proposal to throw a non-retriable exception
> after
> > > the
> > > > > timeout elapses. I feel it might be necessary to poison the client
> > > after
> > > > > retry expires, as it might indicate a real issue.
> > > > > Ismael: The proposal is to add a configuration for the retry and it
> > > will
> > > > > throw a non-retriable exception after the time expires.
> > > > > Chris: Addressed some unclarity that you mentioned, and a new API
> > won't
> > > > be
> > > > > introduced in this KIP.  Maybe up for future discussion.
> > > > > Jason: I'm proposing adding a timeout config and a bootstrap
> > exception
> > > > per
> > > > > your suggestion.
> > > > > Kirk: I'm proposing throwing a non-retriable exception in the
> network
> > > > > client. See previous comment.
> > > > >
> > > > >
> > > > > On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton
> > <chr...@aiven.io.invalid
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Philip,
> > > > > >
> > > > > > Yeah,  "DNS resolution should occur..." seems like a better fit.
> 👍
> > > > > >
> > > > > > One other question I have is whether we should expose some kind
> of
> > > > public
> > > > > > API for performing preflight validation of the bootstrap URLs. If
> > we
> > > > > change
> > > > > > the behavior of a client configured with a silly typo (e.g.,
> > > > > > "loclahost instead of localhost") from failing in the constructor
> > to
> > > > > > failing with a retriable exception, this might lead some client
> > > > > > applications to handle that failure by, well, retrying. For
> > > reference,
> > > > > this
> > > > > > is exactly what we do in Kafka Connect right now; see [1] and
> [2].
> > > IMO
> > > > > it'd
> > > > > > be nice to be able to opt into keeping the current behavior so
> that
> > > > > > projects like Connect could still do preflight checks of the
> > > > > > bootstrap.servers property for connectors before starting them,
> and
> > > > > report
> > > > > > any issues by failing fast instead of continuously writing
> > > > warning/error
> > > > > > messages to their logs.
> > > > > >
> > > > > > I'm not sure about where this new API could go, but a few options
> > > might
> > > > > be:
> > > > > >
> > > > > > - Expose a public variant of the existing ClientUtils class
> > > > > > - Add static methods to the ConsumerConfig, ProducerConfig, and
> > > > > > AdminClientConfig classes
> > > > > > - Add those same static methods to the KafkaConsumer,
> > KafkaProducer,
> > > > and
> > > > > > KafkaAdminClient classes
> > > > > >
> > > > > > If this seems reasonable, we should probably also specify in the
> > KIP
> > > > that
> > > > > > Kafka Connect will leverage this preflight validation logic
> before
> > > > > > instantiating any Kafka clients for use by connectors or tasks,
> and
> > > > > > continue to fail fast if there are typos in the bootstrap.servers
> > > > > property,
> > > > > > or if temporary DNS resolution issues come up.
> > > > > >
> > > > > > [1] -
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> > > > > > [2] -
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Chris
> > > > > >
> > > > > > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <philip...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Hey Chris,
> > > > > > >
> > > > > > > Thanks for the quick response, and I apologize for the unclear
> > > > wording
> > > > > > > there, I guess "DNS lookup" would be a more appropriate wording
> > > here.
> > > > > So
> > > > > > > what I meant there was, to delegate the DNS lookup in the
> > > constructor
> > > > > to
> > > > > > > the network client poll, and it will happen on the very first
> > poll.
> > > > I
> > > > > > > guess the logic could look like this:
> > > > > > >
> > > > > > > - if the client has been bootstrapped, do nothing.
> > > > > > > - Otherwise, perform DNS lookup, and acquire the bootstrap
> server
> > > > > > address.
> > > > > > >
> > > > > > > Thanks for the comment there, I'll change up the wording.
> Maybe
> > > > revise
> > > > > > it
> > > > > > > as "DNS resolution should occur in the poll...." ?
> > > > > > >
> > > > > > > P
> > > > > > >
> > > > > > > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton
> > > > <chr...@aiven.io.invalid
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Philip,
> > > > > > > >
> > > > > > > > Thanks for the KIP!
> > > > > > > >
> > > > > > > > QQ: In the "Proposed Changes" section, the KIP states that
> > > > > > "Bootstrapping
> > > > > > > > should now occur in the poll method before attempting to
> update
> > > the
> > > > > > > > metadata. This includes resolving the addresses and
> > bootstrapping
> > > > the
> > > > > > > > metadata.". By "bootstrapping the metadata" do we mean
> actually
> > > > > > > contacting
> > > > > > > > the bootstrap servers, or just setting some internal state
> > > related
> > > > to
> > > > > > the
> > > > > > > > current set of servers that can be contacted for metadata? I
> > ask
> > > > > > because
> > > > > > > it
> > > > > > > > seems like the language here implies the former, but if
> that's
> > > the
> > > > > > case,
> > > > > > > > this is already happening in poll (or at least, the first
> > > > invocation
> > > > > of
> > > > > > > > it), and if it's the latter, it's probably not necessary to
> > > mention
> > > > > in
> > > > > > > the
> > > > > > > > KIP since it doesn't really impact user-facing behavior. It
> > also
> > > > > seems
> > > > > > > like
> > > > > > > > that detail might impact how intertwined this and KIP-899
> are,
> > > > though
> > > > > > the
> > > > > > > > similarity could still be superficial either way.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > >
> > > > > > > > Chris
> > > > > > > >
> > > > > > > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <
> > philip...@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hey Ismael,
> > > > > > > > >
> > > > > > > > > Thanks for the feedback! The proposal is not to retry
> > > > automatically
> > > > > > but
> > > > > > > > > relies on the user polling the NetworkClient (basically,
> > > > > > consumer.poll)
> > > > > > > > to
> > > > > > > > > reattempt the bootstrap. If bootstrapping fails, a
> > > > NetworkException
> > > > > > > > > (retriable) will be thrown.
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > P
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <
> > ism...@juma.me.uk
> > > >
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for the KIP. Not sure if I missed it, but how long
> > > will
> > > > we
> > > > > > > retry
> > > > > > > > > for
> > > > > > > > > > and when do we give up and propagate the failure to the
> > user?
> > > > > > > > > >
> > > > > > > > > > Ismael
> > > > > > > > > >
> > > > > > > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <
> > > > philip...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all!
> > > > > > > > > > >
> > > > > > > > > > > I want to start a discussion thread about how we can
> > handle
> > > > > > client
> > > > > > > > > > > bootstrap failure due DNS lookup.  This requires a bit
> of
> > > > > > > behavioral
> > > > > > > > > > > change, so a KIP is proposed and attached to this
> email.
> > > Let
> > > > me
> > > > > > > know
> > > > > > > > > what
> > > > > > > > > > > you think!
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > *A small remark here*: *As the title of this KIP might
> > > sound
> > > > > > > > > > > familiar/similar to KIP-899, it is not the same.*
> > > > > > > > > > >
> > > > > > > > > > > *In Summary:* I want to propose a KIP to change the
> > > existing
> > > > > > > > bootstrap
> > > > > > > > > > > (upon instantiation) strategy because it is reasonable
> to
> > > > allow
> > > > > > > > clients
> > > > > > > > > > to
> > > > > > > > > > > retry
> > > > > > > > > > >
> > > > > > > > > > > *KIP: *
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> > > > > > > > > > >
> > > > > > > > > > > Thanks!
> > > > > > > > > > > Philip
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to