Hey Jason,

Thanks for the response. I agree with your suggestions. Just to clarify, we
want the timeout, bootstrap.resolve.timeout.ms to only bound the DNS
resolution time, and we can throw a fatal, BootstrapResolutionException (so
not connection exception anymore) afterward.

I think that aligns with the goal of this KIP.

P

On Thu, Apr 20, 2023 at 9:23 AM Jason Gustafson <ja...@confluent.io.invalid>
wrote:

> Hey Philip,
>
> Yeah, I see your point. Here is the challenge I'm considering. Today, the
> client does not forward connection-related errors to the application. It is
> debatable whether it should, but that is the behavior that applications
> expect today. The only case of a fatal error is DNS resolution and the
> application must handle it because they cannot even construct the client.
> Once we have this KIP, we will introduce a separate fatal failure mechanism
> through the `BootstrapConnectionException`. And we will use it not only for
> DNS failures, but general bootstrap connection failures . The problem is
> that existing applications will not know that this should be treated as a
> fatal error. So they may continue to retry. Does it help in this situation
> to poison the client so that it cannot be used? I'm not sure. Perhaps it
> would reduce the risk a bit if we change `bootstrap.connection.timeout.ms`
> to `bootstrap.resolve.timeout.ms` or something like that? Then we retain
> the current behavior if DNS succeeds, but connections fail. I know I'm the
> one that suggested generalizing the configuration, but I feel some
> hesitation after thinking about it more.
>
> Thanks,
> Jason
>
> On Wed, Apr 19, 2023 at 8:10 PM Philip Nee <philip...@gmail.com> wrote:
>
> > Hey Jason,
> >
> > Thanks for your review.  I think if we make it a retriable error, does it
> > make sense to have a configurable timeout still? as we expect the user to
> > continue to retry anyway.
> >
> > I'm considering the case of bad configuration. If the user retries the
> > error, then we rely on the error/warning to alert the user.  In this
> case,
> > maybe we continue using the proposed behavior, i.e. warn on each poll
> after
> > the timeout period.
> >
> > If you agree that a configuration is needed, maybe we can call this
> > *bootstrap.auto.retry.ms
> > <http://bootstrap.auto.retry.ms> *instead, to indicate a configurable
> > period of automatic retry. What do you think?
> >
> > Cheers,
> > P
> >
> > On Wed, Apr 19, 2023 at 7:17 PM Jason Gustafson
> <ja...@confluent.io.invalid
> > >
> > wrote:
> >
> > > Hey Phillip,
> > >
> > > The KIP looks good. 5 minutes seems like a reasonable tradeoff. I do
> > wonder
> > > if it is necessary to treat bootstrap timeout as a fatal error though.
> It
> > > seems possible that the exception might be caught by handlers in
> existing
> > > applications which may not expect that the client needs to be
> restarted.
> > > Perhaps it would be safer to make it retriable? As long as the
> > application
> > > continues trying to use the client, we could continue trying to reach
> the
> > > bootstrap servers perhaps? That would be closer to behavior today which
> > > only treats DNS resolution failures as fatal. What do you think?
> > >
> > > Best,
> > > Jason
> > >
> > > On Mon, Apr 10, 2023 at 1:53 PM Philip Nee <philip...@gmail.com>
> wrote:
> > >
> > > > Thanks, everyone: I'm starting a vote today.  Here's the recap for
> some
> > > of
> > > > the questions:
> > > >
> > > > John: I changed the proposal to throw a non-retriable exception after
> > the
> > > > timeout elapses. I feel it might be necessary to poison the client
> > after
> > > > retry expires, as it might indicate a real issue.
> > > > Ismael: The proposal is to add a configuration for the retry and it
> > will
> > > > throw a non-retriable exception after the time expires.
> > > > Chris: Addressed some unclarity that you mentioned, and a new API
> won't
> > > be
> > > > introduced in this KIP.  Maybe up for future discussion.
> > > > Jason: I'm proposing adding a timeout config and a bootstrap
> exception
> > > per
> > > > your suggestion.
> > > > Kirk: I'm proposing throwing a non-retriable exception in the network
> > > > client. See previous comment.
> > > >
> > > >
> > > > On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton
> <chr...@aiven.io.invalid
> > >
> > > > wrote:
> > > >
> > > > > Hi Philip,
> > > > >
> > > > > Yeah,  "DNS resolution should occur..." seems like a better fit. 👍
> > > > >
> > > > > One other question I have is whether we should expose some kind of
> > > public
> > > > > API for performing preflight validation of the bootstrap URLs. If
> we
> > > > change
> > > > > the behavior of a client configured with a silly typo (e.g.,
> > > > > "loclahost instead of localhost") from failing in the constructor
> to
> > > > > failing with a retriable exception, this might lead some client
> > > > > applications to handle that failure by, well, retrying. For
> > reference,
> > > > this
> > > > > is exactly what we do in Kafka Connect right now; see [1] and [2].
> > IMO
> > > > it'd
> > > > > be nice to be able to opt into keeping the current behavior so that
> > > > > projects like Connect could still do preflight checks of the
> > > > > bootstrap.servers property for connectors before starting them, and
> > > > report
> > > > > any issues by failing fast instead of continuously writing
> > > warning/error
> > > > > messages to their logs.
> > > > >
> > > > > I'm not sure about where this new API could go, but a few options
> > might
> > > > be:
> > > > >
> > > > > - Expose a public variant of the existing ClientUtils class
> > > > > - Add static methods to the ConsumerConfig, ProducerConfig, and
> > > > > AdminClientConfig classes
> > > > > - Add those same static methods to the KafkaConsumer,
> KafkaProducer,
> > > and
> > > > > KafkaAdminClient classes
> > > > >
> > > > > If this seems reasonable, we should probably also specify in the
> KIP
> > > that
> > > > > Kafka Connect will leverage this preflight validation logic before
> > > > > instantiating any Kafka clients for use by connectors or tasks, and
> > > > > continue to fail fast if there are typos in the bootstrap.servers
> > > > property,
> > > > > or if temporary DNS resolution issues come up.
> > > > >
> > > > > [1] -
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> > > > > [2] -
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Chris
> > > > >
> > > > > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <philip...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hey Chris,
> > > > > >
> > > > > > Thanks for the quick response, and I apologize for the unclear
> > > wording
> > > > > > there, I guess "DNS lookup" would be a more appropriate wording
> > here.
> > > > So
> > > > > > what I meant there was, to delegate the DNS lookup in the
> > constructor
> > > > to
> > > > > > the network client poll, and it will happen on the very first
> poll.
> > > I
> > > > > > guess the logic could look like this:
> > > > > >
> > > > > > - if the client has been bootstrapped, do nothing.
> > > > > > - Otherwise, perform DNS lookup, and acquire the bootstrap server
> > > > > address.
> > > > > >
> > > > > > Thanks for the comment there, I'll change up the wording.  Maybe
> > > revise
> > > > > it
> > > > > > as "DNS resolution should occur in the poll...." ?
> > > > > >
> > > > > > P
> > > > > >
> > > > > > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton
> > > <chr...@aiven.io.invalid
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Philip,
> > > > > > >
> > > > > > > Thanks for the KIP!
> > > > > > >
> > > > > > > QQ: In the "Proposed Changes" section, the KIP states that
> > > > > "Bootstrapping
> > > > > > > should now occur in the poll method before attempting to update
> > the
> > > > > > > metadata. This includes resolving the addresses and
> bootstrapping
> > > the
> > > > > > > metadata.". By "bootstrapping the metadata" do we mean actually
> > > > > > contacting
> > > > > > > the bootstrap servers, or just setting some internal state
> > related
> > > to
> > > > > the
> > > > > > > current set of servers that can be contacted for metadata? I
> ask
> > > > > because
> > > > > > it
> > > > > > > seems like the language here implies the former, but if that's
> > the
> > > > > case,
> > > > > > > this is already happening in poll (or at least, the first
> > > invocation
> > > > of
> > > > > > > it), and if it's the latter, it's probably not necessary to
> > mention
> > > > in
> > > > > > the
> > > > > > > KIP since it doesn't really impact user-facing behavior. It
> also
> > > > seems
> > > > > > like
> > > > > > > that detail might impact how intertwined this and KIP-899 are,
> > > though
> > > > > the
> > > > > > > similarity could still be superficial either way.
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Chris
> > > > > > >
> > > > > > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <
> philip...@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hey Ismael,
> > > > > > > >
> > > > > > > > Thanks for the feedback! The proposal is not to retry
> > > automatically
> > > > > but
> > > > > > > > relies on the user polling the NetworkClient (basically,
> > > > > consumer.poll)
> > > > > > > to
> > > > > > > > reattempt the bootstrap. If bootstrapping fails, a
> > > NetworkException
> > > > > > > > (retriable) will be thrown.
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > > P
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <
> ism...@juma.me.uk
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks for the KIP. Not sure if I missed it, but how long
> > will
> > > we
> > > > > > retry
> > > > > > > > for
> > > > > > > > > and when do we give up and propagate the failure to the
> user?
> > > > > > > > >
> > > > > > > > > Ismael
> > > > > > > > >
> > > > > > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <
> > > philip...@gmail.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi all!
> > > > > > > > > >
> > > > > > > > > > I want to start a discussion thread about how we can
> handle
> > > > > client
> > > > > > > > > > bootstrap failure due DNS lookup.  This requires a bit of
> > > > > > behavioral
> > > > > > > > > > change, so a KIP is proposed and attached to this email.
> > Let
> > > me
> > > > > > know
> > > > > > > > what
> > > > > > > > > > you think!
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > *A small remark here*: *As the title of this KIP might
> > sound
> > > > > > > > > > familiar/similar to KIP-899, it is not the same.*
> > > > > > > > > >
> > > > > > > > > > *In Summary:* I want to propose a KIP to change the
> > existing
> > > > > > > bootstrap
> > > > > > > > > > (upon instantiation) strategy because it is reasonable to
> > > allow
> > > > > > > clients
> > > > > > > > > to
> > > > > > > > > > retry
> > > > > > > > > >
> > > > > > > > > > *KIP: *
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> > > > > > > > > >
> > > > > > > > > > Thanks!
> > > > > > > > > > Philip
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to