Hey Philip,

Yeah, I see your point. Here is the challenge I'm considering. Today, the
client does not forward connection-related errors to the application. It is
debatable whether it should, but that is the behavior that applications
expect today. The only case of a fatal error is DNS resolution and the
application must handle it because they cannot even construct the client.
Once we have this KIP, we will introduce a separate fatal failure mechanism
through the `BootstrapConnectionException`. And we will use it not only for
DNS failures, but general bootstrap connection failures . The problem is
that existing applications will not know that this should be treated as a
fatal error. So they may continue to retry. Does it help in this situation
to poison the client so that it cannot be used? I'm not sure. Perhaps it
would reduce the risk a bit if we change `bootstrap.connection.timeout.ms`
to `bootstrap.resolve.timeout.ms` or something like that? Then we retain
the current behavior if DNS succeeds, but connections fail. I know I'm the
one that suggested generalizing the configuration, but I feel some
hesitation after thinking about it more.

Thanks,
Jason

On Wed, Apr 19, 2023 at 8:10 PM Philip Nee <philip...@gmail.com> wrote:

> Hey Jason,
>
> Thanks for your review.  I think if we make it a retriable error, does it
> make sense to have a configurable timeout still? as we expect the user to
> continue to retry anyway.
>
> I'm considering the case of bad configuration. If the user retries the
> error, then we rely on the error/warning to alert the user.  In this case,
> maybe we continue using the proposed behavior, i.e. warn on each poll after
> the timeout period.
>
> If you agree that a configuration is needed, maybe we can call this
> *bootstrap.auto.retry.ms
> <http://bootstrap.auto.retry.ms> *instead, to indicate a configurable
> period of automatic retry. What do you think?
>
> Cheers,
> P
>
> On Wed, Apr 19, 2023 at 7:17 PM Jason Gustafson <ja...@confluent.io.invalid
> >
> wrote:
>
> > Hey Phillip,
> >
> > The KIP looks good. 5 minutes seems like a reasonable tradeoff. I do
> wonder
> > if it is necessary to treat bootstrap timeout as a fatal error though. It
> > seems possible that the exception might be caught by handlers in existing
> > applications which may not expect that the client needs to be restarted.
> > Perhaps it would be safer to make it retriable? As long as the
> application
> > continues trying to use the client, we could continue trying to reach the
> > bootstrap servers perhaps? That would be closer to behavior today which
> > only treats DNS resolution failures as fatal. What do you think?
> >
> > Best,
> > Jason
> >
> > On Mon, Apr 10, 2023 at 1:53 PM Philip Nee <philip...@gmail.com> wrote:
> >
> > > Thanks, everyone: I'm starting a vote today.  Here's the recap for some
> > of
> > > the questions:
> > >
> > > John: I changed the proposal to throw a non-retriable exception after
> the
> > > timeout elapses. I feel it might be necessary to poison the client
> after
> > > retry expires, as it might indicate a real issue.
> > > Ismael: The proposal is to add a configuration for the retry and it
> will
> > > throw a non-retriable exception after the time expires.
> > > Chris: Addressed some unclarity that you mentioned, and a new API won't
> > be
> > > introduced in this KIP.  Maybe up for future discussion.
> > > Jason: I'm proposing adding a timeout config and a bootstrap exception
> > per
> > > your suggestion.
> > > Kirk: I'm proposing throwing a non-retriable exception in the network
> > > client. See previous comment.
> > >
> > >
> > > On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton <chr...@aiven.io.invalid
> >
> > > wrote:
> > >
> > > > Hi Philip,
> > > >
> > > > Yeah,  "DNS resolution should occur..." seems like a better fit. 👍
> > > >
> > > > One other question I have is whether we should expose some kind of
> > public
> > > > API for performing preflight validation of the bootstrap URLs. If we
> > > change
> > > > the behavior of a client configured with a silly typo (e.g.,
> > > > "loclahost instead of localhost") from failing in the constructor to
> > > > failing with a retriable exception, this might lead some client
> > > > applications to handle that failure by, well, retrying. For
> reference,
> > > this
> > > > is exactly what we do in Kafka Connect right now; see [1] and [2].
> IMO
> > > it'd
> > > > be nice to be able to opt into keeping the current behavior so that
> > > > projects like Connect could still do preflight checks of the
> > > > bootstrap.servers property for connectors before starting them, and
> > > report
> > > > any issues by failing fast instead of continuously writing
> > warning/error
> > > > messages to their logs.
> > > >
> > > > I'm not sure about where this new API could go, but a few options
> might
> > > be:
> > > >
> > > > - Expose a public variant of the existing ClientUtils class
> > > > - Add static methods to the ConsumerConfig, ProducerConfig, and
> > > > AdminClientConfig classes
> > > > - Add those same static methods to the KafkaConsumer, KafkaProducer,
> > and
> > > > KafkaAdminClient classes
> > > >
> > > > If this seems reasonable, we should probably also specify in the KIP
> > that
> > > > Kafka Connect will leverage this preflight validation logic before
> > > > instantiating any Kafka clients for use by connectors or tasks, and
> > > > continue to fail fast if there are typos in the bootstrap.servers
> > > property,
> > > > or if temporary DNS resolution issues come up.
> > > >
> > > > [1] -
> > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> > > > [2] -
> > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <philip...@gmail.com>
> > wrote:
> > > >
> > > > > Hey Chris,
> > > > >
> > > > > Thanks for the quick response, and I apologize for the unclear
> > wording
> > > > > there, I guess "DNS lookup" would be a more appropriate wording
> here.
> > > So
> > > > > what I meant there was, to delegate the DNS lookup in the
> constructor
> > > to
> > > > > the network client poll, and it will happen on the very first poll.
> > I
> > > > > guess the logic could look like this:
> > > > >
> > > > > - if the client has been bootstrapped, do nothing.
> > > > > - Otherwise, perform DNS lookup, and acquire the bootstrap server
> > > > address.
> > > > >
> > > > > Thanks for the comment there, I'll change up the wording.  Maybe
> > revise
> > > > it
> > > > > as "DNS resolution should occur in the poll...." ?
> > > > >
> > > > > P
> > > > >
> > > > > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton
> > <chr...@aiven.io.invalid
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Philip,
> > > > > >
> > > > > > Thanks for the KIP!
> > > > > >
> > > > > > QQ: In the "Proposed Changes" section, the KIP states that
> > > > "Bootstrapping
> > > > > > should now occur in the poll method before attempting to update
> the
> > > > > > metadata. This includes resolving the addresses and bootstrapping
> > the
> > > > > > metadata.". By "bootstrapping the metadata" do we mean actually
> > > > > contacting
> > > > > > the bootstrap servers, or just setting some internal state
> related
> > to
> > > > the
> > > > > > current set of servers that can be contacted for metadata? I ask
> > > > because
> > > > > it
> > > > > > seems like the language here implies the former, but if that's
> the
> > > > case,
> > > > > > this is already happening in poll (or at least, the first
> > invocation
> > > of
> > > > > > it), and if it's the latter, it's probably not necessary to
> mention
> > > in
> > > > > the
> > > > > > KIP since it doesn't really impact user-facing behavior. It also
> > > seems
> > > > > like
> > > > > > that detail might impact how intertwined this and KIP-899 are,
> > though
> > > > the
> > > > > > similarity could still be superficial either way.
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Chris
> > > > > >
> > > > > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <philip...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Hey Ismael,
> > > > > > >
> > > > > > > Thanks for the feedback! The proposal is not to retry
> > automatically
> > > > but
> > > > > > > relies on the user polling the NetworkClient (basically,
> > > > consumer.poll)
> > > > > > to
> > > > > > > reattempt the bootstrap. If bootstrapping fails, a
> > NetworkException
> > > > > > > (retriable) will be thrown.
> > > > > > >
> > > > > > > Thanks!
> > > > > > > P
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <ism...@juma.me.uk
> >
> > > > wrote:
> > > > > > >
> > > > > > > > Thanks for the KIP. Not sure if I missed it, but how long
> will
> > we
> > > > > retry
> > > > > > > for
> > > > > > > > and when do we give up and propagate the failure to the user?
> > > > > > > >
> > > > > > > > Ismael
> > > > > > > >
> > > > > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <
> > philip...@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi all!
> > > > > > > > >
> > > > > > > > > I want to start a discussion thread about how we can handle
> > > > client
> > > > > > > > > bootstrap failure due DNS lookup.  This requires a bit of
> > > > > behavioral
> > > > > > > > > change, so a KIP is proposed and attached to this email.
> Let
> > me
> > > > > know
> > > > > > > what
> > > > > > > > > you think!
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > *A small remark here*: *As the title of this KIP might
> sound
> > > > > > > > > familiar/similar to KIP-899, it is not the same.*
> > > > > > > > >
> > > > > > > > > *In Summary:* I want to propose a KIP to change the
> existing
> > > > > > bootstrap
> > > > > > > > > (upon instantiation) strategy because it is reasonable to
> > allow
> > > > > > clients
> > > > > > > > to
> > > > > > > > > retry
> > > > > > > > >
> > > > > > > > > *KIP: *
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > Philip
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to