Hi Philip,

> Having an overall timeout also seems reasonable, but I wonder what should
the client do after running out of the time? Should we throw a
non-retriable exception (instead of TimeoutExceptoin to stop the client
from retrying) and alert the user to examine the config and the DNS server?

Yeah, not sure exactly. I'd probably suggest a
`BootstrapConnectionException` or something like that with a clear message
indicating the problem. What the user does with it is up to them, but at
least it gives them the option to fail their application if that is what
they prefer to do in this case. If they catch it and ignore it, I would
expect the client to just continue retrying. Logging for bootstrap
dns/connection failures will be helpful in any case.

-Jason






On Tue, Feb 28, 2023 at 11:47 AM Philip Nee <philip...@gmail.com> wrote:

> Jason:
> Thanks for the feedback.  Now giving it a second thought, I think your
> suggestion of logging the error might make sense, as I could imagine most
> users would just continue to retry, so it might not be necessary to throw
> an exception anyway.
> Having an overall timeout also seems reasonable, but I wonder what should
> the client do after running out of the time? Should we throw a
> non-retriable exception (instead of TimeoutExceptoin to stop the client
> from retrying) and alert the user to examine the config and the DNS server?
>
> Chris:
> I feel I still haven't answered your question about the pre-flight check,
> as it seems exposing an API might be harder to push through.
>
> Thanks!
> P
>
> On Tue, Feb 28, 2023 at 10:53 AM Jason Gustafson
> <ja...@confluent.io.invalid>
> wrote:
>
> > One more random thought I had just as I pushed send. We're currently
> > treating this problem somewhat narrowly by focusing only on the DNS
> > resolution of the bootstrap servers. Even if the servers resolve, there's
> > no guarantee that they are reachable by the client. Would it make sense
> to
> > have a timeout which bounds the total time that the client should wait to
> > connect to the bootstrap servers? Something like `
> > bootstrap.servers.connection.timeout.ms`.
> >
> > -Jason
> >
> > On Tue, Feb 28, 2023 at 10:44 AM Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > Hi Philip,
> > >
> > > An alternative is not to fail at all. Every other network error is
> caught
> > > and handled internally in the client. We see this case as different
> > because
> > > a DNS resolution error may imply misconfiguration. Could it also imply
> > that
> > > the DNS server is unavailable? I'm not sure why that case should be
> > handled
> > > differently than if the bootstrap servers themselves are unavailable.
> > Would
> > > it be enough to log a clear warning in the logs if the bootstrap
> servers
> > > could not resolve?
> > >
> > > On the whole, the current fail-fast approach probably does more good
> than
> > > bad, but it does seem somewhat inconsistent overall and my guess is
> that
> > > dynamic environments will become increasingly common. It would be nice
> to
> > > have a reasonable default behavior which could handle these cases
> > > gracefully without any additional logic. In any case, it would be nice
> to
> > > see this option in the rejected alternatives at least if we do not take
> > it.
> > >
> > > If we want to take the route of throwing an exception, then I think
> we're
> > > probably going to need a new configuration since I can't see what a
> > > reasonable timeout we would use as a default. The benefit of a
> > > configuration is that it would let us retain the current default
> behavior
> > > with timeout effectively set to 0 and it would also let users
> effectively
> > > disable the timeout by using a very large value. Otherwise, it seems
> > like a
> > > potential compatibility break to have a new exception type thrown at
> some
> > > arbitrary time without giving the user any control over it.
> > >
> > > Thanks,
> > > Jason
> > >
> > >
> > > On Tue, Feb 28, 2023 at 8:08 AM Chris Egerton <chr...@aiven.io.invalid
> >
> > > wrote:
> > >
> > >> Hi Philip,
> > >>
> > >> Yeah, it's basically DNS resolution we're talking about, though
> there's
> > >> some additional subtlety there with the logic introduced by KIP-235
> [1].
> > >> Essentially it should cover any scenario that causes a client
> > constructor
> > >> to fail with the current logic but would not after this KIP is
> released.
> > >>
> > >> We can generalize the Connect use case like this: a client application
> > >> that
> > >> may connect to different Kafka clusters with a public-facing,
> > easy-to-use
> > >> API for restarting failed tasks and automatic handling of retriable
> > >> exceptions. The ease with which failed tasks can be restarted is
> > >> significant because it reduces the cost of failing on non-retriable
> > >> exceptions and makes fail-fast behavior easier to work with. And, in
> > cases
> > >> like this where we can't really know whether the error we're dealing
> > with
> > >> is retriable or not, it's better IMO to continue to allow applications
> > >> like
> > >> these to fail fast. I do agree that it'd be nice to get input from the
> > >> community, though.
> > >>
> > >> I was toying with the idea of a NetworkException subclass too. It's a
> > >> simpler API, but it doesn't allow for preflight validation, which can
> be
> > >> useful in scenarios where submitting new configurations for client
> > >> applications is expensive in terms of time or resources. Then again, I
> > >> don't see why the two are mutually exclusive, and we might opt to use
> > the
> > >> NetworkException subclass in this KIP and pursue an opt-in validation
> > API
> > >> later on. Thoughts?
> > >>
> > >> [1] -
> > >>
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-235%3A+Add+DNS+alias+support+for+secured+connection
> > >>
> > >> Cheers,
> > >>
> > >> Chris
> > >>
> > >> On Mon, Feb 27, 2023 at 7:06 PM Philip Nee <philip...@gmail.com>
> wrote:
> > >>
> > >> > Hey Chris,
> > >> >
> > >> > Thanks again for the feedback!
> > >> >
> > >> >
> > >> > For the preflight DNS check (are we basically trying to resolve the
> > DNS
> > >> > there?): Maybe it makes more sense to add it to the Config modules?
> I
> > >> would
> > >> > like to hear what the community says as I'm not familiar with the
> > >> Connect
> > >> > use case.
> > >> >
> > >> > A "slower failing" alternative - I wonder if it makes sense for us
> to
> > >> > extend the NetworkException so that clients can be smarter at
> handling
> > >> > these exceptions. Of course, it is still retriable and requires
> > polling
> > >> the
> > >> > consumer, but then we can distinguish the DNS resolution error from
> > >> other
> > >> > network errors.
> > >> >
> > >> > Thanks!
> > >> > P
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton
> <chr...@aiven.io.invalid
> > >
> > >> > wrote:
> > >> >
> > >> > > Hi Philip,
> > >> > >
> > >> > > Yeah,  "DNS resolution should occur..." seems like a better fit.
> 👍
> > >> > >
> > >> > > One other question I have is whether we should expose some kind of
> > >> public
> > >> > > API for performing preflight validation of the bootstrap URLs. If
> we
> > >> > change
> > >> > > the behavior of a client configured with a silly typo (e.g.,
> > >> > > "loclahost instead of localhost") from failing in the constructor
> to
> > >> > > failing with a retriable exception, this might lead some client
> > >> > > applications to handle that failure by, well, retrying. For
> > reference,
> > >> > this
> > >> > > is exactly what we do in Kafka Connect right now; see [1] and [2].
> > IMO
> > >> > it'd
> > >> > > be nice to be able to opt into keeping the current behavior so
> that
> > >> > > projects like Connect could still do preflight checks of the
> > >> > > bootstrap.servers property for connectors before starting them,
> and
> > >> > report
> > >> > > any issues by failing fast instead of continuously writing
> > >> warning/error
> > >> > > messages to their logs.
> > >> > >
> > >> > > I'm not sure about where this new API could go, but a few options
> > >> might
> > >> > be:
> > >> > >
> > >> > > - Expose a public variant of the existing ClientUtils class
> > >> > > - Add static methods to the ConsumerConfig, ProducerConfig, and
> > >> > > AdminClientConfig classes
> > >> > > - Add those same static methods to the KafkaConsumer,
> KafkaProducer,
> > >> and
> > >> > > KafkaAdminClient classes
> > >> > >
> > >> > > If this seems reasonable, we should probably also specify in the
> KIP
> > >> that
> > >> > > Kafka Connect will leverage this preflight validation logic before
> > >> > > instantiating any Kafka clients for use by connectors or tasks,
> and
> > >> > > continue to fail fast if there are typos in the bootstrap.servers
> > >> > property,
> > >> > > or if temporary DNS resolution issues come up.
> > >> > >
> > >> > > [1] -
> > >> > >
> > >> > >
> > >> >
> > >>
> >
> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> > >> > > [2] -
> > >> > >
> > >> > >
> > >> >
> > >>
> >
> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
> > >> > >
> > >> > > Cheers,
> > >> > >
> > >> > > Chris
> > >> > >
> > >> > > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <philip...@gmail.com>
> > >> wrote:
> > >> > >
> > >> > > > Hey Chris,
> > >> > > >
> > >> > > > Thanks for the quick response, and I apologize for the unclear
> > >> wording
> > >> > > > there, I guess "DNS lookup" would be a more appropriate wording
> > >> here.
> > >> > So
> > >> > > > what I meant there was, to delegate the DNS lookup in the
> > >> constructor
> > >> > to
> > >> > > > the network client poll, and it will happen on the very first
> > >> poll.  I
> > >> > > > guess the logic could look like this:
> > >> > > >
> > >> > > > - if the client has been bootstrapped, do nothing.
> > >> > > > - Otherwise, perform DNS lookup, and acquire the bootstrap
> server
> > >> > > address.
> > >> > > >
> > >> > > > Thanks for the comment there, I'll change up the wording.  Maybe
> > >> revise
> > >> > > it
> > >> > > > as "DNS resolution should occur in the poll...." ?
> > >> > > >
> > >> > > > P
> > >> > > >
> > >> > > > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton
> > >> <chr...@aiven.io.invalid
> > >> > >
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Hi Philip,
> > >> > > > >
> > >> > > > > Thanks for the KIP!
> > >> > > > >
> > >> > > > > QQ: In the "Proposed Changes" section, the KIP states that
> > >> > > "Bootstrapping
> > >> > > > > should now occur in the poll method before attempting to
> update
> > >> the
> > >> > > > > metadata. This includes resolving the addresses and
> > bootstrapping
> > >> the
> > >> > > > > metadata.". By "bootstrapping the metadata" do we mean
> actually
> > >> > > > contacting
> > >> > > > > the bootstrap servers, or just setting some internal state
> > >> related to
> > >> > > the
> > >> > > > > current set of servers that can be contacted for metadata? I
> ask
> > >> > > because
> > >> > > > it
> > >> > > > > seems like the language here implies the former, but if that's
> > the
> > >> > > case,
> > >> > > > > this is already happening in poll (or at least, the first
> > >> invocation
> > >> > of
> > >> > > > > it), and if it's the latter, it's probably not necessary to
> > >> mention
> > >> > in
> > >> > > > the
> > >> > > > > KIP since it doesn't really impact user-facing behavior. It
> also
> > >> > seems
> > >> > > > like
> > >> > > > > that detail might impact how intertwined this and KIP-899 are,
> > >> though
> > >> > > the
> > >> > > > > similarity could still be superficial either way.
> > >> > > > >
> > >> > > > > Cheers,
> > >> > > > >
> > >> > > > > Chris
> > >> > > > >
> > >> > > > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <
> philip...@gmail.com
> > >
> > >> > > wrote:
> > >> > > > >
> > >> > > > > > Hey Ismael,
> > >> > > > > >
> > >> > > > > > Thanks for the feedback! The proposal is not to retry
> > >> automatically
> > >> > > but
> > >> > > > > > relies on the user polling the NetworkClient (basically,
> > >> > > consumer.poll)
> > >> > > > > to
> > >> > > > > > reattempt the bootstrap. If bootstrapping fails, a
> > >> NetworkException
> > >> > > > > > (retriable) will be thrown.
> > >> > > > > >
> > >> > > > > > Thanks!
> > >> > > > > > P
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <
> > ism...@juma.me.uk>
> > >> > > wrote:
> > >> > > > > >
> > >> > > > > > > Thanks for the KIP. Not sure if I missed it, but how long
> > >> will we
> > >> > > > retry
> > >> > > > > > for
> > >> > > > > > > and when do we give up and propagate the failure to the
> > user?
> > >> > > > > > >
> > >> > > > > > > Ismael
> > >> > > > > > >
> > >> > > > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <
> > >> philip...@gmail.com>
> > >> > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Hi all!
> > >> > > > > > > >
> > >> > > > > > > > I want to start a discussion thread about how we can
> > handle
> > >> > > client
> > >> > > > > > > > bootstrap failure due DNS lookup.  This requires a bit
> of
> > >> > > > behavioral
> > >> > > > > > > > change, so a KIP is proposed and attached to this email.
> > >> Let me
> > >> > > > know
> > >> > > > > > what
> > >> > > > > > > > you think!
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > *A small remark here*: *As the title of this KIP might
> > sound
> > >> > > > > > > > familiar/similar to KIP-899, it is not the same.*
> > >> > > > > > > >
> > >> > > > > > > > *In Summary:* I want to propose a KIP to change the
> > existing
> > >> > > > > bootstrap
> > >> > > > > > > > (upon instantiation) strategy because it is reasonable
> to
> > >> allow
> > >> > > > > clients
> > >> > > > > > > to
> > >> > > > > > > > retry
> > >> > > > > > > >
> > >> > > > > > > > *KIP: *
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> > >> > > > > > > >
> > >> > > > > > > > Thanks!
> > >> > > > > > > > Philip
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

Reply via email to