Hey Jason,

Thanks again. I've updated the KIP.

P

On Fri, Apr 21, 2023 at 9:56 AM Jason Gustafson <ja...@confluent.io.invalid>
wrote:

> Hey Philip, that sounds good to me.
>
> -Jason
>
>
>
> On Thu, Apr 20, 2023 at 1:20 PM Philip Nee <philip...@gmail.com> wrote:
>
> > Hey Jason,
> >
> > Thanks for the response. I agree with your suggestions. Just to clarify,
> we
> > want the timeout, bootstrap.resolve.timeout.ms to only bound the DNS
> > resolution time, and we can throw a fatal, BootstrapResolutionException
> (so
> > not connection exception anymore) afterward.
> >
> > I think that aligns with the goal of this KIP.
> >
> > P
> >
> > On Thu, Apr 20, 2023 at 9:23 AM Jason Gustafson
> <ja...@confluent.io.invalid
> > >
> > wrote:
> >
> > > Hey Philip,
> > >
> > > Yeah, I see your point. Here is the challenge I'm considering. Today,
> the
> > > client does not forward connection-related errors to the application.
> It
> > is
> > > debatable whether it should, but that is the behavior that applications
> > > expect today. The only case of a fatal error is DNS resolution and the
> > > application must handle it because they cannot even construct the
> client.
> > > Once we have this KIP, we will introduce a separate fatal failure
> > mechanism
> > > through the `BootstrapConnectionException`. And we will use it not only
> > for
> > > DNS failures, but general bootstrap connection failures . The problem
> is
> > > that existing applications will not know that this should be treated
> as a
> > > fatal error. So they may continue to retry. Does it help in this
> > situation
> > > to poison the client so that it cannot be used? I'm not sure. Perhaps
> it
> > > would reduce the risk a bit if we change `
> > bootstrap.connection.timeout.ms`
> > > to `bootstrap.resolve.timeout.ms` or something like that? Then we
> retain
> > > the current behavior if DNS succeeds, but connections fail. I know I'm
> > the
> > > one that suggested generalizing the configuration, but I feel some
> > > hesitation after thinking about it more.
> > >
> > > Thanks,
> > > Jason
> > >
> > > On Wed, Apr 19, 2023 at 8:10 PM Philip Nee <philip...@gmail.com>
> wrote:
> > >
> > > > Hey Jason,
> > > >
> > > > Thanks for your review.  I think if we make it a retriable error,
> does
> > it
> > > > make sense to have a configurable timeout still? as we expect the
> user
> > to
> > > > continue to retry anyway.
> > > >
> > > > I'm considering the case of bad configuration. If the user retries
> the
> > > > error, then we rely on the error/warning to alert the user.  In this
> > > case,
> > > > maybe we continue using the proposed behavior, i.e. warn on each poll
> > > after
> > > > the timeout period.
> > > >
> > > > If you agree that a configuration is needed, maybe we can call this
> > > > *bootstrap.auto.retry.ms
> > > > <http://bootstrap.auto.retry.ms> *instead, to indicate a
> configurable
> > > > period of automatic retry. What do you think?
> > > >
> > > > Cheers,
> > > > P
> > > >
> > > > On Wed, Apr 19, 2023 at 7:17 PM Jason Gustafson
> > > <ja...@confluent.io.invalid
> > > > >
> > > > wrote:
> > > >
> > > > > Hey Phillip,
> > > > >
> > > > > The KIP looks good. 5 minutes seems like a reasonable tradeoff. I
> do
> > > > wonder
> > > > > if it is necessary to treat bootstrap timeout as a fatal error
> > though.
> > > It
> > > > > seems possible that the exception might be caught by handlers in
> > > existing
> > > > > applications which may not expect that the client needs to be
> > > restarted.
> > > > > Perhaps it would be safer to make it retriable? As long as the
> > > > application
> > > > > continues trying to use the client, we could continue trying to
> reach
> > > the
> > > > > bootstrap servers perhaps? That would be closer to behavior today
> > which
> > > > > only treats DNS resolution failures as fatal. What do you think?
> > > > >
> > > > > Best,
> > > > > Jason
> > > > >
> > > > > On Mon, Apr 10, 2023 at 1:53 PM Philip Nee <philip...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Thanks, everyone: I'm starting a vote today.  Here's the recap
> for
> > > some
> > > > > of
> > > > > > the questions:
> > > > > >
> > > > > > John: I changed the proposal to throw a non-retriable exception
> > after
> > > > the
> > > > > > timeout elapses. I feel it might be necessary to poison the
> client
> > > > after
> > > > > > retry expires, as it might indicate a real issue.
> > > > > > Ismael: The proposal is to add a configuration for the retry and
> it
> > > > will
> > > > > > throw a non-retriable exception after the time expires.
> > > > > > Chris: Addressed some unclarity that you mentioned, and a new API
> > > won't
> > > > > be
> > > > > > introduced in this KIP.  Maybe up for future discussion.
> > > > > > Jason: I'm proposing adding a timeout config and a bootstrap
> > > exception
> > > > > per
> > > > > > your suggestion.
> > > > > > Kirk: I'm proposing throwing a non-retriable exception in the
> > network
> > > > > > client. See previous comment.
> > > > > >
> > > > > >
> > > > > > On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton
> > > <chr...@aiven.io.invalid
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Philip,
> > > > > > >
> > > > > > > Yeah,  "DNS resolution should occur..." seems like a better
> fit.
> > 👍
> > > > > > >
> > > > > > > One other question I have is whether we should expose some kind
> > of
> > > > > public
> > > > > > > API for performing preflight validation of the bootstrap URLs.
> If
> > > we
> > > > > > change
> > > > > > > the behavior of a client configured with a silly typo (e.g.,
> > > > > > > "loclahost instead of localhost") from failing in the
> constructor
> > > to
> > > > > > > failing with a retriable exception, this might lead some client
> > > > > > > applications to handle that failure by, well, retrying. For
> > > > reference,
> > > > > > this
> > > > > > > is exactly what we do in Kafka Connect right now; see [1] and
> > [2].
> > > > IMO
> > > > > > it'd
> > > > > > > be nice to be able to opt into keeping the current behavior so
> > that
> > > > > > > projects like Connect could still do preflight checks of the
> > > > > > > bootstrap.servers property for connectors before starting them,
> > and
> > > > > > report
> > > > > > > any issues by failing fast instead of continuously writing
> > > > > warning/error
> > > > > > > messages to their logs.
> > > > > > >
> > > > > > > I'm not sure about where this new API could go, but a few
> options
> > > > might
> > > > > > be:
> > > > > > >
> > > > > > > - Expose a public variant of the existing ClientUtils class
> > > > > > > - Add static methods to the ConsumerConfig, ProducerConfig, and
> > > > > > > AdminClientConfig classes
> > > > > > > - Add those same static methods to the KafkaConsumer,
> > > KafkaProducer,
> > > > > and
> > > > > > > KafkaAdminClient classes
> > > > > > >
> > > > > > > If this seems reasonable, we should probably also specify in
> the
> > > KIP
> > > > > that
> > > > > > > Kafka Connect will leverage this preflight validation logic
> > before
> > > > > > > instantiating any Kafka clients for use by connectors or tasks,
> > and
> > > > > > > continue to fail fast if there are typos in the
> bootstrap.servers
> > > > > > property,
> > > > > > > or if temporary DNS resolution issues come up.
> > > > > > >
> > > > > > > [1] -
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> > > > > > > [2] -
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Chris
> > > > > > >
> > > > > > > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <
> philip...@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hey Chris,
> > > > > > > >
> > > > > > > > Thanks for the quick response, and I apologize for the
> unclear
> > > > > wording
> > > > > > > > there, I guess "DNS lookup" would be a more appropriate
> wording
> > > > here.
> > > > > > So
> > > > > > > > what I meant there was, to delegate the DNS lookup in the
> > > > constructor
> > > > > > to
> > > > > > > > the network client poll, and it will happen on the very first
> > > poll.
> > > > > I
> > > > > > > > guess the logic could look like this:
> > > > > > > >
> > > > > > > > - if the client has been bootstrapped, do nothing.
> > > > > > > > - Otherwise, perform DNS lookup, and acquire the bootstrap
> > server
> > > > > > > address.
> > > > > > > >
> > > > > > > > Thanks for the comment there, I'll change up the wording.
> > Maybe
> > > > > revise
> > > > > > > it
> > > > > > > > as "DNS resolution should occur in the poll...." ?
> > > > > > > >
> > > > > > > > P
> > > > > > > >
> > > > > > > > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton
> > > > > <chr...@aiven.io.invalid
> > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Philip,
> > > > > > > > >
> > > > > > > > > Thanks for the KIP!
> > > > > > > > >
> > > > > > > > > QQ: In the "Proposed Changes" section, the KIP states that
> > > > > > > "Bootstrapping
> > > > > > > > > should now occur in the poll method before attempting to
> > update
> > > > the
> > > > > > > > > metadata. This includes resolving the addresses and
> > > bootstrapping
> > > > > the
> > > > > > > > > metadata.". By "bootstrapping the metadata" do we mean
> > actually
> > > > > > > > contacting
> > > > > > > > > the bootstrap servers, or just setting some internal state
> > > > related
> > > > > to
> > > > > > > the
> > > > > > > > > current set of servers that can be contacted for metadata?
> I
> > > ask
> > > > > > > because
> > > > > > > > it
> > > > > > > > > seems like the language here implies the former, but if
> > that's
> > > > the
> > > > > > > case,
> > > > > > > > > this is already happening in poll (or at least, the first
> > > > > invocation
> > > > > > of
> > > > > > > > > it), and if it's the latter, it's probably not necessary to
> > > > mention
> > > > > > in
> > > > > > > > the
> > > > > > > > > KIP since it doesn't really impact user-facing behavior. It
> > > also
> > > > > > seems
> > > > > > > > like
> > > > > > > > > that detail might impact how intertwined this and KIP-899
> > are,
> > > > > though
> > > > > > > the
> > > > > > > > > similarity could still be superficial either way.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > >
> > > > > > > > > Chris
> > > > > > > > >
> > > > > > > > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <
> > > philip...@gmail.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hey Ismael,
> > > > > > > > > >
> > > > > > > > > > Thanks for the feedback! The proposal is not to retry
> > > > > automatically
> > > > > > > but
> > > > > > > > > > relies on the user polling the NetworkClient (basically,
> > > > > > > consumer.poll)
> > > > > > > > > to
> > > > > > > > > > reattempt the bootstrap. If bootstrapping fails, a
> > > > > NetworkException
> > > > > > > > > > (retriable) will be thrown.
> > > > > > > > > >
> > > > > > > > > > Thanks!
> > > > > > > > > > P
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <
> > > ism...@juma.me.uk
> > > > >
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks for the KIP. Not sure if I missed it, but how
> long
> > > > will
> > > > > we
> > > > > > > > retry
> > > > > > > > > > for
> > > > > > > > > > > and when do we give up and propagate the failure to the
> > > user?
> > > > > > > > > > >
> > > > > > > > > > > Ismael
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <
> > > > > philip...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi all!
> > > > > > > > > > > >
> > > > > > > > > > > > I want to start a discussion thread about how we can
> > > handle
> > > > > > > client
> > > > > > > > > > > > bootstrap failure due DNS lookup.  This requires a
> bit
> > of
> > > > > > > > behavioral
> > > > > > > > > > > > change, so a KIP is proposed and attached to this
> > email.
> > > > Let
> > > > > me
> > > > > > > > know
> > > > > > > > > > what
> > > > > > > > > > > > you think!
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > *A small remark here*: *As the title of this KIP
> might
> > > > sound
> > > > > > > > > > > > familiar/similar to KIP-899, it is not the same.*
> > > > > > > > > > > >
> > > > > > > > > > > > *In Summary:* I want to propose a KIP to change the
> > > > existing
> > > > > > > > > bootstrap
> > > > > > > > > > > > (upon instantiation) strategy because it is
> reasonable
> > to
> > > > > allow
> > > > > > > > > clients
> > > > > > > > > > > to
> > > > > > > > > > > > retry
> > > > > > > > > > > >
> > > > > > > > > > > > *KIP: *
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks!
> > > > > > > > > > > > Philip
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to