Re: [DISCUSS] KIP-882: Make Kafka Connect REST API request timeouts configurable

Yash Mayya Wed, 01 Mar 2023 05:06:54 -0800

Hi all,

Thanks for all the feedback and discussion. I've renamed the KIP title to
"Kafka Connect REST API timeout improvements" since we're introducing a
couple of improvements (cancelling create / update connector requests when
config validations timeout and avoiding double config validations in
distributed mode) along with making the request timeouts configurable. The
new KIP link is
https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Kafka+Connect+REST+API+timeout+improvements
and I've called for a vote for the KIP here -
https://lists.apache.org/thread/mgx8lczx2f57pk7x3vh0nqk00s79grgp.


Thanks,
Yash

On Sat, Nov 5, 2022 at 11:42 PM Sagar <[email protected]> wrote:

> Hey Yash,
>
> Thanks for the explanation. I think it should be fine to delegate the
> validation directly to the leader.
>
> Thanks!
> Sagar.
>
> On Sat, Nov 5, 2022 at 10:42 AM Yash Mayya <[email protected]> wrote:
>
> > Hi Sagar,
> >
> > Thanks for chiming in!
> >
> > > Having said that, why does the worker forward to the
> > > leader? I am thinking if the worker can perform the validation on it's
> > own,
> > > we could let it do the validation instead of forwarding everything to
> the
> > > leader
> >
> > Only the leader is allowed to perform writes to the config topic, so any
> > request that requires a config topic write ends up being forwarded to the
> > leader. The `POST /connectors` and `PUT /connectors/{connector}/config`
> > endpoints call `Herder::putConnectorConfig` which internally does a
> config
> > validation first before initiating another herder request to write to the
> > config topic (in distributed mode). If we want to allow the first worker
> to
> > do the config validation, and then forward the request to the leader just
> > for the write to the config topic, we'd either need something like a skip
> > validations query parameter on the endpoint like Chris talks about above
> or
> > else a new internal only endpoint that just does a write to the config
> > topic without any prior config validation. However, we agreed that this
> > optimization doesn't really seem necessary for now and can be done later
> if
> > deemed useful. I'd be happy to hear differing thoughts if any, however.
> >
> > > I think a bound is certainly needed but IMO it shouldn't go beyond
> > > 10 mins considering this is just validation
> >
> > Yeah, I agree that this seems like a fair value - I've elected to go
> with a
> > default value of 10 minutes for the proposed worker configuration that
> sets
> > an upper bound for the timeout query parameter.
> >
> > Thanks,
> > Yash
> >
> > On Sat, Nov 5, 2022 at 10:30 AM Yash Mayya <[email protected]> wrote:
> >
> > > Hi Chris,
> > >
> > > Thanks again for your feedback. I think a worker configuration for the
> > > upper bound makes sense - I initially thought we could hardcode it
> (just
> > > like the current request timeout is), but there's no reason to set
> > another
> > > artificial bound that isn't user configurable which is exactly what
> we're
> > > trying to change here in the first place. I've updated the KIP based on
> > all
> > > our discussion above.
> > >
> > > Thanks,
> > > Yash
> > >
> > > On Thu, Nov 3, 2022 at 11:01 PM Sagar <[email protected]>
> wrote:
> > >
> > >> Hey Yash,
> > >>
> > >> Thanks for the KIP! This looks like a useful feature.
> > >>
> > >> I think the discussion thread already has some great points by Chris.
> > Just
> > >> a couple of points/clarifications=>
> > >>
> > >> Regarding, pt#2 , I guess it might be better to forward to the leader
> as
> > >> suggested by Yash. Having said that, why does the worker forward to
> the
> > >> leader? I am thinking if the worker can perform the validation on it's
> > >> own,
> > >> we could let it do the validation instead of forwarding everything to
> > the
> > >> leader(even though it might be cheap to forward all requests to the
> > >> leader).
> > >>
> > >> Pt#3 => I think a bound is certainly needed but IMO it shouldn't go
> > beyond
> > >> 10 mins considering this is just validation. We shouldn't end up in a
> > >> situation where a few faulty connectors end up blocking a lot of
> request
> > >> processing threads, so while increasing the config is certainly
> helpful,
> > >> we
> > >> shouldn't set too high a value IMO. Of course I am also open to
> > >> suggestions
> > >> here.
> > >>
> > >> Thanks!
> > >> Sagar.
> > >>
> > >> On Thu, Nov 3, 2022 at 9:01 PM Chris Egerton <[email protected]
> >
> > >> wrote:
> > >>
> > >> > Hi Yash,
> > >> >
> > >> > RE 2: That's a great point about validations already being performed
> > by
> > >> the
> > >> > leader. For completeness's sake, I'd like to note that this only
> holds
> > >> for
> > >> > valid configurations; invalid ones are caught right now before being
> > >> > forwarded to the leader. Still, I think it's fine to forward to the
> > >> leader
> > >> > for now and optimize further in the future if necessary. If frequent
> > >> > validations are taking place they should be conducted via the `PUT
> > >> > /connector-plugins/{pluginName}/config/validate` endpoint, which
> won't
> > >> do
> > >> > any forwarding at all.
> > >> >
> > >> > RE 3: Yes, those endpoints LGTM. And yes, bounds on the timeout also
> > >> seem
> > >> > reasonable... maybe a low-importance worker property could work for
> > >> that?
> > >> > Not sure what would make sense for a default; probably somewhere in
> > the
> > >> > 10-60 minute range but would be interested in others' thoughts.
> > >> >
> > >> > Thanks for the clarification on the zombie fencing logic. I think we
> > >> might
> > >> > want to have some more subtle logic around the interaction between
> > >> calls to
> > >> > Admin::fenceProducers and a worker-level timeout property if we go
> > that
> > >> > route, but we can cross that particular bridge if we get back to it.
> > >> >
> > >> > Cheers,
> > >> >
> > >> > Chris
> > >> >
> > >> > On Wed, Nov 2, 2022 at 1:48 PM Yash Mayya <[email protected]>
> > wrote:
> > >> >
> > >> > > Hi Chris,
> > >> > >
> > >> > > Thanks a lot for the super quick response and the great feedback!
> > >> > >
> > >> > > 1. I think that makes a lot of sense, and I'd be happy to update
> the
> > >> KIP
> > >> > to
> > >> > > include this change in the scope. The current behavior where the
> API
> > >> > > response indicates a time out but the connector is created/updated
> > >> > > eventually anyway can be pretty confusing and is generally not a
> > good
> > >> > user
> > >> > > experience IMO.
> > >> > >
> > >> > > 2. Wow, thanks for pointing this out - it's a really good catch
> and
> > >> > > something I hadn't noticed was happening with the current
> > >> implementation.
> > >> > > While I do like the idea of having a query parameter that
> determines
> > >> > > whether validations can be skipped, I'm wondering if it might not
> be
> > >> > easier
> > >> > > and cleaner to just do the leader check earlier and avoid doing
> the
> > >> > > unnecessary config validation on the first worker? Since each
> config
> > >> > > validation happens on its own thread, I'm not so sure about the
> > >> concern
> > >> > of
> > >> > > overloading the leader even on larger clusters, especially since
> > >> > > validations aren't typically long running operations. Furthermore,
> > >> even
> > >> > > with the current implementation, the leader will always be doing a
> > >> config
> > >> > > validation for connector create / update REST API requests on any
> > >> worker.
> > >> > >
> > >> > > 3. That's a good point, and this way we can also restrict the APIs
> > >> whose
> > >> > > timeouts are configurable - I'm thinking `PUT
> > >> > > /connector-plugins/{pluginName}/config/validate`, `POST
> /connectors`
> > >> and
> > >> > > `PUT /connectors/{connector}/config` are the ones where such a
> > timeout
> > >> > > parameter could be useful. Also, do you think we should enforce
> some
> > >> > > reasonable bounds for the timeout config?
> > >> > >
> > >> > > On the zombie fencing point, the implication was that the new
> worker
> > >> > > property would not control the timeout used for the call to
> > >> > > Admin::fenceProducers. However, if we go with a timeout query
> > >> parameter
> > >> > > approach, even the timeout for the `PUT
> > /connectors/{connector}/fence'
> > >> > > endpoint will remain unaffected.
> > >> > >
> > >> > > Thanks,
> > >> > > Yash
> > >> > >
> > >> > > On Wed, Nov 2, 2022 at 8:13 PM Chris Egerton
> > <[email protected]
> > >> >
> > >> > > wrote:
> > >> > >
> > >> > > > Hi Yash,
> > >> > > >
> > >> > > > Thanks for the KIP. It's a nice, focused change. Initially I was
> > >> > hesitant
> > >> > > > to support cases where connector validation takes this long, but
> > >> > > > considering the alternative is that we give users a 500 error
> > >> response
> > >> > > but
> > >> > > > leave the request to create/modify the connector queued up in
> the
> > >> > > herder, I
> > >> > > > think I can get behind the motivation here. There's also an
> > >> argument to
> > >> > > be
> > >> > > > made about keeping Kafka Connect available even when the systems
> > >> that
> > >> > it
> > >> > > > connects to are in a degraded state.
> > >> > > >
> > >> > > > I have a few alternatives I'd be interested in your thoughts on:
> > >> > > >
> > >> > > > 1. Since the primary concern here seems to be that custom
> > connector
> > >> > > > validation logic can take too long, do we have any thoughts on
> > >> adding
> > >> > > logic
> > >> > > > to check for request timeout after validation has completed and,
> > if
> > >> it
> > >> > > has,
> > >> > > > aborting the attempt to create/modify the connector?
> > >> > > >
> > >> > > > 2. Right now it's possible that we'll perform two connector
> config
> > >> > > > validations per create/modify request; once on the worker that
> > >> > initially
> > >> > > > receives the request, and then again if that worker is not the
> > >> leader
> > >> > of
> > >> > > > the cluster and has to forward the request to the leader. Any
> > >> thoughts
> > >> > on
> > >> > > > optimizing this to only require a single validation per request?
> > We
> > >> > > > probably wouldn't want to force all validations to take place on
> > the
> > >> > > leader
> > >> > > > (could lead to overloading it pretty quickly in large clusters),
> > >> but we
> > >> > > > could add an internal-only query parameter to skip validation
> and
> > >> then
> > >> > > use
> > >> > > > that parameter when forwarding requests from followers to the
> > >> leader.
> > >> > > >
> > >> > > > 3. A worker property is pretty coarse-grained, and difficult to
> > >> change.
> > >> > > We
> > >> > > > might allow per-request toggling of the timeout by adding a URL
> > >> query
> > >> > > > parameter like '?timeout=90s' to the REST API to allow tweaking
> of
> > >> the
> > >> > > > timeout on a more granular basis, and without having to perform
> a
> > >> > worker
> > >> > > > restart.
> > >> > > >
> > >> > > > I'd also like to clarify a point about the rejected alternative
> > >> "Allow
> > >> > > > configuring producer zombie fencing admin request timeout"--is
> the
> > >> > > > implication here that the "rest.api.request.timeout.ms"
> property
> > >> will
> > >> > > not
> > >> > > > control the REST timeout for requests to the 'PUT
> > >> > > > /connectors/{connector}/fence' endpoint, or just that it won't
> > >> control
> > >> > > the
> > >> > > > timeout that we use for the call to Admin::fenceProducers?
> > >> > > >
> > >> > > > Cheers,
> > >> > > >
> > >> > > > Chris
> > >> > > >
> > >> > > > On Wed, Nov 2, 2022 at 10:07 AM Yash Mayya <
> [email protected]>
> > >> > wrote:
> > >> > > >
> > >> > > > > Hi all,
> > >> > > > >
> > >> > > > > I'd like to start a discussion thread on this small KIP -
> > >> > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://eu01.z.antigena.com/l/8mqVuSEhdZF05LKGk7J1Cb6esUWht6ps4-73ocd196fgyxC~lGhTLqBEt9BPJy1MBx6RzKVi7qRYYpJwCHRixlmdcayj7YMv~-0fIZJ2f8EjEhp-Qck3~O3Ga1JBTgaY4tTAlwh0VjF6217iBaGVG2PNa-2TO5z8zC5s0Nb28kcG2mUg1tcEfzsAtmWEXWRECoeD4cdUXyPvWeFtYL-ccVv-PmQydgSIYkdeR7~XItAXezft9Mfs5pzjejXSK~MhWW
> > >> > > > >
> > >> > > > > It proposes the addition of a new Kafka Connect worker
> > >> configuration
> > >> > to
> > >> > > > > allow configuring REST API request timeouts.
> > >> > > > >
> > >> > > > > Thanks,
> > >> > > > > Yash
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

Re: [DISCUSS] KIP-882: Make Kafka Connect REST API request timeouts configurable

Reply via email to