Hi all, Thanks for all the feedback and discussion. I've renamed the KIP title to "Kafka Connect REST API timeout improvements" since we're introducing a couple of improvements (cancelling create / update connector requests when config validations timeout and avoiding double config validations in distributed mode) along with making the request timeouts configurable. The new KIP link is https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Kafka+Connect+REST+API+timeout+improvements and I've called for a vote for the KIP here - https://lists.apache.org/thread/mgx8lczx2f57pk7x3vh0nqk00s79grgp.
Thanks, Yash On Sat, Nov 5, 2022 at 11:42 PM Sagar <sagarmeansoc...@gmail.com> wrote: > Hey Yash, > > Thanks for the explanation. I think it should be fine to delegate the > validation directly to the leader. > > Thanks! > Sagar. > > On Sat, Nov 5, 2022 at 10:42 AM Yash Mayya <yash.ma...@gmail.com> wrote: > > > Hi Sagar, > > > > Thanks for chiming in! > > > > > Having said that, why does the worker forward to the > > > leader? I am thinking if the worker can perform the validation on it's > > own, > > > we could let it do the validation instead of forwarding everything to > the > > > leader > > > > Only the leader is allowed to perform writes to the config topic, so any > > request that requires a config topic write ends up being forwarded to the > > leader. The `POST /connectors` and `PUT /connectors/{connector}/config` > > endpoints call `Herder::putConnectorConfig` which internally does a > config > > validation first before initiating another herder request to write to the > > config topic (in distributed mode). If we want to allow the first worker > to > > do the config validation, and then forward the request to the leader just > > for the write to the config topic, we'd either need something like a skip > > validations query parameter on the endpoint like Chris talks about above > or > > else a new internal only endpoint that just does a write to the config > > topic without any prior config validation. However, we agreed that this > > optimization doesn't really seem necessary for now and can be done later > if > > deemed useful. I'd be happy to hear differing thoughts if any, however. > > > > > I think a bound is certainly needed but IMO it shouldn't go beyond > > > 10 mins considering this is just validation > > > > Yeah, I agree that this seems like a fair value - I've elected to go > with a > > default value of 10 minutes for the proposed worker configuration that > sets > > an upper bound for the timeout query parameter. > > > > Thanks, > > Yash > > > > On Sat, Nov 5, 2022 at 10:30 AM Yash Mayya <yash.ma...@gmail.com> wrote: > > > > > Hi Chris, > > > > > > Thanks again for your feedback. I think a worker configuration for the > > > upper bound makes sense - I initially thought we could hardcode it > (just > > > like the current request timeout is), but there's no reason to set > > another > > > artificial bound that isn't user configurable which is exactly what > we're > > > trying to change here in the first place. I've updated the KIP based on > > all > > > our discussion above. > > > > > > Thanks, > > > Yash > > > > > > On Thu, Nov 3, 2022 at 11:01 PM Sagar <sagarmeansoc...@gmail.com> > wrote: > > > > > >> Hey Yash, > > >> > > >> Thanks for the KIP! This looks like a useful feature. > > >> > > >> I think the discussion thread already has some great points by Chris. > > Just > > >> a couple of points/clarifications=> > > >> > > >> Regarding, pt#2 , I guess it might be better to forward to the leader > as > > >> suggested by Yash. Having said that, why does the worker forward to > the > > >> leader? I am thinking if the worker can perform the validation on it's > > >> own, > > >> we could let it do the validation instead of forwarding everything to > > the > > >> leader(even though it might be cheap to forward all requests to the > > >> leader). > > >> > > >> Pt#3 => I think a bound is certainly needed but IMO it shouldn't go > > beyond > > >> 10 mins considering this is just validation. We shouldn't end up in a > > >> situation where a few faulty connectors end up blocking a lot of > request > > >> processing threads, so while increasing the config is certainly > helpful, > > >> we > > >> shouldn't set too high a value IMO. Of course I am also open to > > >> suggestions > > >> here. > > >> > > >> Thanks! > > >> Sagar. > > >> > > >> On Thu, Nov 3, 2022 at 9:01 PM Chris Egerton <chr...@aiven.io.invalid > > > > >> wrote: > > >> > > >> > Hi Yash, > > >> > > > >> > RE 2: That's a great point about validations already being performed > > by > > >> the > > >> > leader. For completeness's sake, I'd like to note that this only > holds > > >> for > > >> > valid configurations; invalid ones are caught right now before being > > >> > forwarded to the leader. Still, I think it's fine to forward to the > > >> leader > > >> > for now and optimize further in the future if necessary. If frequent > > >> > validations are taking place they should be conducted via the `PUT > > >> > /connector-plugins/{pluginName}/config/validate` endpoint, which > won't > > >> do > > >> > any forwarding at all. > > >> > > > >> > RE 3: Yes, those endpoints LGTM. And yes, bounds on the timeout also > > >> seem > > >> > reasonable... maybe a low-importance worker property could work for > > >> that? > > >> > Not sure what would make sense for a default; probably somewhere in > > the > > >> > 10-60 minute range but would be interested in others' thoughts. > > >> > > > >> > Thanks for the clarification on the zombie fencing logic. I think we > > >> might > > >> > want to have some more subtle logic around the interaction between > > >> calls to > > >> > Admin::fenceProducers and a worker-level timeout property if we go > > that > > >> > route, but we can cross that particular bridge if we get back to it. > > >> > > > >> > Cheers, > > >> > > > >> > Chris > > >> > > > >> > On Wed, Nov 2, 2022 at 1:48 PM Yash Mayya <yash.ma...@gmail.com> > > wrote: > > >> > > > >> > > Hi Chris, > > >> > > > > >> > > Thanks a lot for the super quick response and the great feedback! > > >> > > > > >> > > 1. I think that makes a lot of sense, and I'd be happy to update > the > > >> KIP > > >> > to > > >> > > include this change in the scope. The current behavior where the > API > > >> > > response indicates a time out but the connector is created/updated > > >> > > eventually anyway can be pretty confusing and is generally not a > > good > > >> > user > > >> > > experience IMO. > > >> > > > > >> > > 2. Wow, thanks for pointing this out - it's a really good catch > and > > >> > > something I hadn't noticed was happening with the current > > >> implementation. > > >> > > While I do like the idea of having a query parameter that > determines > > >> > > whether validations can be skipped, I'm wondering if it might not > be > > >> > easier > > >> > > and cleaner to just do the leader check earlier and avoid doing > the > > >> > > unnecessary config validation on the first worker? Since each > config > > >> > > validation happens on its own thread, I'm not so sure about the > > >> concern > > >> > of > > >> > > overloading the leader even on larger clusters, especially since > > >> > > validations aren't typically long running operations. Furthermore, > > >> even > > >> > > with the current implementation, the leader will always be doing a > > >> config > > >> > > validation for connector create / update REST API requests on any > > >> worker. > > >> > > > > >> > > 3. That's a good point, and this way we can also restrict the APIs > > >> whose > > >> > > timeouts are configurable - I'm thinking `PUT > > >> > > /connector-plugins/{pluginName}/config/validate`, `POST > /connectors` > > >> and > > >> > > `PUT /connectors/{connector}/config` are the ones where such a > > timeout > > >> > > parameter could be useful. Also, do you think we should enforce > some > > >> > > reasonable bounds for the timeout config? > > >> > > > > >> > > On the zombie fencing point, the implication was that the new > worker > > >> > > property would not control the timeout used for the call to > > >> > > Admin::fenceProducers. However, if we go with a timeout query > > >> parameter > > >> > > approach, even the timeout for the `PUT > > /connectors/{connector}/fence' > > >> > > endpoint will remain unaffected. > > >> > > > > >> > > Thanks, > > >> > > Yash > > >> > > > > >> > > On Wed, Nov 2, 2022 at 8:13 PM Chris Egerton > > <chr...@aiven.io.invalid > > >> > > > >> > > wrote: > > >> > > > > >> > > > Hi Yash, > > >> > > > > > >> > > > Thanks for the KIP. It's a nice, focused change. Initially I was > > >> > hesitant > > >> > > > to support cases where connector validation takes this long, but > > >> > > > considering the alternative is that we give users a 500 error > > >> response > > >> > > but > > >> > > > leave the request to create/modify the connector queued up in > the > > >> > > herder, I > > >> > > > think I can get behind the motivation here. There's also an > > >> argument to > > >> > > be > > >> > > > made about keeping Kafka Connect available even when the systems > > >> that > > >> > it > > >> > > > connects to are in a degraded state. > > >> > > > > > >> > > > I have a few alternatives I'd be interested in your thoughts on: > > >> > > > > > >> > > > 1. Since the primary concern here seems to be that custom > > connector > > >> > > > validation logic can take too long, do we have any thoughts on > > >> adding > > >> > > logic > > >> > > > to check for request timeout after validation has completed and, > > if > > >> it > > >> > > has, > > >> > > > aborting the attempt to create/modify the connector? > > >> > > > > > >> > > > 2. Right now it's possible that we'll perform two connector > config > > >> > > > validations per create/modify request; once on the worker that > > >> > initially > > >> > > > receives the request, and then again if that worker is not the > > >> leader > > >> > of > > >> > > > the cluster and has to forward the request to the leader. Any > > >> thoughts > > >> > on > > >> > > > optimizing this to only require a single validation per request? > > We > > >> > > > probably wouldn't want to force all validations to take place on > > the > > >> > > leader > > >> > > > (could lead to overloading it pretty quickly in large clusters), > > >> but we > > >> > > > could add an internal-only query parameter to skip validation > and > > >> then > > >> > > use > > >> > > > that parameter when forwarding requests from followers to the > > >> leader. > > >> > > > > > >> > > > 3. A worker property is pretty coarse-grained, and difficult to > > >> change. > > >> > > We > > >> > > > might allow per-request toggling of the timeout by adding a URL > > >> query > > >> > > > parameter like '?timeout=90s' to the REST API to allow tweaking > of > > >> the > > >> > > > timeout on a more granular basis, and without having to perform > a > > >> > worker > > >> > > > restart. > > >> > > > > > >> > > > I'd also like to clarify a point about the rejected alternative > > >> "Allow > > >> > > > configuring producer zombie fencing admin request timeout"--is > the > > >> > > > implication here that the "rest.api.request.timeout.ms" > property > > >> will > > >> > > not > > >> > > > control the REST timeout for requests to the 'PUT > > >> > > > /connectors/{connector}/fence' endpoint, or just that it won't > > >> control > > >> > > the > > >> > > > timeout that we use for the call to Admin::fenceProducers? > > >> > > > > > >> > > > Cheers, > > >> > > > > > >> > > > Chris > > >> > > > > > >> > > > On Wed, Nov 2, 2022 at 10:07 AM Yash Mayya < > yash.ma...@gmail.com> > > >> > wrote: > > >> > > > > > >> > > > > Hi all, > > >> > > > > > > >> > > > > I'd like to start a discussion thread on this small KIP - > > >> > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > > https://eu01.z.antigena.com/l/8mqVuSEhdZF05LKGk7J1Cb6esUWht6ps4-73ocd196fgyxC~lGhTLqBEt9BPJy1MBx6RzKVi7qRYYpJwCHRixlmdcayj7YMv~-0fIZJ2f8EjEhp-Qck3~O3Ga1JBTgaY4tTAlwh0VjF6217iBaGVG2PNa-2TO5z8zC5s0Nb28kcG2mUg1tcEfzsAtmWEXWRECoeD4cdUXyPvWeFtYL-ccVv-PmQydgSIYkdeR7~XItAXezft9Mfs5pzjejXSK~MhWW > > >> > > > > > > >> > > > > It proposes the addition of a new Kafka Connect worker > > >> configuration > > >> > to > > >> > > > > allow configuring REST API request timeouts. > > >> > > > > > > >> > > > > Thanks, > > >> > > > > Yash > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > > > > >