Hi Chris,

Thanks a lot for the super quick response and the great feedback!

1. I think that makes a lot of sense, and I'd be happy to update the KIP to
include this change in the scope. The current behavior, where the API
response indicates a timeout but the connector is eventually created/updated
anyway, can be pretty confusing and is generally not a good user
experience IMO.
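
To make this a bit more concrete, here's a rough standalone sketch of the kind
of check I think point 1 is describing (the exception and method names below
are made-up placeholders for illustration, not actual Connect internals):

```java
import java.time.Duration;
import java.time.Instant;

public class TimeoutAwareCreateSketch {

    static class RequestTimeoutException extends RuntimeException {
        RequestTimeoutException(String message) { super(message); }
    }

    // Stand-in for connector config validation that may take arbitrarily long
    // (e.g. a validation that reaches out to an external system).
    static void validateConnectorConfig() throws InterruptedException {
        Thread.sleep(500);
    }

    // Record the request deadline up front, run validation, and only apply the
    // create/update if the deadline hasn't already passed.
    static void createConnector(Duration requestTimeout) throws InterruptedException {
        Instant deadline = Instant.now().plus(requestTimeout);
        validateConnectorConfig();
        if (Instant.now().isAfter(deadline)) {
            // Abort instead of silently writing the connector config after the
            // client has already received a timeout response.
            throw new RequestTimeoutException("Request timed out during config validation; aborting create/update");
        }
        System.out.println("Validation finished in time; writing connector config");
    }

    public static void main(String[] args) throws InterruptedException {
        createConnector(Duration.ofSeconds(2));       // completes in time
        try {
            createConnector(Duration.ofMillis(100));  // aborted after validation
        } catch (RequestTimeoutException e) {
            System.out.println("Aborted: " + e.getMessage());
        }
    }
}
```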

2. Wow, thanks for pointing this out - it's a really good catch and
something I hadn't noticed about the current implementation. While I do like
the idea of having a query parameter that determines whether validations can
be skipped, I'm wondering if it might be easier and cleaner to just do the
leader check earlier and avoid the unnecessary config validation on the
first worker? Since each config validation happens on its own thread, I'm not
too worried about overloading the leader even in larger clusters, especially
since validations aren't typically long-running operations. Furthermore, even
with the current implementation, the leader always ends up performing a config
validation for connector create/update REST API requests, regardless of which
worker receives them.
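
For what it's worth, the ordering I have in mind would look roughly like this
(a simplified sketch with placeholder method names, not the actual herder
code):

```java
public class LeaderCheckFirstSketch {

    private final boolean isLeader;

    LeaderCheckFirstSketch(boolean isLeader) {
        this.isLeader = isLeader;
    }

    // Do the leader check before validating, so that each create/update
    // request is only ever validated once (on the leader).
    void handleCreateOrUpdate(String connectorConfig) {
        if (!isLeader) {
            forwardToLeader(connectorConfig); // forward the raw, unvalidated request
            return;
        }
        validateConfig(connectorConfig);
        writeConfigToConfigTopic(connectorConfig);
    }

    private void forwardToLeader(String connectorConfig) {
        System.out.println("Not the leader; forwarding unvalidated request");
    }

    private void validateConfig(String connectorConfig) {
        System.out.println("Validating config on the leader");
    }

    private void writeConfigToConfigTopic(String connectorConfig) {
        System.out.println("Writing validated config to the config topic");
    }

    public static void main(String[] args) {
        new LeaderCheckFirstSketch(false).handleCreateOrUpdate("{\"name\": \"my-connector\"}");
        new LeaderCheckFirstSketch(true).handleCreateOrUpdate("{\"name\": \"my-connector\"}");
    }
}
```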

3. That's a good point, and this way we can also restrict the set of APIs
whose timeouts are configurable - I'm thinking `PUT
/connector-plugins/{pluginName}/config/validate`, `POST /connectors` and
`PUT /connectors/{connector}/config` are the ones where such a timeout
parameter could be useful. Also, do you think we should enforce some
reasonable bounds on the timeout value?
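
On the bounds question, I'm picturing something along these lines for parsing
the query parameter (the parameter format, default, and upper bound here are
just illustrative, not a proposal for specific values):

```java
import java.time.Duration;

public class RequestTimeoutParamSketch {

    static final Duration DEFAULT_TIMEOUT = Duration.ofSeconds(90);
    static final Duration MAX_TIMEOUT = Duration.ofMinutes(10);

    // Accepts values like "90s" or "5000ms"; falls back to the default when the
    // parameter is absent, and rejects values outside the allowed bounds.
    static Duration resolveTimeout(String timeoutParam) {
        if (timeoutParam == null || timeoutParam.isEmpty()) {
            return DEFAULT_TIMEOUT;
        }
        Duration requested;
        if (timeoutParam.endsWith("ms")) {
            requested = Duration.ofMillis(Long.parseLong(timeoutParam.substring(0, timeoutParam.length() - 2)));
        } else if (timeoutParam.endsWith("s")) {
            requested = Duration.ofSeconds(Long.parseLong(timeoutParam.substring(0, timeoutParam.length() - 1)));
        } else {
            throw new IllegalArgumentException("Unsupported timeout format: " + timeoutParam);
        }
        if (requested.isNegative() || requested.isZero() || requested.compareTo(MAX_TIMEOUT) > 0) {
            throw new IllegalArgumentException("Timeout must be between 1ms and " + MAX_TIMEOUT);
        }
        return requested;
    }

    public static void main(String[] args) {
        System.out.println(resolveTimeout(null));    // default (90s)
        System.out.println(resolveTimeout("120s"));  // 2 minutes
        System.out.println(resolveTimeout("500ms")); // 500 milliseconds
    }
}
```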

On the zombie fencing point, the implication was that the new worker
property would not control the timeout used for the call to
Admin::fenceProducers. However, if we go with a timeout query parameter
approach, even the REST timeout for the `PUT /connectors/{connector}/fence`
endpoint will remain unaffected.
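
Just to clarify what I mean by the Admin::fenceProducers timeout being
separate - that timeout is specified on the admin client call itself, roughly
like this (assuming the standard kafka-clients Admin API; the bootstrap
servers, transactional ID, and timeout value are made up):

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.FenceProducersOptions;

public class FenceProducersTimeoutSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            FenceProducersOptions options = new FenceProducersOptions()
                    .timeoutMs(30_000); // admin-side timeout, independent of the REST request timeout
            admin.fenceProducers(List.of("my-connector-0"), options)
                    .all()
                    .get();
        }
    }
}
```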

Thanks,
Yash

On Wed, Nov 2, 2022 at 8:13 PM Chris Egerton <chr...@aiven.io.invalid>
wrote:

> Hi Yash,
>
> Thanks for the KIP. It's a nice, focused change. Initially I was hesitant
> to support cases where connector validation takes this long, but
> considering the alternative is that we give users a 500 error response but
> leave the request to create/modify the connector queued up in the herder, I
> think I can get behind the motivation here. There's also an argument to be
> made about keeping Kafka Connect available even when the systems that it
> connects to are in a degraded state.
>
> I have a few alternatives I'd be interested in your thoughts on:
>
> 1. Since the primary concern here seems to be that custom connector
> validation logic can take too long, do we have any thoughts on adding logic
> to check for request timeout after validation has completed and, if it has,
> aborting the attempt to create/modify the connector?
>
> 2. Right now it's possible that we'll perform two connector config
> validations per create/modify request; once on the worker that initially
> receives the request, and then again if that worker is not the leader of
> the cluster and has to forward the request to the leader. Any thoughts on
> optimizing this to only require a single validation per request? We
> probably wouldn't want to force all validations to take place on the leader
> (could lead to overloading it pretty quickly in large clusters), but we
> could add an internal-only query parameter to skip validation and then use
> that parameter when forwarding requests from followers to the leader.
>
> 3. A worker property is pretty coarse-grained, and difficult to change. We
> might allow per-request toggling of the timeout by adding a URL query
> parameter like '?timeout=90s' to the REST API to allow tweaking of the
> timeout on a more granular basis, and without having to perform a worker
> restart.
>
> I'd also like to clarify a point about the rejected alternative "Allow
> configuring producer zombie fencing admin request timeout"--is the
> implication here that the "rest.api.request.timeout.ms" property will not
> control the REST timeout for requests to the 'PUT
> /connectors/{connector}/fence' endpoint, or just that it won't control the
> timeout that we use for the call to Admin::fenceProducers?
>
> Cheers,
>
> Chris
>
> On Wed, Nov 2, 2022 at 10:07 AM Yash Mayya <yash.ma...@gmail.com> wrote:
>
> > Hi all,
> >
> > I'd like to start a discussion thread on this small KIP -
> >
> >
> https://eu01.z.antigena.com/l/8mqVuSEhdZF05LKGk7J1Cb6esUWht6ps4-73ocd196fgyxC~lGhTLqBEt9BPJy1MBx6RzKVi7qRYYpJwCHRixlmdcayj7YMv~-0fIZJ2f8EjEhp-Qck3~O3Ga1JBTgaY4tTAlwh0VjF6217iBaGVG2PNa-2TO5z8zC5s0Nb28kcG2mUg1tcEfzsAtmWEXWRECoeD4cdUXyPvWeFtYL-ccVv-PmQydgSIYkdeR7~XItAXezft9Mfs5pzjejXSK~MhWW
> >
> > It proposes the addition of a new Kafka Connect worker configuration to
> > allow configuring REST API request timeouts.
> >
> > Thanks,
> > Yash
> >
>
