yashmayya commented on PR #13530: URL: https://github.com/apache/kafka/pull/13530#issuecomment-1660303528
Hey Chris, thanks for the reviews and apologies for the late response.

> Honestly, I think the biggest obstacle to preventing pause/resume for connectors that have failed to generate task configs is that it'll look uglier in the code base, especially if we take a naive approach and introduce some kind of public WorkerConnector::setState method (gross).

Haha, that was exactly the reason I tried to avoid doing so, but you're right about the delegate status listener bit, thanks for pointing that out! In fact, it looks like the herder's status listener is wrapped twice:

- Once in the `Worker` here - https://github.com/apache/kafka/blob/660e6fe8108e8a9b3481ea1ec20327a099dd8310/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java#L295
- And then again in the `WorkerConnector` itself, like you pointed out - https://github.com/apache/kafka/blob/660e6fe8108e8a9b3481ea1ec20327a099dd8310/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerConnector.java#L98-L99

> Perhaps we could add a WorkerConnector::taskConfigs method that accepts a boolean reportFailure parameter (feel free to rename), which controls whether any exceptions encountered while invoking Connector::taskConfigs (or anywhere else in the method, really) should be reported to the connector's status listener and result in a state change? It's still a little unclean, but it does reflect the needs of our internal API: in one mode, we want these errors to be thrown to the caller without affecting any other state for the connector, and in another mode, we still want these errors to be thrown (so as to surface in REST responses as 500 errors), but also to result in a change in state for the connector.
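To make the double-wrapping concrete, here's a minimal sketch of the delegation chain (simplified, hypothetical names - not the actual Kafka Connect classes; in the real code the herder's listener is wrapped once in `Worker` and again in `WorkerConnector`):

```java
// Simplified stand-in for the Connect status listener interface.
interface StatusListener {
    void onFailure(String connector, Throwable cause);
}

// Outer-ish wrapper, analogous to the wrapping done in Worker: it could do
// worker-level bookkeeping before delegating to the herder's listener.
class WorkerWrappedListener implements StatusListener {
    private final StatusListener delegate;
    WorkerWrappedListener(StatusListener delegate) { this.delegate = delegate; }
    @Override
    public void onFailure(String connector, Throwable cause) {
        delegate.onFailure(connector, cause);
    }
}

// Second wrapper, analogous to WorkerConnector's own wrapping: it records the
// connector's transition to FAILED before delegating further down the chain.
class ConnectorWrappedListener implements StatusListener {
    private final StatusListener delegate;
    String state = "RUNNING";
    ConnectorWrappedListener(StatusListener delegate) { this.delegate = delegate; }
    @Override
    public void onFailure(String connector, Throwable cause) {
        state = "FAILED";
        delegate.onFailure(connector, cause);
    }
}
```

The upshot is that a failure reported at the outermost layer both updates the `WorkerConnector`-level state and still reaches the herder's listener at the bottom of the chain.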
Hm, that's an interesting idea, but I agree that it feels a little unclean - we'd also need to expand the `WorkerConnector`'s public interface, since the method that updates its internal state and calls the wrapped status listener's `onFailure` hook is currently internal, and exposing it could lead to questions and confusion around when it can safely be invoked externally (i.e. without affecting any other status updates that could occur internally). WDYT?

> Finally, regarding augmenting error messages with a note that tasks may still be running--as long as we don't have to wrap any connector-thrown exceptions in order to introduce this additional wording, I'm in favor of it. I just don't want to add another set of stack frames and another link in the "caused by" chain without good reason since it makes things harder to read (I've seen people copy+paste stack traces that end with ["Tolerance exceeded in error handler"](https://github.com/apache/kafka/blob/30b087ead967b28d459945fe90c80545bf189d1f/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L591) and omit the actual root cause, which is... less than helpful).

With the current changes, we're already wrapping the connector-thrown exception to use `Failed to update tasks after connector restart / startup` as the top-level error message. I definitely get where you're coming from since I've seen the same issue in the past too (with `Tolerance exceeded in error handler`), but I do think this case is different - users who aren't too familiar with the distinction between a connector and its tasks (and their associated statuses) would probably be fairly confused to see a failed connector with running tasks. I feel like this situation would be significantly improved by using a message that describes the task reconfiguration error as the connector's status error message, instead of surfacing whatever the connector throws.
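For concreteness, the boolean-flag approach you suggested might look roughly like the sketch below. This is a hypothetical illustration, not the actual `WorkerConnector` API: the class, the `generateTaskConfigs` stand-in, and the wrapping message placement are all assumptions for the example.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the suggested taskConfigs(boolean reportFailure) idea:
// the flag decides whether a task-config generation error also transitions the
// connector to FAILED, or is only rethrown to the caller.
class TaskConfigSketch {
    interface StatusListener { void onFailure(String connector, Throwable cause); }

    private final String connectorName;
    private final StatusListener statusListener; // stands in for the wrapped herder listener
    String state = "RUNNING";

    TaskConfigSketch(String connectorName, StatusListener statusListener) {
        this.connectorName = connectorName;
        this.statusListener = statusListener;
    }

    List<Map<String, String>> taskConfigs(int maxTasks, boolean reportFailure) {
        try {
            return generateTaskConfigs(maxTasks); // stand-in for Connector::taskConfigs
        } catch (RuntimeException e) {
            if (reportFailure) {
                // Mode two: also fail the connector's status, using a message about
                // task reconfiguration rather than the raw connector-thrown error.
                state = "FAILED";
                statusListener.onFailure(connectorName, new RuntimeException(
                        "Failed to update tasks after connector restart / startup", e));
            }
            // In both modes the error is rethrown so it can surface as a 500 in REST responses.
            throw e;
        }
    }

    protected List<Map<String, String>> generateTaskConfigs(int maxTasks) {
        throw new RuntimeException("simulated task config generation failure");
    }
}
```

The nice property here is that the caller-facing behavior (the exception) is identical in both modes; only the status-listener side effect is gated by the flag.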