Thanks for the KIP, Chris. This will be a nice improvement on top of the work you're already doing in https://issues.apache.org/jira/browse/KAFKA-9374
Overall I like the direction this is going, but I do have some comments/questions/suggestions, all of which are pretty minor except for the last three that deal with the proposed new worker config property: 1. Can you clarify in the last sentence of the Background section which resources may not be cleaned up. IIUC, these are connector-specific resources and are not worker resources. Is that true? 2. The KIP proposes adding two new MBean names for the connector-specific metrics (e.g., "kafka.connect:type=abandoned-connector-metrics,connector=([-.\w]+)") and task-specific metrics (e.g., "kafka.connect:type=abandoned-task-metrics,connector=([-.\w]+),task=([\d]+)"). Did you consider adding these to the existing MBean names "kafka.connect:type=connector-metrics,connector=..." and "kafka.connect:type=connector-task-metrics,connector=...", respectively? What is the benefit of having these in separate MBeans? In fact, I wonder if having them in the existing MBeans and naming the metrics similarly to the worker's new metrics would help correlate them. 3. The new metrics often include the phrase "instances of this connector/task that the worker has abandoned". Are these really connector or task instances, or are they requests? Perhaps they are effectively the same thing, though might it ever be the case that Connect attempts to clean up connector and task instances, in which case the subsequent `Connector.stop()` for example might also block. 4. The KIP proposes adding the new configuration property " connector.shutdown.graceful.timeout.ms", but doesn't say what the importance is. I presume it would be LOW just like the existing task timeout, but it'd be good to specify this. It might also be worthwhile to mention that the default will match the current default of the task timeout. 5. The section about the new configuration property says: "Additionally, the task.shutdown.graceful.timeout.ms property will be deprecated and removed in a later release." We will deprecate it in some future release and then possibly remove it in a subsequent major release. Also, it is possible that we never actually remove the old configuration, so perhaps this sentence could be a bit more precise: "Additionally, the task.shutdown.graceful.timeout.ms property will be deprecated in a later release." 6. I presume the "connector.shutdown.graceful.timeout.ms" property is being proposed because the existing "task.shutdown.graceful.timeout.ms" property is targeted only to tasks. While coopting the existing task-based property to use for connectors may not be desirable, wouldn't using the new connector-based property and deprecating the old task-based property cause a similar mismatch? If so, why go through the hassle of the config changes in the first place? (Even if you disagree that there is no mismatch and it's worth the switch, using the existing task-based timeout property for task and connectors should be a rejected alternative.) 7. Now, I actually think we will prefer to *not* remove many deprecated methods/configs/classes because doing so will artificially limit the number of Connect runtime versions into which a particular connector version could be installed. Given that, and given #6 above, does that change the calculus in the second rejected alternative about keeping both the task- and connector-based timeout property? I actually think it might. 8. How much value does the sequence number in the logging context bring? Is this really worth the risk given the backward compatibility concerns? I've added this to the AK 2.6.0 release plan ( https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=152113430). If we can't wrap up this KIP and approve/merge a corresponding PR within the 2.6 timeframe, we can move this KIP to the "postponed" section of the 2.6. Best regards, Randall On Fri, May 8, 2020 at 1:58 PM Christopher Egerton <[email protected]> wrote: > Hi all, > > I've introduced a KIP to improve how the Connect framework handles > abandoned connectors and tasks: > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-611%3A+Improved+Handling+of+Abandoned+Connectors+and+Tasks > > The improvements currently being proposed include new JMX metrics, a slight > shuffling of worker configuration properties, and some logging additions. > > Would like to hear your thoughts on this and see if we can possibly squeeze > it in in time for the upcoming 2.6 release :) > > Cheers, > > Chris >
