Hi Randall,

Thanks for the quick review! I'll respond to your comments with the same
numbering that you gave them.

1. (Have added clarification to the KIP but will copy here for anyone
interested): Most of these resources are task-specific and include things
like file descriptors, network connections, or resource locks. However,
there are also some worker resources that are also still retained,
including the thread for the task itself, the memory allocated for it, and
the producer or consumer set up by the framework on its behalf.

2. Yeah, initially I did consider the approach of putting the connector-
and task-specific MBeans inside the existing namespaces for those, and it's
definitely not a bad approach. However, at the moment, the lifetimes of the
"kafka.connect:type=connector-metrics,connector=..." and
"kafka.connect:type=connector-task-metrics,connector=..." MBeans are tied
to the lifetimes of the connectors and tasks that they report on. If a
connector is deleted and some of its tasks have to be abandoned during
shutdown, those MBeans are still removed. It may break some monitoring
tools if those MBeans are still present even after the framework has
deleted a connector and/or its tasks. This also gives us more room to add
new connector- or task-specific metrics for abandoned instances in the
future if we'd like to. I've updated the "Rejected Alternatives" section to
include this rationale but am happy to continue the discussion.

3. I may not fully understand what you're asking here so please let me know
if this doesn't answer your question. The "total" and "current" metrics for
abandoned connector/task instances will both be incremented if/when a
worker requests that they shut down, they fail to shut down within the
graceful shutdown period, and the framework abandons them. If this happens
multiple times in a row without any of those abandoned connectors or tasks
completing shutdown, those metrics will increment every time it does. As
soon as a connector or task completes its shutdown or throws an exception
in the process, the "current" metric will decrement.

4. Both good points; added to the KIP.

5. I was hoping we could deprecate the "task.shutdown.graceful.timeout.ms"
in the same release as the KIP; my understanding of deprecation is that
it's permitted in a minor release by semantic versioning (
https://semver.org/#how-should-i-handle-deprecating-functionality). Happy
to strip the language about removal, though; agreed that it may never be
worth it to actually remove this entirely.

6. I think the biggest advantage of the change in config name is that, when
people talk about Connect, they use the term "tasks" to refer exclusively
to connector tasks and not connectors, whereas the term "connector" can
refer a connector and all of its tasks (the "logical" connector that the
user configures, as opposed to the "physical" Java Connector object that
the framework brings up). If it's not worth the hassle to be completely
correct with our choice of diction, then yes, we can just co-opt the
existing "task.shutdown.graceful.timeout.ms" property. At the moment I've
added an entry in "Rejected Alternatives" but happy to discuss.

7. Given that this is a worker-only config property that doesn't remove
functionality from the framework (users can always specify a higher value
for "connector.shutdown.graceful.timeout.ms" that matches whatever they
used to use for "task.shutdown.graceful.timeout.ms"), I don't see how this
introduces compatibility limitations between new workers and old
connectors, or vice-versa. Could you share an example of where we might run
into trouble?

8. I think a sequence number provides significant value in the specific
case that there is an abandoned connector or task that is still emitting
log messages. From my experience with reading worker logs while trying to
debug behavior in this type of situation, things can get pretty confusing
if there's something like what appears to be a successful shutdown of a
sink task followed by calls to put() by the worker on that same task.
Providing distinction between those different instances would address that
need. We might consider either altering the formatting of the connector
context to be less disruptive (maybe instead of embedding the sequence
number inside the brackets, place it afterwards--something like
"[local-file-source|task-0] (instance 1)"?), or not adding the sequence
number to the connector context at all, and instead manually embedding it
in the log messages that the worker is coded to produce.

Wishing you and everyone a nice weekend!

Cheers,

Chris

On Fri, May 8, 2020 at 2:28 PM Randall Hauch <[email protected]> wrote:

> Thanks for the KIP, Chris. This will be a nice improvement on top of the
> work you're already doing in
> https://issues.apache.org/jira/browse/KAFKA-9374
>
> Overall I like the direction this is going, but I do have some
> comments/questions/suggestions, all of which are pretty minor except for
> the last three that deal with the proposed new worker config property:
>
> 1. Can you clarify in the last sentence of the Background section which
> resources may not be cleaned up. IIUC, these are connector-specific
> resources and are not worker resources. Is that true?
> 2. The KIP proposes adding two new MBean names for the connector-specific
> metrics (e.g.,
> "kafka.connect:type=abandoned-connector-metrics,connector=([-.\w]+)") and
> task-specific metrics (e.g.,
>
> "kafka.connect:type=abandoned-task-metrics,connector=([-.\w]+),task=([\d]+)").
> Did you consider adding these to the existing MBean names
> "kafka.connect:type=connector-metrics,connector=..." and
> "kafka.connect:type=connector-task-metrics,connector=...", respectively?
> What is the benefit of having these in separate MBeans? In fact, I wonder
> if having them in the existing MBeans and naming the metrics similarly to
> the worker's new metrics would help correlate them.
> 3. The new metrics often include the phrase "instances of this
> connector/task that the worker has abandoned". Are these really connector
> or task instances, or are they requests? Perhaps they are effectively the
> same thing, though might it ever be the case that Connect attempts to clean
> up connector and task instances, in which case the subsequent
> `Connector.stop()` for example might also block.
> 4. The KIP proposes adding the new configuration property "
> connector.shutdown.graceful.timeout.ms", but doesn't say what the
> importance is. I presume it would be LOW just like the existing task
> timeout, but it'd be good to specify this. It might also be worthwhile to
> mention that the default will match the current default of the task
> timeout.
> 5. The section about the new configuration property says: "Additionally,
> the task.shutdown.graceful.timeout.ms property will be deprecated and
> removed in a later release." We will deprecate it in some future release
> and then possibly remove it in a subsequent major release. Also, it is
> possible that we never actually remove the old configuration, so perhaps
> this sentence could be a bit more precise: "Additionally, the
> task.shutdown.graceful.timeout.ms property will be deprecated in a later
> release."
> 6. I presume the "connector.shutdown.graceful.timeout.ms" property is
> being
> proposed because the existing "task.shutdown.graceful.timeout.ms" property
> is targeted only to tasks. While coopting the existing task-based property
> to use for connectors may not be desirable, wouldn't using the new
> connector-based property and deprecating the old task-based property cause
> a similar mismatch? If so, why go through the hassle of the config changes
> in the first place? (Even if you disagree that there is no mismatch and
> it's worth the switch, using the existing task-based timeout property for
> task and connectors should be a rejected alternative.)
> 7. Now, I actually think we will prefer to *not* remove many deprecated
> methods/configs/classes because doing so will artificially limit the number
> of Connect runtime versions into which a particular connector version could
> be installed. Given that, and given #6 above, does that change the calculus
> in the second rejected alternative about keeping both the task- and
> connector-based timeout property? I actually think it might.
> 8. How much value does the sequence number in the logging context bring? Is
> this really worth the risk given the backward compatibility concerns?
>
> I've added this to the AK 2.6.0 release plan (
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=152113430
> ).
> If we can't wrap up this KIP and approve/merge a corresponding PR within
> the 2.6 timeframe, we can move this KIP to the "postponed" section of the
> 2.6.
>
> Best regards,
>
> Randall
>
> On Fri, May 8, 2020 at 1:58 PM Christopher Egerton <[email protected]>
> wrote:
>
> > Hi all,
> >
> > I've introduced a KIP to improve how the Connect framework handles
> > abandoned connectors and tasks:
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-611%3A+Improved+Handling+of+Abandoned+Connectors+and+Tasks
> >
> > The improvements currently being proposed include new JMX metrics, a
> slight
> > shuffling of worker configuration properties, and some logging additions.
> >
> > Would like to hear your thoughts on this and see if we can possibly
> squeeze
> > it in in time for the upcoming 2.6 release :)
> >
> > Cheers,
> >
> > Chris
> >
>

Reply via email to