Hi Chris,

Thanks for the clarification on the last modified timestamp tracking, both here and on the KIP; things look good to me now.
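Just to make sure we're on the same page about the mechanics, here's a rough sketch of how I picture the in-memory tracking on the worker side (illustrative only; the class and method names are made up, and I'm assuming the log4j 1.x API that the Connect runtime currently uses):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

// Hypothetical helper, not the actual Connect implementation.
public class LoggerLevelTracker {

    // Last-modified timestamps, kept purely in memory and therefore lost on
    // worker restart, which is why they read back as null until a namespace
    // is modified again after startup.
    private final Map<String, Long> lastModifiedMs = new ConcurrentHashMap<>();

    /**
     * Apply a level change for a namespace and record when it took effect.
     * The timestamp is the worker's wallclock time right after setLevel(),
     * whether the request used scope=worker or scope=cluster.
     */
    public void setLevel(String namespace, String levelName) {
        Level level = levelName == null ? null : Level.toLevel(levelName, null);
        if (level == null) {
            // Invalid levels are surfaced synchronously, even for cluster-scoped requests
            throw new IllegalArgumentException("Invalid log level: " + levelName);
        }
        Logger logger = Logger.getLogger(namespace);
        if (level.equals(logger.getLevel())) {
            // Repeating an identical request is a no-op, so the timestamp stays put
            return;
        }
        logger.setLevel(level);
        lastModifiedMs.put(namespace, System.currentTimeMillis());
    }

    /** Null if the namespace hasn't been modified since the worker finished startup. */
    public Long lastModified(String namespace) {
        return lastModifiedMs.get(namespace);
    }
}
```

If that matches your intent, the "null until modified after startup" behaviour also falls out naturally, since the map starts empty on every worker restart.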
On the persistence front, I hadn't considered the interplay between levels set by the log level REST APIs and those set by the log4j configuration files, and the user confusion that could arise from it. I think your argument for keeping the changes made by these log level admin endpoints ephemeral makes sense, thanks!

Hi Sagar,

> The entire namespace name in the config records key (along with logger-cluster prefix) seems to be a bit too verbose. My first thought was to not have the prefix org.apache.kafka.connect in the keys considering it is the root namespace. But since logging can be enabled at a root level, can we just use initials like (o.a.k.c) which is also a standard practice.

We allow modifying the log levels for any namespace - i.e. even packages and classes outside of Kafka Connect itself (think connector classes, dependencies etc.). I'm not sure I follow how we'd avoid naming clashes without using the fully qualified name?

> I was also thinking if we could introduce a new parameter which takes a subset of worker ids and enables logging for them in one go

The future section already talks about potentially introducing new scopes such as a specific connector or a specific connector plugin. Are there any use cases apart from these that would be satisfied by allowing changing the log levels on a subset of workers in a Kafka Connect cluster?

Thanks,
Yash

On Wed, Sep 6, 2023 at 7:41 AM Sagar <sagarmeansoc...@gmail.com> wrote:

> Hey Chris,
>
> Thanks for making the updates.
>
> The updated definition of last_modified looks good to me. As a continuation to point number 2, could we also mention that this could be used to get insights into the propagation of the cluster wide log level updates? It is implicit but probably better to add it, I feel.
>
> Regarding
>
> > It's a little indirect on the front of worker restart behavior, so let me know if that especially should be fleshed out a bit more (possibly by calling it out in the "Config topic records" section).
>
> Yeah I would lean on the side of calling it out explicitly. Since the behaviour is similar to the current dynamically set log levels (i.e. resetting to the log4j config files' levels), we can call it out stating that similarity just for completeness' sake. It would be useful info for new/medium level users reading the KIP, considering that worker restarts are not uncommon.
>
> > Thanks, I'm glad that this seems reasonable. I've tentatively added this to the rejected alternatives section, but am happy to continue the conversation if anyone wants to explore that possibility further.
>
> +1
>
> I had a nit level suggestion but not sure if it makes sense but would still call it out. The entire namespace name in the config records key (along with logger-cluster prefix) seems to be a bit too verbose. My first thought was to not have the prefix org.apache.kafka.connect in the keys considering it is the root namespace. But since logging can be enabled at a root level, can we just use initials like (o.a.k.c) which is also a standard practice. Let me know what you think?
>
> Lastly, I was also thinking if we could introduce a new parameter which takes a subset of worker ids and enables logging for them in one go. But this is already achievable by invoking the scope=worker endpoint n times to reflect on n workers, so maybe not a necessary change. But this could be useful on a large cluster. Do you think this is worth listing in the Future Work section?
> It's not important, so it can be ignored as well.
>
> Thanks!
> Sagar.
>
> On Wed, Sep 6, 2023 at 12:08 AM Chris Egerton <chr...@aiven.io.invalid> wrote:
>
> > Hi Sagar,
> >
> > Thanks for your thoughts! Responses inline:
> >
> > > 1) Considering that you have now clarified here that the last_modified field would be stored in-memory, it is not mentioned in the KIP. Although that's something that's understandable, it wasn't apparent when reading the KIP. Probably, we could mention it? Also, what happens if a worker restarts? In that case, since the log level update message would be pre-dating the startup of the worker, it would be ignored? We should probably mention that behaviour as well IMO.
> >
> > I've tweaked the second paragraph of the "Last modified timestamp" section to try to clarify this without getting too verbose: "Modification times will be tracked in-memory and determined by when they are applied by the worker, as opposed to when they are requested by the user or persisted to the config topic (details below). If no modifications to the namespace have been made since the worker finished startup, they will be null." Does that feel sufficient? It's a little indirect on the front of worker restart behavior, so let me know if that especially should be fleshed out a bit more (possibly by calling it out in the "Config topic records" section).
> >
> > > 2) Staying on the last_modified field, its utility is also not too clear to me. Can you add some examples of how it can be useful for debugging etc.?
> >
> > The cluster-wide API relaxes the consistency guarantees of the existing worker-local API. With the latter, users can be certain that once they receive a 2xx response, the logging level on that worker has been updated. With the former, users know that the logging level will eventually be updated, but insight into the propagation of that update across the cluster is limited. Although it's a little primitive, I'm hoping that the last modified timestamp will be enough to help shed some light on this process. We could also explore exposing the provenance of logging levels (which maps fairly cleanly to the scope of the request from which they originated), but that feels like overkill at the moment.
> >
> > > 3) In the test plan, should we also add a test when the scope parameter passed is non-null and neither worker nor cluster? In this case the behaviour should be similar to the default case.
> >
> > Good call! Done.
> >
> > > 4) I had the same question as Yash regarding persistent cluster-wide logging levels. I think you have explained it well and we can skip it for now.
> >
> > Thanks, I'm glad that this seems reasonable. I've tentatively added this to the rejected alternatives section, but am happy to continue the conversation if anyone wants to explore that possibility further.
> >
> > Cheers,
> >
> > Chris
> >
> > On Tue, Sep 5, 2023 at 11:48 AM Sagar <sagarmeansoc...@gmail.com> wrote:
> >
> > > Hey Chris,
> > >
> > > Thanks for the KIP. Seems like a useful feature. I have some more questions/comments:
> > >
> > > 1) Considering that you have now clarified here that the last_modified field would be stored in-memory, it is not mentioned in the KIP. Although that's something that's understandable, it wasn't apparent when reading the KIP. Probably, we could mention it? Also, what happens if a worker restarts?
> > > In that case, since the log level update message would be pre-dating the startup of the worker, it would be ignored? We should probably mention that behaviour as well IMO.
> > >
> > > 2) Staying on the last_modified field, its utility is also not too clear to me. Can you add some examples of how it can be useful for debugging etc.?
> > >
> > > 3) In the test plan, should we also add a test when the scope parameter passed is non-null and neither worker nor cluster? In this case the behaviour should be similar to the default case.
> > >
> > > 4) I had the same question as Yash regarding persistent cluster-wide logging levels. I think you have explained it well and we can skip it for now.
> > >
> > > Thanks!
> > > Sagar.
> > >
> > > On Tue, Sep 5, 2023 at 8:49 PM Chris Egerton <chr...@aiven.io.invalid> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Thank you so much for the generous review comments! Happy to see interest in this feature. Inline responses follow.
> > > >
> > > > Ashwin:
> > > >
> > > > > Don't you foresee a future scope type which may require cluster metadata ? In that case, isn't it better to forward the requests to the leader in the initial implementation ?
> > > >
> > > > I agree with Yash here: we can cross that bridge when we come to it, unless there are problems that we'd encounter then that would be better addressed by adding request forwarding now. One potential argument I can think of is that the UX would be a little strange if the semantics for this endpoint differ depending on the value of the scope parameter (some values would be affected by in-progress rebalances, and some would not), but I don't know if making scope=cluster more brittle in the name of consistency with, e.g., scope=connectorType:foo is really a worthwhile tradeoff. Thoughts?
> > > >
> > > > > I would also recommend an additional system test for Standalone herder to ensure that the new scope parameter is honored and the response contains the last modified time.
> > > >
> > > > Ah, great call! I love the new testing plan section. I also share Yash's concerns about adding a new system test though (at this point, they're so painful to write, test, and debug that in most circumstances I consider them a last resort). Do you think it'd be reasonable to add end-to-end verification for this with an integration test instead?
> > > >
> > > > Yash:
> > > >
> > > > > From the proposed changes section, it isn't very clear to me how we'll be tracking this last modified timestamp to be returned in responses for the *GET /admin/loggers* and *GET /admin/loggers/{logger}* endpoints. Could you please elaborate on this? Also, will we track the last modified timestamp even for worker scoped modifications where we won't write any records to the config topic and the requests will essentially be processed synchronously?
> > > >
> > > > RE timestamp tracking: I was thinking we'd store the timestamp for each level in-memory and, whenever we change the level for a namespace, update its timestamp to the current wallclock time.
> > > > Concretely, this means we'd want the timestamp for some logger `logger` to be as soon as possible after the call to `logger.setLevel(level)` for some level `level`. I'm honestly unsure how to clarify this further in the KIP; is there anything in there that strikes you as particularly ambiguous that we can tweak to be more clear?
> > > >
> > > > RE scope distinction for timestamps: I've updated the KIP to clarify this point, adding this sentence: "Timestamps will be updated regardless of whether the namespace update was applied using scope=worker or scope=cluster." Let me know what you think.
> > > >
> > > > > In the current synchronous implementation for the *PUT /admin/loggers/{logger}* endpoint, we return a 404 error if the level is invalid (i.e. not one of TRACE, DEBUG, WARN etc.). Since the new cluster scoped variant of the endpoint will be asynchronous, can we also add a validation to synchronously surface erroneous log levels to users?
> > > >
> > > > Good call! I think we don't have to be explicit about this in the proposed changes section, but it's a great fit for the testing plan, where I've added it: "Ensure that cluster-scoped requests with invalid logging levels are rejected with a 404 response".
> > > >
> > > > > I'm curious to know what the rationale here is? In KIP-745, the stated reasoning behind ignoring restart requests during worker startup was that the worker will anyway be starting all connectors and tasks assigned to it, so the restart request is essentially meaningless. With the log level API however, wouldn't it make more sense to apply any cluster scoped modifications to new workers in the cluster too? This also applies to any workers that are restarted soon after a request is made to *PUT /admin/loggers/{logger}?scope=cluster* on another worker. Maybe we could stack up all the cluster scoped log level modification requests during the config topic read at worker startup and apply the latest ones per namespace (assuming older records haven't already been compacted) after we've finished reading to the end of the config topic?
> > > >
> > > > It's definitely possible to implement persistent cluster-wide logging level adjustments, with the possible exception that none will be applied until the worker has finished reading to the end of the config topic. The rationale for keeping these updates ephemeral is to continue to give priority to workers' Log4j configuration files, with the underlying philosophy that this endpoint is still only intended for debugging purposes, as opposed to cluster-wide configuration. Permanent changes can already be made by tweaking the Log4j file for a worker and then restarting it. If a restart is too expensive for a permanent change, then the change can be applied immediately via the REST API, and staged via the Log4j configuration file (which will then be used the next time the worker is restarted, whenever that happens).
> > > >
> > > > We could potentially explore persistent cluster-wide logging adjustments in the future (or, if people feel strongly, right now), but it'd be more complex.
> > > > How would we handle the case where a user has updated their Log4j configuration file but continues to see the same logging level for a namespace, in a way that greatly reduces or even eliminates the odds of headaches or footguns? How would we allow users to reset or undo cluster-wide configuration updates, in order to revert back to whatever's specified in their Log4j configuration file? Would we want to give users control over whether a dynamic update is persistent or ephemeral? What would the implications be for finer-grained scoping (especially with regards to resetting dynamic changes)? Overall it doesn't seem worth the work to tackle all of this right now, as long as we're not painting ourselves into a corner. But as always, happy to hear what you and others think!
> > > >
> > > > Federico:
> > > >
> > > > > Due to the idempotent nature of PUT, I guess that the last_modified timestamp won't change if the same request is repeated successively. Should we add a unit test for that?
> > > >
> > > > Great call! I've added this to the testing plan with unit and end-to-end tests (seems fairly cheap to do both; LMK if e2e feels like overkill though).
> > > >
> > > > Thanks again for all of your thoughts!
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Mon, Sep 4, 2023 at 2:53 AM Federico Valeri <fedeval...@gmail.com> wrote:
> > > >
> > > > > Hi Chris, thanks. This looks like a useful feature.
> > > > >
> > > > > Due to the idempotent nature of PUT, I guess that the last_modified timestamp won't change if the same request is repeated successively. Should we add a unit test for that?
> > > > >
> > > > > On Mon, Sep 4, 2023 at 6:17 AM Ashwin <apan...@confluent.io.invalid> wrote:
> > > > >
> > > > > > Hi Chris,
> > > > > >
> > > > > > Thanks for thinking about this useful feature ! I had a question regarding
> > > > > >
> > > > > > > Since cluster metadata is not required to handle these types of request, they will not be forwarded to the leader
> > > > > >
> > > > > > And later, we also mention about supporting more scope types in the future. Don't you foresee a future scope type which may require cluster metadata ? In that case, isn't it better to forward the requests to the leader in the initial implementation ?
> > > > > >
> > > > > > I would also recommend an additional system test for Standalone herder to ensure that the new scope parameter is honored and the response contains the last modified time.
> > > > > >
> > > > > > Thanks,
> > > > > > Ashwin
> > > > > >
> > > > > > On Sat, Sep 2, 2023 at 5:12 AM Chris Egerton <chr...@aiven.io.invalid> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > Can't imagine a worse time to publish a new KIP (it's late on a Friday and we're in the middle of the 3.6.0 release), but I wanted to put forth KIP-976 for discussion:
> > > > > > >
> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-976%3A+Cluster-wide+dynamic+log+adjustment+for+Kafka+Connect
> > > > > > >
> > > > > > > TL;DR: The API to dynamically adjust log levels at runtime with Connect is great, and I'd like to augment it with support to adjust log levels for every worker in the cluster (instead of just the worker that receives the REST request).
> > > > > > >
> > > > > > > I look forward to everyone's thoughts, but expect that this will probably take a bump or two once the dust has settled on 3.6.0. Huge thanks to everyone that's contributed to that release so far, especially our release manager Satish!
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Chris