Hi all,

Thanks again for the reviews!


Sagar:

> The updated definition of last_modified looks good to me. As a
> continuation to point number 2, could we also mention that this could be
> used to get insights into the propagation of the cluster wide log level
> updates. It is implicit but probably better to add it I feel.

Sure, done. Added to the end of the "Config topic records" section: "There
may be some delay between when a REST request with scope=cluster is
received and when all workers have read the corresponding record from the
config topic. The last modified timestamp (details above) can serve as a
rudimentary tool to provide insight into the propagation of a cluster-wide
log level adjustment."

> Yeah I would lean on the side of calling it out explicitly. Since the
> behaviour is similar to the current dynamically set log levels (i.e
> resetting to the log4j config files levels) so we can call it out stating
> that similarity just for completeness sake. It would be useful info for
> new/medium level users reading the KIP considering worker restarts is not
> uncommon.

Alright, did this too. Added near the end of the "Config topic records"
section: "Restarting a worker will cause it to discard all cluster-wide
dynamic log level adjustments, and revert to the levels specified in its
Log4j configuration. This mirrors the current behavior with per-worker
dynamic log level adjustments."

> I had a nit level suggestion but not sure if it makes sense but would
> still call it out. The entire namespace name in the config records key
> (along with logger-cluster prefix) seems to be a bit too verbose. My first
> thought was to not have the prefix org.apache.kafka.connect in the keys
> considering it is the root namespace. But since logging can be enabled at
> a root level, can we just use initials like (o.a.k.c) which is also a
> standard practice. Let me know what you think?

Considering these records aren't meant to be user-visible, there doesn't
seem to be a pressing need to reduce their key sizes (though I'll happily
admit that to human eyes, the format is a bit ugly). IMO the increase in
implementation complexity isn't quite worth it, especially considering
there are plenty of logging namespaces that won't begin with
"org.apache.kafka.connect" (likely including all third-party connector
code), like Yash mentions. Is there a motivation for this suggestion that
I'm missing?
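
For anyone following along, the records in question look roughly like
this; the "logger-cluster" prefix is the one mentioned above, and the
separator and value schema here are illustrative shorthand rather than
the exact record format:

# Rough shape of a cluster-scoped logging record in the config topic.
namespace = "org.apache.kafka.connect.runtime.WorkerSourceTask"
record_key = f"logger-cluster-{namespace}"
record_value = {"level": "TRACE"}

# Abbreviating the Connect root package would only help keys like the one
# above (e.g. "logger-cluster-o.a.k.c.runtime.WorkerSourceTask"); third-party
# namespaces such as "io.debezium.connector.mysql" would still need their
# fully qualified names.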

> Lastly, I was also thinking if we could introduce a new parameter which
> takes a subset of worker ids and enables logging for them in one go. But
> this is already achievable by invoking scope=worker endpoint n times to
> reflect on n workers so maybe not a necessary change. But this could be
> useful on a large cluster. Do you think this is worth listing in the
> Future Work section? It's not important so can be ignored as well.

Hmmm... I think I'd rather leave this out for now because I'm just not
certain enough it'd be useful. The one advantage I can think of is
targeting specific workers that are behind a load balancer, but being able
to identify those workers would be a challenge in that scenario anyways.
Besides that, are there any cases that couldn't be addressed more
holistically by targeting based on connector/connector type, like Yash asks?


Ashwin:

Glad we're on the same page RE request forwarding and integration vs.
system tests! Let me know if anything else comes up that you'd like to
discuss.


Yash:

Glad that it makes sense to keep these changes ephemeral. I'm not quite
confident enough to put persistent updates in the "Future work" section but
have a sneaking suspicion that this isn't the last we'll see of that
request. Time will tell...


Thanks again, all!

Cheers,

Chris

On Wed, Sep 6, 2023 at 8:36 AM Yash Mayya <yash.ma...@gmail.com> wrote:

> Hi Chris,
>
> Thanks for the clarification on the last modified timestamp tracking here
> and on the KIP, things look good to me now.
>
> On the persistence front, I hadn't considered the interplay between levels
> set by the log level REST APIs and those set by the log4j configuration
> files, and the user confusion that could arise from it. I think your
> argument for keeping the changes made by these log level admin endpoints
> ephemeral makes sense, thanks!
>
> Hi Sagar,
>
> > The entire namespace name in the config
> > records key (along with logger-cluster prefix)
> > seems to be a bit too verbose. My first
> > thought was to not have the prefix
> > org.apache.kafka.connect in the keys
> > considering it is the root namespace. But
> > since logging can be enabled at a root level,
> > can we just use initials like (o.a.k.c) which is
> > also a standard practice.
>
> We allow modifying the log levels for any namespace - i.e. even packages
> and classes outside of Kafka Connect itself (think connector classes,
> dependencies etc.). I'm not sure I follow how we'd avoid naming clashes
> without using the fully qualified name?
>
> > I was also thinking if we could introduce a
> >  new parameter which takes a subset of
> > worker ids and enables logging for them in
> > one go
>
> The future section already talks about potentially introducing new scopes
> such as a specific connector or a specific connector plugin. Are there any
> use cases apart from these that would be satisfied by allowing changing the
> log levels on a subset of workers in a Kafka Connect cluster?
>
> Thanks,
> Yash
>
> On Wed, Sep 6, 2023 at 7:41 AM Sagar <sagarmeansoc...@gmail.com> wrote:
>
> > Hey Chris,
> >
> > Thanks for making the updates.
> >
> > The updated definition of last_modified looks good to me. As a
> continuation
> > to point number 2, could we also mention that this could be used to get
> > insights into the propagation of the cluster wide log level updates. It
> is
> > implicit but probably better to add it I feel.
> >
> > Regarding
> >
> > It's a little indirect on the front of worker restart behavior, so let me
> > > know if that especially should be fleshed out a bit more (possibly by
> > > calling it out in the "Config topic records" section).
> >
> >
> > Yeah I would lean on the side of calling it out explicitly. Since the
> > behaviour is similar to the current dynamically set log levels (i.e
> > resetting to the log4j config files levels) so we can call it out stating
> > that similarity just for completeness sake. It would be useful info for
> > new/medium level users reading the KIP considering worker restarts is not
> > uncommon.
> >
> >
> > Thanks, I'm glad that this seems reasonable. I've tentatively added this
> to
> > > the rejected alternatives section, but am happy to continue the
> > > conversation if anyone wants to explore that possibility further.
> >
> >
> > +1
> >
> > I had a nit level suggestion but not sure if it makes sense but would
> still
> > call it out. The entire namespace name in the config records key (along
> > with logger-cluster prefix) seems to be a bit too verbose. My first
> thought
> > was to not have the prefix org.apache.kafka.connect in the keys
> considering
> > it is the root namespace. But since logging can be enabled at a root
> level,
> > can we just use initials like (o.a.k.c) which is also a standard
> practice.
> > Let me know what you think?
> >
> > Lastly, I was also thinking if we could introduce a new parameter which
> > takes a subset of worker ids and enables logging for them in one go. But
> > this is already achievable by invoking scope=worker endpoint n times to
> > reflect on n workers so maybe not a necessary change. But this could be
> > useful on a large cluster. Do you think this is worth listing in the
> Future
> > Work section? It's not important so can be ignored as well.
> >
> > Thanks!
> > Sagar.
> >
> >
> > On Wed, Sep 6, 2023 at 12:08 AM Chris Egerton <chr...@aiven.io.invalid>
> > wrote:
> >
> > > Hi Sagar,
> > >
> > > Thanks for your thoughts! Responses inline:
> > >
> > > > 1) Considering that you have now clarified here that last_modified
> > field
> > > would be stored in-memory, it is not mentioned in the KIP. Although
> > that's
> > > something that's understandable, it wasn't apparent when reading the
> KIP.
> > > Probably, we could mention it? Also, what happens if a worker restarts?
> > In
> > > that case, since the log level update message would be pre-dating the
> > > startup of the worker, it would be ignored? We should probably mention
> > that
> > > behaviour as well IMO.
> > >
> > > I've tweaked the second paragraph of the "Last modified timestamp"
> > section
> > > to try to clarify this without getting too verbose: "Modification times
> > > will be tracked in-memory and determined by when they are applied by
> the
> > > worker, as opposed to when they are requested by the user or persisted
> to
> > > the config topic (details below). If no modifications to the namespace
> > have
> > > been made since the worker finished startup, they will be null." Does
> > that
> > > feel sufficient? It's a little indirect on the front of worker restart
> > > behavior, so let me know if that especially should be fleshed out a bit
> > > more (possibly by calling it out in the "Config topic records"
> section).
> > >
> > > > 2) Staying on the last_modified field, it's utility is also not too
> > clear
> > > to me. Can you add some examples of how it can be useful for debugging
> > etc?
> > >
> > > The cluster-wide API relaxes the consistency guarantees of the existing
> > > worker-local API. With the latter, users can be certain that once they
> > > receive a 2xx response, the logging level on that worker has been
> > updated.
> > > With the former, users know that the logging level will eventually be
> > > updated, but insight into the propagation of that update across the
> > cluster
> > > is limited. Although it's a little primitive, I'm hoping that the last
> > > modified timestamp will be enough to help shed some light on this
> > process.
> > > We could also explore exposing the provenance of logging levels (which
> > maps
> > > fairly cleanly to the scope of the request from which they originated),
> > but
> > > that feels like overkill at the moment.
> > >
> > > > 3) In the test plan, should we also add a test when the scope
> parameter
> > > passed is non-null and neither worker nor cluster? In this case the
> > > behaviour should be similar to the default case.
> > >
> > > Good call! Done.
> > >
> > > > 4) I had the same question as Yash regarding persistent cluster-wide
> > > logging level. I think you have explained it well and we can skip it
> for
> > > now.
> > >
> > > Thanks, I'm glad that this seems reasonable. I've tentatively added
> this
> > to
> > > the rejected alternatives section, but am happy to continue the
> > > conversation if anyone wants to explore that possibility further.
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Tue, Sep 5, 2023 at 11:48 AM Sagar <sagarmeansoc...@gmail.com>
> wrote:
> > >
> > > > Hey Chris,
> > > >
> > > > Thanks for the KIP. Seems like a useful feature. I have some more
> > > > questions/comments:
> > > >
> > > > 1) Considering that you have now clarified here that last_modified
> > field
> > > > would be stored in-memory, it is not mentioned in the KIP. Although
> > > that's
> > > > something that's understandable, it wasn't apparent when reading the
> > KIP.
> > > > Probably, we could mention it? Also, what happens if a worker
> restarts?
> > > In
> > > > that case, since the log level update message would be pre-dating the
> > > > startup of the worker, it would be ignored? We should probably
> mention
> > > that
> > > > behaviour as well IMO.
> > > >
> > > > 2) Staying on the last_modified field, it's utility is also not too
> > clear
> > > > to me. Can you add some examples of how it can be useful for
> debugging
> > > etc?
> > > >
> > > > 3) In the test plan, should we also add a test when the scope
> parameter
> > > > passed is non-null and neither worker nor cluster? In this case the
> > > > behaviour should be similar to the default case.
> > > >
> > > > 4) I had the same question as Yash regarding persistent cluster-wide
> > > > logging level. I think you have explained it well and we can skip it
> > for
> > > > now.
> > > >
> > > > Thanks!
> > > > Sagar.
> > > >
> > > > On Tue, Sep 5, 2023 at 8:49 PM Chris Egerton <chr...@aiven.io.invalid
> >
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Thank you so much for the generous review comments! Happy to see
> > > interest
> > > > > in this feature. Inline responses follow.
> > > > >
> > > > >
> > > > > Ashwin:
> > > > >
> > > > > > Don't you foresee a future scope type which may require cluster
> > > > metadata
> > > > > ?
> > > > > In that case, isn't it better to forward the requests to the leader
> > in
> > > > the
> > > > > initial implementation ?
> > > > >
> > > > > I agree with Yash here: we can cross that bridge when we come to
> it,
> > > > unless
> > > > > there are problems that we'd encounter then that would be better
> > > > addressed
> > > > > by adding request forwarding now. One potential argument I can
> think
> > of
> > > > is
> > > > > that the UX would be a little strange if the semantics for this
> > > endpoint
> > > > > differ depending on the value of the scope parameter (some values
> > would
> > > > be
> > > > > affected by in-progress rebalances, and some would not), but I
> don't
> > > know
> > > > > if making scope=cluster more brittle in the name of consistency
> with,
> > > > e.g.,
> > > > > scope=connectorType:foo is really a worthwhile tradeoff. Thoughts?
> > > > >
> > > > > > I would also recommend an additional system test for Standalone
> > > herder
> > > > to
> > > > > ensure that the new scope parameter is honored and the response
> > > contains
> > > > > the last modified time.
> > > > >
> > > > > Ah, great call! I love the new testing plan section. I also share
> > > Yash's
> > > > > concerns about adding a new system test though (at this point,
> > they're
> > > so
> > > > > painful to write, test, and debug that in most circumstances I
> > consider
> > > > > them a last resort). Do you think it'd be reasonable to add
> > end-to-end
> > > > > verification for this with an integration test instead?
> > > > >
> > > > >
> > > > > Yash:
> > > > >
> > > > > > From the proposed changes section, it isn't very clear to me how
> > > we'll
> > > > be
> > > > > tracking this last modified timestamp to be returned in responses
> for
> > > the
> > > > > *GET
> > > > > /admin/loggers* and *GET /admin/loggers/{logger}* endpoints. Could
> > you
> > > > > please elaborate on this? Also, will we track the last modified
> > > timestamp
> > > > > even for worker scoped modifications where we won't write any
> records
> > > to
> > > > > the config topic and the requests will essentially be processed
> > > > > synchronously?
> > > > >
> > > > > RE timestamp tracking: I was thinking we'd store the timestamp for
> > each
> > > > > level in-memory and, whenever we change the level for a namespace,
> > > update
> > > > > its timestamp to the current wallclock time. Concretely, this means
> > > we'd
> > > > > want the timestamp for some logger `logger` to be as soon as
> possible
> > > > after
> > > > > the call to `logger.setLevel(level)` for some level `level`. I'm
> > > honestly
> > > > > unsure how to clarify this further in the KIP; is there anything in
> > > there
> > > > > that strikes you as particularly ambiguous that we can tweak to be
> > more
> > > > > clear?
> > > > >
> > > > > RE scope distinction for timestamps: I've updated the KIP to
> clarify
> > > this
> > > > > point, adding this sentence: "Timestamps will be updated regardless
> > of
> > > > > whether the namespace update was applied using scope=worker or
> > > > > scope=cluster.". Let me know what you think
> > > > >
> > > > > > In the current synchronous implementation for the *PUT
> > > > > /admin/loggers/{logger} *endpoint, we return a 404 error if the
> level
> > > is
> > > > > invalid (i.e. not one of TRACE, DEBUG, WARN etc.). Since the new
> > > cluster
> > > > > scoped variant of the endpoint will be asynchronous, can we also
> add
> > a
> > > > > validation to synchronously surface erroneous log levels to users?
> > > > >
> > > > > Good call! I think we don't have to be explicit about this in the
> > > > proposed
> > > > > changes section, but it's a great fit for the testing plan, where
> > I've
> > > > > added it: "Ensure that cluster-scoped requests with invalid logging
> > > > levels
> > > > > are rejected with a 404 response"
> > > > >
> > > > > > I'm curious to know what the rationale here is? In KIP-745, the
> > > stated
> > > > > reasoning behind ignoring restart requests during worker startup
> was
> > > that
> > > > > the worker will anyway be starting all connectors and tasks
> assigned
> > to
> > > > it
> > > > > so the restart request is essentially meaningless. With the log
> level
> > > API
> > > > > however, wouldn't it make more sense to apply any cluster scoped
> > > > > modifications to new workers in the cluster too? This also applies
> to
> > > any
> > > > > workers that are restarted soon after a request is made to *PUT
> > > > > /admin/loggers/{logger}?scope=cluster *on another worker. Maybe we
> > > could
> > > > > stack up all the cluster scoped log level modification requests
> > during
> > > > the
> > > > > config topic read at worker startup and apply the latest ones per
> > > > namespace
> > > > > (assuming older records haven't already been compacted) after we've
> > > > > finished reading to the end of the config topic?
> > > > >
> > > > > It's definitely possible to implement persistent cluster-wide
> logging
> > > > level
> > > > > adjustments, with the possible exception that none will be applied
> > > until
> > > > > the worker has finished reading to the end of the config topic. The
> > > > > rationale for keeping these updates ephemeral is to continue to
> give
> > > > > priority to workers' Log4j configuration files, with the underlying
> > > > > philosophy that this endpoint is still only intended for debugging
> > > > > purposes, as opposed to cluster-wide configuration. Permanent
> changes
> > > can
> > > > > already be made by tweaking the Log4j file for a worker and then
> > > > restarting
> > > > > it. If a restart is too expensive for a permanent change, then the
> > > change
> > > > > can be applied immediately via the REST API, and staged via the
> Log4j
> > > > > configuration file (which will then be used the next time the
> worker
> > is
> > > > > restarted, whenever that happens).
> > > > >
> > > > > We could potentially explore persistent cluster-wide logging
> > > adjustments
> > > > in
> > > > > the future (or, if people feel strongly, right now), but it'd be
> more
> > > > > complex. How would we handle the case where a user has updated
> their
> > > > Log4j
> > > > > configuration file but continues to see the same logging level for
> a
> > > > > namespace, in a way that greatly reduces or even eliminates the
> odds
> > of
> > > > > headaches or footguns? How would we allow users to reset or undo
> > > > > cluster-wide configuration updates, in order to revert back to
> > > whatever's
> > > > > specified in their Log4j configuration file? Would we want to give
> > > users
> > > > > control over whether a dynamic update is persistent or ephemeral?
> > What
> > > > > would the implications be for finer-grained scoping (especially
> with
> > > > > regards to resetting dynamic changes)? Overall it doesn't seem
> worth
> > > the
> > > > > work to tackle all of this right now, as long as we're not painting
> > > > > ourselves into a corner. But as always, happy to hear what you and
> > > others
> > > > > think!
> > > > >
> > > > >
> > > > > Federico:
> > > > >
> > > > > > Due to the idempotent nature of PUT, I guess that the
> last_modified
> > > > > timestamp won't change if the same request is repeated
> successively.
> > > > > Should we add a unit test for that?
> > > > >
> > > > > Great call! I've added this to the testing plan with unit and
> > > end-to-end
> > > > > tests (seems fairly cheap to do both; LMK if e2e feels like
> overkill
> > > > > though).
> > > > >
> > > > >
> > > > > Thanks again for all of your thoughts!
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Chris
> > > > >
> > > > > On Mon, Sep 4, 2023 at 2:53 AM Federico Valeri <
> fedeval...@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > Hi Chris, thanks. This looks like a useful feature.
> > > > > >
> > > > > > Due to the idempotent nature of PUT, I guess that the
> last_modified
> > > > > > timestamp won't change if the same request is repeated
> > successively.
> > > > > > Should we add a unit test for that?
> > > > > >
> > > > > > On Mon, Sep 4, 2023 at 6:17 AM Ashwin
> <apan...@confluent.io.invalid
> > >
> > > > > > wrote:
> > > > > > >
> > > > > > > Hi Chris,
> > > > > > >
> > > > > > > Thanks for thinking about this useful feature !
> > > > > > > I had a question regarding
> > > > > > >
> > > > > > > > Since cluster metadata is not required to handle these types
> of
> > > > > > request,
> > > > > > > they will not be forwarded to the leader
> > > > > > >
> > > > > > > And later, we also mention about supporting more scope types in
> > the
> > > > > > future.
> > > > > > > Don't you foresee a future scope type which may require cluster
> > > > > metadata
> > > > > > ?
> > > > > > > In that case, isn't it better to forward the requests to the
> > leader
> > > > in
> > > > > > the
> > > > > > > initial implementation ?
> > > > > > >
> > > > > > > I would also recommend an additional system test for Standalone
> > > > herder
> > > > > to
> > > > > > > ensure that the new scope parameter is honored and the response
> > > > > contains
> > > > > > > the last modified time.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Ashwin
> > > > > > >
> > > > > > > On Sat, Sep 2, 2023 at 5:12 AM Chris Egerton
> > > <chr...@aiven.io.invalid
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > Can't imagine a worse time to publish a new KIP (it's late
> on a
> > > > > Friday
> > > > > > and
> > > > > > > > we're in the middle of the 3.6.0 release), but I wanted to
> put
> > > > forth
> > > > > > > > KIP-976 for discussion:
> > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-976%3A+Cluster-wide+dynamic+log+adjustment+for+Kafka+Connect
> > > > > > > >
> > > > > > > > TL;DR: The API to dynamically adjust log levels at runtime
> with
> > > > > > Connect is
> > > > > > > > great, and I'd like to augment it with support to adjust log
> > > levels
> > > > > for
> > > > > > > > every worker in the cluster (instead of just the worker that
> > > > receives
> > > > > > the
> > > > > > > > REST request).
> > > > > > > >
> > > > > > > > I look forward to everyone's thoughts, but expect that this
> > will
> > > > > > probably
> > > > > > > > take a bump or two once the dust has settled on 3.6.0. Huge
> > > thanks
> > > > to
> > > > > > > > everyone that's contributed to that release so far,
> especially
> > > our
> > > > > > release
> > > > > > > > manager Satish!
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > >
> > > > > > > > Chris
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
