Hey Chris,

Thanks for making the updates.

The updated definition of last_modified looks good to me. As a continuation
of point number 2, could we also mention that it could be used to get
insight into the propagation of cluster-wide log level updates? It is
implicit, but I feel it's probably better to state it.
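
For instance (response shape as I understand it from the KIP, with a
made-up worker hostname and an epoch-millis timestamp), a user could poll
each worker after a scope=cluster update and compare the last_modified
values to see whether the update has propagated everywhere:

    GET http://worker1:8083/admin/loggers
    {
      "org.apache.kafka.connect.runtime": {
        "level": "DEBUG",
        "last_modified": 1693898400000
      }
    }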

Regarding

> It's a little indirect on the front of worker restart behavior, so let me
> know if that especially should be fleshed out a bit more (possibly by
> calling it out in the "Config topic records" section).


Yeah, I would lean toward calling it out explicitly. Since the behaviour is
similar to that of the current dynamically set log levels (i.e. resetting to
the levels in the log4j config files), we could call out that similarity for
completeness' sake. It would be useful information for new/intermediate
users reading the KIP, considering that worker restarts are not uncommon.
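
Just to make that restart behaviour concrete (an illustrative
connect-log4j.properties snippet; the namespace and levels are made up): if
a user had set org.apache.kafka.connect.runtime to TRACE via scope=cluster,
a restarted worker would simply come back up with whatever its Log4j
configuration file specifies, e.g.:

    # connect-log4j.properties (illustrative)
    log4j.rootLogger=INFO, stdout
    log4j.logger.org.apache.kafka.connect.runtime=INFO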


Thanks, I'm glad that this seems reasonable. I've tentatively added this to
> the rejected alternatives section, but am happy to continue the
> conversation if anyone wants to explore that possibility further.


+1

I had a nit-level suggestion; I'm not sure it makes sense, but I'll call it
out anyway. The full namespace name in the config record keys (along with
the logger-cluster prefix) seems a bit too verbose. My first thought was to
drop the org.apache.kafka.connect prefix from the keys, since it is the root
namespace, but logging can also be enabled at that root level. Could we
instead use initials like o.a.k.c, which is a fairly standard abbreviation?
Let me know what you think.
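
To make that concrete (key layout as I understand it from the "Config topic
records" section; the namespace below is just an example), the comparison
would be something like:

    Current:     logger-cluster-org.apache.kafka.connect.runtime.distributed.DistributedHerder
    Abbreviated: logger-cluster-o.a.k.c.runtime.distributed.DistributedHerder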

Lastly, I was also wondering whether we could introduce a new parameter
that takes a subset of worker IDs and enables logging for all of them in one
go. This is already achievable by invoking the scope=worker endpoint n times
against n workers, so maybe it's not a necessary change, but it could be
useful on a large cluster. Do you think this is worth listing in the Future
Work section? It's not important, so it can be ignored as well.
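
To illustrate (hostnames are made up, and the scope=workers:... syntax below
is purely hypothetical, not something the KIP proposes):

    # Today: one scope=worker request per targeted worker
    PUT http://worker1:8083/admin/loggers/org.apache.kafka.connect?scope=worker
    {"level": "DEBUG"}
    PUT http://worker2:8083/admin/loggers/org.apache.kafka.connect?scope=worker
    {"level": "DEBUG"}

    # Hypothetical: one request targeting a subset of workers
    PUT http://worker1:8083/admin/loggers/org.apache.kafka.connect?scope=workers:worker1,worker2
    {"level": "DEBUG"}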

Thanks!
Sagar.


On Wed, Sep 6, 2023 at 12:08 AM Chris Egerton <chr...@aiven.io.invalid>
wrote:

> Hi Sagar,
>
> Thanks for your thoughts! Responses inline:
>
> > 1) Considering that you have now clarified here that last_modified field
> would be stored in-memory, it is not mentioned in the KIP. Although that's
> something that's understandable, it wasn't apparent when reading the KIP.
> Probably, we could mention it? Also, what happens if a worker restarts? In
> that case, since the log level update message would be pre-dating the
> startup of the worker, it would be ignored? We should probably mention that
> behaviour as well IMO.
>
> I've tweaked the second paragraph of the "Last modified timestamp" section
> to try to clarify this without getting too verbose: "Modification times
> will be tracked in-memory and determined by when they are applied by the
> worker, as opposed to when they are requested by the user or persisted to
> the config topic (details below). If no modifications to the namespace have
> been made since the worker finished startup, they will be null." Does that
> feel sufficient? It's a little indirect on the front of worker restart
> behavior, so let me know if that especially should be fleshed out a bit
> more (possibly by calling it out in the "Config topic records" section).
>
> > 2) Staying on the last_modified field, it's utility is also not too clear
> to me. Can you add some examples of how it can be useful for debugging etc?
>
> The cluster-wide API relaxes the consistency guarantees of the existing
> worker-local API. With the latter, users can be certain that once they
> receive a 2xx response, the logging level on that worker has been updated.
> With the former, users know that the logging level will eventually be
> updated, but insight into the propagation of that update across the cluster
> is limited. Although it's a little primitive, I'm hoping that the last
> modified timestamp will be enough to help shed some light on this process.
> We could also explore exposing the provenance of logging levels (which maps
> fairly cleanly to the scope of the request from which they originated), but
> that feels like overkill at the moment.
>
> > 3) In the test plan, should we also add a test when the scope parameter
> passed is non-null and neither worker nor cluster? In this case the
> behaviour should be similar to the default case.
>
> Good call! Done.
>
> > 4) I had the same question as Yash regarding persistent cluster-wide
> logging level. I think you have explained it well and we can skip it for
> now.
>
> Thanks, I'm glad that this seems reasonable. I've tentatively added this to
> the rejected alternatives section, but am happy to continue the
> conversation if anyone wants to explore that possibility further.
>
> Cheers,
>
> Chris
>
> On Tue, Sep 5, 2023 at 11:48 AM Sagar <sagarmeansoc...@gmail.com> wrote:
>
> > Hey Chris,
> >
> > Thanks for the KIP. Seems like a useful feature. I have some more
> > questions/comments:
> >
> > 1) Considering that you have now clarified here that last_modified field
> > would be stored in-memory, it is not mentioned in the KIP. Although
> that's
> > something that's understandable, it wasn't apparent when reading the KIP.
> > Probably, we could mention it? Also, what happens if a worker restarts?
> In
> > that case, since the log level update message would be pre-dating the
> > startup of the worker, it would be ignored? We should probably mention
> that
> > behaviour as well IMO.
> >
> > 2) Staying on the last_modified field, it's utility is also not too clear
> > to me. Can you add some examples of how it can be useful for debugging
> etc?
> >
> > 3) In the test plan, should we also add a test when the scope parameter
> > passed is non-null and neither worker nor cluster? In this case the
> > behaviour should be similar to the default case.
> >
> > 4) I had the same question as Yash regarding persistent cluster-wide
> > logging level. I think you have explained it well and we can skip it for
> > now.
> >
> > Thanks!
> > Sagar.
> >
> > On Tue, Sep 5, 2023 at 8:49 PM Chris Egerton <chr...@aiven.io.invalid>
> > wrote:
> >
> > > Hi all,
> > >
> > > Thank you so much for the generous review comments! Happy to see
> interest
> > > in this feature. Inline responses follow.
> > >
> > >
> > > Ashwin:
> > >
> > > > Don't you foresee a future scope type which may require cluster
> > metadata
> > > ?
> > > In that case, isn't it better to forward the requests to the leader in
> > the
> > > initial implementation ?
> > >
> > > I agree with Yash here: we can cross that bridge when we come to it,
> > unless
> > > there are problems that we'd encounter then that would be better
> > addressed
> > > by adding request forwarding now. One potential argument I can think of
> > is
> > > that the UX would be a little strange if the semantics for this
> endpoint
> > > differ depending on the value of the scope parameter (some values would
> > be
> > > affected by in-progress rebalances, and some would not), but I don't
> know
> > > if making scope=cluster more brittle in the name of consistency with,
> > e.g.,
> > > scope=connectorType:foo is really a worthwhile tradeoff. Thoughts?
> > >
> > > > I would also recommend an additional system test for Standalone
> herder
> > to
> > > ensure that the new scope parameter is honored and the response
> contains
> > > the last modified time.
> > >
> > > Ah, great call! I love the new testing plan section. I also share
> Yash's
> > > concerns about adding a new system test though (at this point, they're
> so
> > > painful to write, test, and debug that in most circumstances I consider
> > > them a last resort). Do you think it'd be reasonable to add end-to-end
> > > verification for this with an integration test instead?
> > >
> > >
> > > Yash:
> > >
> > > > From the proposed changes section, it isn't very clear to me how
> we'll
> > be
> > > tracking this last modified timestamp to be returned in responses for
> the
> > > *GET
> > > /admin/loggers* and *GET /admin/loggers/{logger}* endpoints. Could you
> > > please elaborate on this? Also, will we track the last modified
> timestamp
> > > even for worker scoped modifications where we won't write any records
> to
> > > the config topic and the requests will essentially be processed
> > > synchronously?
> > >
> > > RE timestamp tracking: I was thinking we'd store the timestamp for each
> > > level in-memory and, whenever we change the level for a namespace,
> update
> > > its timestamp to the current wallclock time. Concretely, this means
> we'd
> > > want the timestamp for some logger `logger` to be as soon as possible
> > after
> > > the call to `logger.setLevel(level)` for some level `level`. I'm
> honestly
> > > unsure how to clarify this further in the KIP; is there anything in
> there
> > > that strikes you as particularly ambiguous that we can tweak to be more
> > > clear?
> > >
> > > RE scope distinction for timestamps: I've updated the KIP to clarify
> this
> > > point, adding this sentence: "Timestamps will be updated regardless of
> > > whether the namespace update was applied using scope=worker or
> > > scope=cluster.". Let me know what you think
> > >
> > > > In the current synchronous implementation for the *PUT
> > > /admin/loggers/{logger} *endpoint, we return a 404 error if the level
> is
> > > invalid (i.e. not one of TRACE, DEBUG, WARN etc.). Since the new
> cluster
> > > scoped variant of the endpoint will be asynchronous, can we also add a
> > > validation to synchronously surface erroneous log levels to users?
> > >
> > > Good call! I think we don't have to be explicit about this in the
> > proposed
> > > changes section, but it's a great fit for the testing plan, where I've
> > > added it: "Ensure that cluster-scoped requests with invalid logging
> > levels
> > > are rejected with a 404 response"
> > >
> > > > I'm curious to know what the rationale here is? In KIP-745, the
> stated
> > > reasoning behind ignoring restart requests during worker startup was
> that
> > > the worker will anyway be starting all connectors and tasks assigned to
> > it
> > > so the restart request is essentially meaningless. With the log level
> API
> > > however, wouldn't it make more sense to apply any cluster scoped
> > > modifications to new workers in the cluster too? This also applies to
> any
> > > workers that are restarted soon after a request is made to *PUT
> > > /admin/loggers/{logger}?scope=cluster *on another worker. Maybe we
> could
> > > stack up all the cluster scoped log level modification requests during
> > the
> > > config topic read at worker startup and apply the latest ones per
> > namespace
> > > (assuming older records haven't already been compacted) after we've
> > > finished reading to the end of the config topic?
> > >
> > > It's definitely possible to implement persistent cluster-wide logging
> > level
> > > adjustments, with the possible exception that none will be applied
> until
> > > the worker has finished reading to the end of the config topic. The
> > > rationale for keeping these updates ephemeral is to continue to give
> > > priority to workers' Log4j configuration files, with the underlying
> > > philosophy that this endpoint is still only intended for debugging
> > > purposes, as opposed to cluster-wide configuration. Permanent changes
> can
> > > already be made by tweaking the Log4j file for a worker and then
> > restarting
> > > it. If a restart is too expensive for a permanent change, then the
> change
> > > can be applied immediately via the REST API, and staged via the Log4j
> > > configuration file (which will then be used the next time the worker is
> > > restarted, whenever that happens).
> > >
> > > We could potentially explore persistent cluster-wide logging
> adjustments
> > in
> > > the future (or, if people feel strongly, right now), but it'd be more
> > > complex. How would we handle the case where a user has updated their
> > Log4j
> > > configuration file but continues to see the same logging level for a
> > > namespace, in a way that greatly reduces or even eliminates the odds of
> > > headaches or footguns? How would we allow users to reset or undo
> > > cluster-wide configuration updates, in order to revert back to
> whatever's
> > > specified in their Log4j configuration file? Would we want to give
> users
> > > control over whether a dynamic update is persistent or ephemeral? What
> > > would the implications be for finer-grained scoping (especially with
> > > regards to resetting dynamic changes)? Overall it doesn't seem worth
> the
> > > work to tackle all of this right now, as long as we're not painting
> > > ourselves into a corner. But as always, happy to hear what you and
> others
> > > think!
> > >
> > >
> > > Federico:
> > >
> > > > Due to the idempotent nature of PUT, I guess that the last_modified
> > > timestamp won't change if the same request is repeated successively.
> > > Should we add a unit test for that?
> > >
> > > Great call! I've added this to the testing plan with unit and
> end-to-end
> > > tests (seems fairly cheap to do both; LMK if e2e feels like overkill
> > > though).
> > >
> > >
> > > Thanks again for all of your thoughts!
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Mon, Sep 4, 2023 at 2:53 AM Federico Valeri <fedeval...@gmail.com>
> > > wrote:
> > >
> > > > Hi Chris, thanks. This looks like a useful feature.
> > > >
> > > > Due to the idempotent nature of PUT, I guess that the last_modified
> > > > timestamp won't change if the same request is repeated successively.
> > > > Should we add a unit test for that?
> > > >
> > > > On Mon, Sep 4, 2023 at 6:17 AM Ashwin <apan...@confluent.io.invalid>
> > > > wrote:
> > > > >
> > > > > Hi Chris,
> > > > >
> > > > > Thanks for thinking about this useful feature !
> > > > > I had a question regarding
> > > > >
> > > > > > Since cluster metadata is not required to handle these types of
> > > > request,
> > > > > they will not be forwarded to the leader
> > > > >
> > > > > And later, we also mention about supporting more scope types in the
> > > > future.
> > > > > Don't you foresee a future scope type which may require cluster
> > > metadata
> > > > ?
> > > > > In that case, isn't it better to forward the requests to the leader
> > in
> > > > the
> > > > > initial implementation ?
> > > > >
> > > > > I would also recommend an additional system test for Standalone
> > herder
> > > to
> > > > > ensure that the new scope parameter is honored and the response
> > > contains
> > > > > the last modified time.
> > > > >
> > > > > Thanks,
> > > > > Ashwin
> > > > >
> > > > > On Sat, Sep 2, 2023 at 5:12 AM Chris Egerton
> <chr...@aiven.io.invalid
> > >
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > Can't imagine a worse time to publish a new KIP (it's late on a
> > > Friday
> > > > and
> > > > > > we're in the middle of the 3.6.0 release), but I wanted to put
> > forth
> > > > > > KIP-976 for discussion:
> > > > > >
> > > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-976%3A+Cluster-wide+dynamic+log+adjustment+for+Kafka+Connect
> > > > > >
> > > > > > TL;DR: The API to dynamically adjust log levels at runtime with
> > > > Connect is
> > > > > > great, and I'd like to augment it with support to adjust log
> levels
> > > for
> > > > > > every worker in the cluster (instead of just the worker that
> > receives
> > > > the
> > > > > > REST request).
> > > > > >
> > > > > > I look forward to everyone's thoughts, but expect that this will
> > > > probably
> > > > > > take a bump or two once the dust has settled on 3.6.0. Huge
> thanks
> > to
> > > > > > everyone that's contributed to that release so far, especially
> our
> > > > release
> > > > > > manager Satish!
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Chris
> > > > > >
> > > >
> > >
> >
>
