Hi Chris,

Thanks for the clarification on the last modified timestamp tracking, both here and on the KIP; things look good to me now.
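Just to make sure we're on the same page about the mechanics, here's a rough sketch of how I picture the in-memory tracking on the worker side (illustrative only; the class and method names are made up, and I'm assuming the log4j 1.x API that the Connect runtime currently uses):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

// Hypothetical helper, not the actual Connect implementation.
public class LoggerLevelTracker {

    // Last-modified timestamps, kept purely in memory and therefore lost on
    // worker restart, which is why they read back as null until a namespace
    // is modified again after startup.
    private final Map<String, Long> lastModifiedMs = new ConcurrentHashMap<>();

    /**
     * Apply a level change for a namespace and record when it took effect.
     * The timestamp is the worker's wallclock time right after setLevel(),
     * whether the request used scope=worker or scope=cluster.
     */
    public void setLevel(String namespace, String levelName) {
        Level level = levelName == null ? null : Level.toLevel(levelName, null);
        if (level == null) {
            // Invalid levels are surfaced synchronously, even for cluster-scoped requests
            throw new IllegalArgumentException("Invalid log level: " + levelName);
        }
        Logger logger = Logger.getLogger(namespace);
        if (level.equals(logger.getLevel())) {
            // Repeating an identical request is a no-op, so the timestamp stays put
            return;
        }
        logger.setLevel(level);
        lastModifiedMs.put(namespace, System.currentTimeMillis());
    }

    /** Null if the namespace hasn't been modified since the worker finished startup. */
    public Long lastModified(String namespace) {
        return lastModifiedMs.get(namespace);
    }
}
```

If that matches your intent, the "null until modified after startup" behaviour also falls out naturally, since the map starts empty on every worker restart.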
On the persistence front, I hadn't considered the interplay between levels set by the log level REST APIs and those set by the log4j configuration files, and the user confusion that could arise from it. I think your argument for keeping the changes made by these log level admin endpoints ephemeral makes sense, thanks!

Hi Sagar,

> The entire namespace name in the config records key (along with logger-cluster prefix) seems to be a bit too verbose. My first thought was to not have the prefix org.apache.kafka.connect in the keys considering it is the root namespace. But since logging can be enabled at a root level, can we just use initials like (o.a.k.c) which is also a standard practice.

We allow modifying the log levels for any namespace - i.e. even packages and classes outside of Kafka Connect itself (think connector classes, dependencies etc.). I'm not sure I follow how we'd avoid naming clashes without using the fully qualified name?

> I was also thinking if we could introduce a new parameter which takes a subset of worker ids and enables logging for them in one go

The future section already talks about potentially introducing new scopes such as a specific connector or a specific connector plugin. Are there any use cases apart from these that would be satisfied by allowing changing the log levels on a subset of workers in a Kafka Connect cluster?

Thanks,
Yash

On Wed, Sep 6, 2023 at 7:41 AM Sagar <sagarmeansoc...@gmail.com> wrote:

> Hey Chris,
>
> Thanks for making the updates.
>
> The updated definition of last_modified looks good to me. As a continuation to point number 2, could we also mention that this could be used to get insights into the propagation of the cluster wide log level updates? It is implicit but probably better to add it, I feel.
>
> Regarding
>
> > It's a little indirect on the front of worker restart behavior, so let me know if that especially should be fleshed out a bit more (possibly by calling it out in the "Config topic records" section).
>
> Yeah I would lean on the side of calling it out explicitly. Since the behaviour is similar to the current dynamically set log levels (i.e. resetting to the log4j config files' levels), we can call it out stating that similarity just for completeness' sake. It would be useful info for new/medium level users reading the KIP, considering that worker restarts are not uncommon.
>
> > Thanks, I'm glad that this seems reasonable. I've tentatively added this to the rejected alternatives section, but am happy to continue the conversation if anyone wants to explore that possibility further.
>
> +1
>
> I had a nit level suggestion but not sure if it makes sense but would still call it out. The entire namespace name in the config records key (along with logger-cluster prefix) seems to be a bit too verbose. My first thought was to not have the prefix org.apache.kafka.connect in the keys considering it is the root namespace. But since logging can be enabled at a root level, can we just use initials like (o.a.k.c) which is also a standard practice. Let me know what you think?
>
> Lastly, I was also thinking if we could introduce a new parameter which takes a subset of worker ids and enables logging for them in one go. But this is already achievable by invoking the scope=worker endpoint n times to reflect on n workers, so maybe not a necessary change. But this could be useful on a large cluster. Do you think this is worth listing in the Future Work section?
> It's not important, so it can be ignored as well.
>
> Thanks!
> Sagar.
>
> On Wed, Sep 6, 2023 at 12:08 AM Chris Egerton <chr...@aiven.io.invalid> wrote:
>
> > Hi Sagar,
> >
> > Thanks for your thoughts! Responses inline:
> >
> > > 1) Considering that you have now clarified here that the last_modified field would be stored in-memory, it is not mentioned in the KIP. Although that's something that's understandable, it wasn't apparent when reading the KIP. Probably, we could mention it? Also, what happens if a worker restarts? In that case, since the log level update message would be pre-dating the startup of the worker, it would be ignored? We should probably mention that behaviour as well IMO.
> >
> > I've tweaked the second paragraph of the "Last modified timestamp" section to try to clarify this without getting too verbose: "Modification times will be tracked in-memory and determined by when they are applied by the worker, as opposed to when they are requested by the user or persisted to the config topic (details below). If no modifications to the namespace have been made since the worker finished startup, they will be null." Does that feel sufficient? It's a little indirect on the front of worker restart behavior, so let me know if that especially should be fleshed out a bit more (possibly by calling it out in the "Config topic records" section).
> >
> > > 2) Staying on the last_modified field, its utility is also not too clear to me. Can you add some examples of how it can be useful for debugging etc.?
> >
> > The cluster-wide API relaxes the consistency guarantees of the existing worker-local API. With the latter, users can be certain that once they receive a 2xx response, the logging level on that worker has been updated. With the former, users know that the logging level will eventually be updated, but insight into the propagation of that update across the cluster is limited. Although it's a little primitive, I'm hoping that the last modified timestamp will be enough to help shed some light on this process. We could also explore exposing the provenance of logging levels (which maps fairly cleanly to the scope of the request from which they originated), but that feels like overkill at the moment.
> >
> > > 3) In the test plan, should we also add a test when the scope parameter passed is non-null and neither worker nor cluster? In this case the behaviour should be similar to the default case.
> >
> > Good call! Done.
> >
> > > 4) I had the same question as Yash regarding persistent cluster-wide logging levels. I think you have explained it well and we can skip it for now.
> >
> > Thanks, I'm glad that this seems reasonable. I've tentatively added this to the rejected alternatives section, but am happy to continue the conversation if anyone wants to explore that possibility further.
> >
> > Cheers,
> >
> > Chris
> >
> > On Tue, Sep 5, 2023 at 11:48 AM Sagar <sagarmeansoc...@gmail.com> wrote:
> >
> > > Hey Chris,
> > >
> > > Thanks for the KIP. Seems like a useful feature. I have some more questions/comments:
> > >
> > > 1) Considering that you have now clarified here that the last_modified field would be stored in-memory, it is not mentioned in the KIP. Although that's something that's understandable, it wasn't apparent when reading the KIP. Probably, we could mention it? Also, what happens if a worker restarts?
> > > In that case, since the log level update message would be pre-dating the startup of the worker, it would be ignored? We should probably mention that behaviour as well IMO.
> > >
> > > 2) Staying on the last_modified field, its utility is also not too clear to me. Can you add some examples of how it can be useful for debugging etc.?
> > >
> > > 3) In the test plan, should we also add a test when the scope parameter passed is non-null and neither worker nor cluster? In this case the behaviour should be similar to the default case.
> > >
> > > 4) I had the same question as Yash regarding persistent cluster-wide logging levels. I think you have explained it well and we can skip it for now.
> > >
> > > Thanks!
> > > Sagar.
> > >
> > > On Tue, Sep 5, 2023 at 8:49 PM Chris Egerton <chr...@aiven.io.invalid> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Thank you so much for the generous review comments! Happy to see interest in this feature. Inline responses follow.
> > > >
> > > > Ashwin:
> > > >
> > > > > Don't you foresee a future scope type which may require cluster metadata ? In that case, isn't it better to forward the requests to the leader in the initial implementation ?
> > > >
> > > > I agree with Yash here: we can cross that bridge when we come to it, unless there are problems that we'd encounter then that would be better addressed by adding request forwarding now. One potential argument I can think of is that the UX would be a little strange if the semantics for this endpoint differ depending on the value of the scope parameter (some values would be affected by in-progress rebalances, and some would not), but I don't know if making scope=cluster more brittle in the name of consistency with, e.g., scope=connectorType:foo is really a worthwhile tradeoff. Thoughts?
> > > >
> > > > > I would also recommend an additional system test for Standalone herder to ensure that the new scope parameter is honored and the response contains the last modified time.
> > > >
> > > > Ah, great call! I love the new testing plan section. I also share Yash's concerns about adding a new system test though (at this point, they're so painful to write, test, and debug that in most circumstances I consider them a last resort). Do you think it'd be reasonable to add end-to-end verification for this with an integration test instead?
> > > >
> > > > Yash:
> > > >
> > > > > From the proposed changes section, it isn't very clear to me how we'll be tracking this last modified timestamp to be returned in responses for the *GET /admin/loggers* and *GET /admin/loggers/{logger}* endpoints. Could you please elaborate on this? Also, will we track the last modified timestamp even for worker scoped modifications where we won't write any records to the config topic and the requests will essentially be processed synchronously?
> > > >
> > > > RE timestamp tracking: I was thinking we'd store the timestamp for each level in-memory and, whenever we change the level for a namespace, update its timestamp to the current wallclock time.
> > > > Concretely, this means we'd want the timestamp for some logger `logger` to be as soon as possible after the call to `logger.setLevel(level)` for some level `level`. I'm honestly unsure how to clarify this further in the KIP; is there anything in there that strikes you as particularly ambiguous that we can tweak to be more clear?
> > > >
> > > > RE scope distinction for timestamps: I've updated the KIP to clarify this point, adding this sentence: "Timestamps will be updated regardless of whether the namespace update was applied using scope=worker or scope=cluster." Let me know what you think.
> > > >
> > > > > In the current synchronous implementation for the *PUT /admin/loggers/{logger}* endpoint, we return a 404 error if the level is invalid (i.e. not one of TRACE, DEBUG, WARN etc.). Since the new cluster scoped variant of the endpoint will be asynchronous, can we also add a validation to synchronously surface erroneous log levels to users?
> > > >
> > > > Good call! I think we don't have to be explicit about this in the proposed changes section, but it's a great fit for the testing plan, where I've added it: "Ensure that cluster-scoped requests with invalid logging levels are rejected with a 404 response".
> > > >
> > > > > I'm curious to know what the rationale here is? In KIP-745, the stated reasoning behind ignoring restart requests during worker startup was that the worker will anyway be starting all connectors and tasks assigned to it, so the restart request is essentially meaningless. With the log level API however, wouldn't it make more sense to apply any cluster scoped modifications to new workers in the cluster too? This also applies to any workers that are restarted soon after a request is made to *PUT /admin/loggers/{logger}?scope=cluster* on another worker. Maybe we could stack up all the cluster scoped log level modification requests during the config topic read at worker startup and apply the latest ones per namespace (assuming older records haven't already been compacted) after we've finished reading to the end of the config topic?
> > > >
> > > > It's definitely possible to implement persistent cluster-wide logging level adjustments, with the possible exception that none will be applied until the worker has finished reading to the end of the config topic. The rationale for keeping these updates ephemeral is to continue to give priority to workers' Log4j configuration files, with the underlying philosophy that this endpoint is still only intended for debugging purposes, as opposed to cluster-wide configuration. Permanent changes can already be made by tweaking the Log4j file for a worker and then restarting it. If a restart is too expensive for a permanent change, then the change can be applied immediately via the REST API, and staged via the Log4j configuration file (which will then be used the next time the worker is restarted, whenever that happens).
> > > >
> > > > We could potentially explore persistent cluster-wide logging adjustments in the future (or, if people feel strongly, right now), but it'd be more complex.
> > > > How would we handle the case where a user has updated their Log4j configuration file but continues to see the same logging level for a namespace, in a way that greatly reduces or even eliminates the odds of headaches or footguns? How would we allow users to reset or undo cluster-wide configuration updates, in order to revert back to whatever's specified in their Log4j configuration file? Would we want to give users control over whether a dynamic update is persistent or ephemeral? What would the implications be for finer-grained scoping (especially with regards to resetting dynamic changes)? Overall it doesn't seem worth the work to tackle all of this right now, as long as we're not painting ourselves into a corner. But as always, happy to hear what you and others think!
> > > >
> > > > Federico:
> > > >
> > > > > Due to the idempotent nature of PUT, I guess that the last_modified timestamp won't change if the same request is repeated successively. Should we add a unit test for that?
> > > >
> > > > Great call! I've added this to the testing plan with unit and end-to-end tests (seems fairly cheap to do both; LMK if e2e feels like overkill though).
> > > >
> > > > Thanks again for all of your thoughts!
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Mon, Sep 4, 2023 at 2:53 AM Federico Valeri <fedeval...@gmail.com> wrote:
> > > >
> > > > > Hi Chris, thanks. This looks like a useful feature.
> > > > >
> > > > > Due to the idempotent nature of PUT, I guess that the last_modified timestamp won't change if the same request is repeated successively. Should we add a unit test for that?
> > > > >
> > > > > On Mon, Sep 4, 2023 at 6:17 AM Ashwin <apan...@confluent.io.invalid> wrote:
> > > > >
> > > > > > Hi Chris,
> > > > > >
> > > > > > Thanks for thinking about this useful feature ! I had a question regarding
> > > > > >
> > > > > > > Since cluster metadata is not required to handle these types of request, they will not be forwarded to the leader
> > > > > >
> > > > > > And later, we also mention about supporting more scope types in the future. Don't you foresee a future scope type which may require cluster metadata ? In that case, isn't it better to forward the requests to the leader in the initial implementation ?
> > > > > >
> > > > > > I would also recommend an additional system test for Standalone herder to ensure that the new scope parameter is honored and the response contains the last modified time.
> > > > > >
> > > > > > Thanks,
> > > > > > Ashwin
> > > > > >
> > > > > > On Sat, Sep 2, 2023 at 5:12 AM Chris Egerton <chr...@aiven.io.invalid> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > Can't imagine a worse time to publish a new KIP (it's late on a Friday and we're in the middle of the 3.6.0 release), but I wanted to put forth KIP-976 for discussion:
> > > > > > >
> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-976%3A+Cluster-wide+dynamic+log+adjustment+for+Kafka+Connect
> > > > > > >
> > > > > > > TL;DR: The API to dynamically adjust log levels at runtime with Connect is great, and I'd like to augment it with support to adjust log levels for every worker in the cluster (instead of just the worker that receives the REST request).
> > > > > > >
> > > > > > > I look forward to everyone's thoughts, but expect that this will probably take a bump or two once the dust has settled on 3.6.0. Huge thanks to everyone that's contributed to that release so far, especially our release manager Satish!
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Chris