Hi Sagar,

Thanks for your thoughts! Responses inline:

> 1) Considering that you have now clarified here that last_modified field would be stored in-memory, it is not mentioned in the KIP. Although that's something that's understandable, it wasn't apparent when reading the KIP. Probably, we could mention it? Also, what happens if a worker restarts? In that case, since the log level update message would be pre-dating the startup of the worker, it would be ignored? We should probably mention that behaviour as well IMO.

I've tweaked the second paragraph of the "Last modified timestamp" section to try to clarify this without getting too verbose: "Modification times will be tracked in-memory and determined by when they are applied by the worker, as opposed to when they are requested by the user or persisted to the config topic (details below). If no modifications to the namespace have been made since the worker finished startup, they will be null."

Does that feel sufficient? It's a little indirect on the front of worker restart behavior, so let me know if that especially should be fleshed out a bit more (possibly by calling it out in the "Config topic records" section).
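To make the in-memory side of this a bit more concrete, here's roughly the shape of tracking I have in mind, using the Log4j API for illustration (just a sketch; the class and field names are made up and not part of the KIP):

    // Just a sketch with made-up names (not the actual implementation):
    // the timestamp is recorded when this worker applies the change, and
    // nothing here is persisted across a restart.
    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class LoggerLevelTracker {

        // Namespace -> time this worker last applied a level change.
        // No entry means "not modified since startup", which the REST
        // layer would surface as a null last_modified value.
        private final Map<String, Long> lastModifiedTimes = new ConcurrentHashMap<>();

        public void setLevel(String namespace, String levelName) {
            Logger logger = Logger.getLogger(namespace);
            logger.setLevel(Level.toLevel(levelName));
            // Recorded after the level is applied, regardless of whether the
            // request used scope=worker or scope=cluster
            lastModifiedTimes.put(namespace, System.currentTimeMillis());
        }

        public Long lastModified(String namespace) {
            return lastModifiedTimes.get(namespace);
        }
    }

Since the map only lives in memory, a restarted worker starts from a clean slate, which is why the KIP describes null as the post-startup value.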
> 2) Staying on the last_modified field, its utility is also not too clear to me. Can you add some examples of how it can be useful for debugging etc?

The cluster-wide API relaxes the consistency guarantees of the existing worker-local API. With the latter, users can be certain that once they receive a 2xx response, the logging level on that worker has been updated. With the former, users know that the logging level will eventually be updated, but insight into the propagation of that update across the cluster is limited. Although it's a little primitive, I'm hoping that the last modified timestamp will be enough to help shed some light on this process. We could also explore exposing the provenance of logging levels (which maps fairly cleanly to the scope of the request from which they originated), but that feels like overkill at the moment.
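To give one concrete debugging example: after issuing a cluster-scoped update, an operator could poll GET /admin/loggers on each worker and compare the last_modified values they report to see how far the update has propagated. A rough sketch of that kind of check (the worker URLs are made up, and the response format is just whatever ends up in the KIP):

    // Rough sketch: after PUT /admin/loggers/{logger}?scope=cluster, poll
    // every worker and compare the last_modified values each one reports.
    // The worker URLs below are made up.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;

    public class LoggerPropagationCheck {
        public static void main(String[] args) throws Exception {
            List<String> workers = List.of(
                    "http://connect-1:8083", "http://connect-2:8083", "http://connect-3:8083");
            HttpClient client = HttpClient.newHttpClient();
            for (String worker : workers) {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create(worker + "/admin/loggers"))
                        .GET()
                        .build();
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                // A worker that hasn't applied the cluster-scoped update yet will
                // still report the older level, with an older (or null) last_modified
                System.out.println(worker + " -> " + response.body());
            }
        }
    }

Nothing fancy, but it should be enough to tell which workers have picked up the change and which haven't.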
> 3) In the test plan, should we also add a test when the scope parameter passed is non-null and neither worker nor cluster? In this case the behaviour should be similar to the default case.

Good call! Done.

> 4) I had the same question as Yash regarding persistent cluster-wide logging level. I think you have explained it well and we can skip it for now.

Thanks, I'm glad that this seems reasonable. I've tentatively added this to the rejected alternatives section, but am happy to continue the conversation if anyone wants to explore that possibility further.

Cheers,

Chris

On Tue, Sep 5, 2023 at 11:48 AM Sagar <sagarmeansoc...@gmail.com> wrote:

> Hey Chris,
>
> Thanks for the KIP. Seems like a useful feature. I have some more questions/comments:
>
> 1) Considering that you have now clarified here that last_modified field would be stored in-memory, it is not mentioned in the KIP. Although that's something that's understandable, it wasn't apparent when reading the KIP. Probably, we could mention it? Also, what happens if a worker restarts? In that case, since the log level update message would be pre-dating the startup of the worker, it would be ignored? We should probably mention that behaviour as well IMO.
>
> 2) Staying on the last_modified field, its utility is also not too clear to me. Can you add some examples of how it can be useful for debugging etc?
>
> 3) In the test plan, should we also add a test when the scope parameter passed is non-null and neither worker nor cluster? In this case the behaviour should be similar to the default case.
>
> 4) I had the same question as Yash regarding persistent cluster-wide logging level. I think you have explained it well and we can skip it for now.
>
> Thanks!
> Sagar.
>
> On Tue, Sep 5, 2023 at 8:49 PM Chris Egerton <chr...@aiven.io.invalid> wrote:
>
> > Hi all,
> >
> > Thank you so much for the generous review comments! Happy to see interest in this feature. Inline responses follow.
> >
> > Ashwin:
> >
> > > Don't you foresee a future scope type which may require cluster metadata? In that case, isn't it better to forward the requests to the leader in the initial implementation?
> >
> > I agree with Yash here: we can cross that bridge when we come to it, unless there are problems that we'd encounter then that would be better addressed by adding request forwarding now. One potential argument I can think of is that the UX would be a little strange if the semantics for this endpoint differ depending on the value of the scope parameter (some values would be affected by in-progress rebalances, and some would not), but I don't know if making scope=cluster more brittle in the name of consistency with, e.g., scope=connectorType:foo is really a worthwhile tradeoff. Thoughts?
> >
> > > I would also recommend an additional system test for Standalone herder to ensure that the new scope parameter is honored and the response contains the last modified time.
> >
> > Ah, great call! I love the new testing plan section. I also share Yash's concerns about adding a new system test though (at this point, they're so painful to write, test, and debug that in most circumstances I consider them a last resort). Do you think it'd be reasonable to add end-to-end verification for this with an integration test instead?
> >
> > Yash:
> >
> > > From the proposed changes section, it isn't very clear to me how we'll be tracking this last modified timestamp to be returned in responses for the GET /admin/loggers and GET /admin/loggers/{logger} endpoints. Could you please elaborate on this? Also, will we track the last modified timestamp even for worker scoped modifications where we won't write any records to the config topic and the requests will essentially be processed synchronously?
> >
> > RE timestamp tracking: I was thinking we'd store the timestamp for each level in-memory and, whenever we change the level for a namespace, update its timestamp to the current wallclock time. Concretely, this means we'd want the timestamp for some logger `logger` to be as soon as possible after the call to `logger.setLevel(level)` for some level `level`. I'm honestly unsure how to clarify this further in the KIP; is there anything in there that strikes you as particularly ambiguous that we can tweak to be more clear?
> >
> > RE scope distinction for timestamps: I've updated the KIP to clarify this point, adding this sentence: "Timestamps will be updated regardless of whether the namespace update was applied using scope=worker or scope=cluster." Let me know what you think.
> >
> > > In the current synchronous implementation for the PUT /admin/loggers/{logger} endpoint, we return a 404 error if the level is invalid (i.e. not one of TRACE, DEBUG, WARN etc.).
> > > Since the new cluster scoped variant of the endpoint will be asynchronous, can we also add a validation to synchronously surface erroneous log levels to users?
> >
> > Good call! I think we don't have to be explicit about this in the proposed changes section, but it's a great fit for the testing plan, where I've added it: "Ensure that cluster-scoped requests with invalid logging levels are rejected with a 404 response"
> >
> > > I'm curious to know what the rationale here is? In KIP-745, the stated reasoning behind ignoring restart requests during worker startup was that the worker will anyway be starting all connectors and tasks assigned to it so the restart request is essentially meaningless. With the log level API however, wouldn't it make more sense to apply any cluster scoped modifications to new workers in the cluster too? This also applies to any workers that are restarted soon after a request is made to PUT /admin/loggers/{logger}?scope=cluster on another worker. Maybe we could stack up all the cluster scoped log level modification requests during the config topic read at worker startup and apply the latest ones per namespace (assuming older records haven't already been compacted) after we've finished reading to the end of the config topic?
> >
> > It's definitely possible to implement persistent cluster-wide logging level adjustments, with the possible exception that none will be applied until the worker has finished reading to the end of the config topic. The rationale for keeping these updates ephemeral is to continue to give priority to workers' Log4j configuration files, with the underlying philosophy that this endpoint is still only intended for debugging purposes, as opposed to cluster-wide configuration. Permanent changes can already be made by tweaking the Log4j file for a worker and then restarting it. If a restart is too expensive for a permanent change, then the change can be applied immediately via the REST API, and staged via the Log4j configuration file (which will then be used the next time the worker is restarted, whenever that happens).
> >
> > We could potentially explore persistent cluster-wide logging adjustments in the future (or, if people feel strongly, right now), but it'd be more complex. How would we handle the case where a user has updated their Log4j configuration file but continues to see the same logging level for a namespace, in a way that greatly reduces or even eliminates the odds of headaches or footguns? How would we allow users to reset or undo cluster-wide configuration updates, in order to revert back to whatever's specified in their Log4j configuration file? Would we want to give users control over whether a dynamic update is persistent or ephemeral? What would the implications be for finer-grained scoping (especially with regards to resetting dynamic changes)? Overall it doesn't seem worth the work to tackle all of this right now, as long as we're not painting ourselves into a corner. But as always, happy to hear what you and others think!
> >
> > Federico:
> >
> > > Due to the idempotent nature of PUT, I guess that the last_modified timestamp won't change if the same request is repeated successively. Should we add a unit test for that?
> >
> > Great call!
> > I've added this to the testing plan with unit and end-to-end tests (seems fairly cheap to do both; LMK if e2e feels like overkill though).
> >
> > Thanks again for all of your thoughts!
> >
> > Cheers,
> >
> > Chris
> >
> > On Mon, Sep 4, 2023 at 2:53 AM Federico Valeri <fedeval...@gmail.com> wrote:
> >
> > > Hi Chris, thanks. This looks like a useful feature.
> > >
> > > Due to the idempotent nature of PUT, I guess that the last_modified timestamp won't change if the same request is repeated successively. Should we add a unit test for that?
> > >
> > > On Mon, Sep 4, 2023 at 6:17 AM Ashwin <apan...@confluent.io.invalid> wrote:
> > >
> > > > Hi Chris,
> > > >
> > > > Thanks for thinking about this useful feature!
> > > > I had a question regarding
> > > >
> > > > > Since cluster metadata is not required to handle these types of request, they will not be forwarded to the leader
> > > >
> > > > And later, we also mention about supporting more scope types in the future. Don't you foresee a future scope type which may require cluster metadata? In that case, isn't it better to forward the requests to the leader in the initial implementation?
> > > >
> > > > I would also recommend an additional system test for Standalone herder to ensure that the new scope parameter is honored and the response contains the last modified time.
> > > >
> > > > Thanks,
> > > > Ashwin
> > > >
> > > > On Sat, Sep 2, 2023 at 5:12 AM Chris Egerton <chr...@aiven.io.invalid> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Can't imagine a worse time to publish a new KIP (it's late on a Friday and we're in the middle of the 3.6.0 release), but I wanted to put forth KIP-976 for discussion:
> > > > >
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-976%3A+Cluster-wide+dynamic+log+adjustment+for+Kafka+Connect
> > > > >
> > > > > TL;DR: The API to dynamically adjust log levels at runtime with Connect is great, and I'd like to augment it with support to adjust log levels for every worker in the cluster (instead of just the worker that receives the REST request).
> > > > >
> > > > > I look forward to everyone's thoughts, but expect that this will probably take a bump or two once the dust has settled on 3.6.0. Huge thanks to everyone that's contributed to that release so far, especially our release manager Satish!
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Chris