Hi Guozhang, KIP looks great, I have one suggestion: in addition to "restore-records-total", it would also be useful to track the number of records *remaining*, that is, the records that have not yet been restored. This is actually the metric I was attempting to implement in the StateRestoreListener that bumped me up against KAFKA-10575 :-)
With both a "restore-records-total" and a "restore-remaining-total" (or similar) metric, it's possible to derive useful information like the estimated time remaining for restoration (by dividing the remaining total by the restoration rate). Regards, Nick On Mon, 19 Sept 2022 at 19:57, Guozhang Wang <wangg...@gmail.com> wrote: > Hello Bruno, > > Thanks for your comments! > > 1. Regarding the metrics group name: originally I put > "stream-state-metrics" as it's related to state store restorations, but > after a second thought I think I agree with you that this is quite > confusing and not right. About the metrics groups, right now I have two > ideas: > > a) Still use the metric group name "stream-thread-metrics", but > differentiate with the processing threads on the thread id. The pros is > that we do not introduce a new group, the cons is that users who want to > separate processing from restoration/updating in the future needs to do > that on the thread id labels. > b) Introduce a new group name, for example "stream-state-updater-metrics" > and still have a thread-id label. We would be introducing a new group which > still have a thread-id, which may sound a bit weird (maybe if we do that we > could change the existing "stream-thread-metrics" into > "stream-processor-metrics"). > > Right now I'm leaning towards b) and maybe in the future rename > "thread-metrics" to "processor-metrics", LMK what do you think. > > 2. Regarding the metric names: today we may also pause a standby tasks, and > hence I'm trying to differentiate updating standbys from paused standbys. > Right now I'm thinking we can change "restoring-standby-tasks" to > "updating-standby-tasks", and change all other metrics' "restore" (except > the "restoring-active-tasks") to "state-update", a.k.a > "state-update-ratio", "state-update-records-total", > "updating-standby-tasks" etc. > > 3. Regarding the function name: yeah I think that's a valid concern, I can > change it to "onRestoreSuspended" since "Aborted" may confuse people that > previously called "onBatchRestored" are undone as part of the abortion. > > > Guozhang > > > > On Mon, Sep 19, 2022 at 10:47 AM Bruno Cadonna <cado...@apache.org> wrote: > > > Hi Guozhang, > > > > Thanks for the KIP! I think this KIP is a really nice addition to better > > understand what is going on in a Kafka Streams application. > > > > 1. > > The metric names "paused-active-tasks" and "paused-standby-tasks" might > > be a bit confusing since at least active tasks can be paused also > > outside of restoration. > > > > 2. > > Why is the type of the metrics "stream-state-metrics"? I would have > > expected "stream-thread-metrics" as the type. > > > > 3. > > Isn't the value of the metric "restoring-standby-tasks" simply the > > number of standby tasks since standby tasks are basically always > > updating (aka restoring)? > > > > 4. > > "idle-ratio", "restore-ratio", and "checkpoint-ratio" seem metrics > > tailored to the upcoming state updater. They do not make much sense with > > a stream thread. Would it be better to introduce a new metrics level > > specifically for the state updater? > > > > 5. > > Personally, I do not like to use the word "restoration" together with > > standbys since restoration somehow implies that there is an offset for > > which the active task is considered restored and active processing can > > start. In other words, restoration is finite. Standby tasks rather > > update continuously their states. They can be up-to-date or lagging. I > > see that you could say "restored" instead of "up-to-date" and "not > > restored" instead of "lagging", but IMO it does not describe well the > > situation. That is a rather minor point. I just wanted to mention it. > > > > 6. > > The name "onRestorePaused()" might be confusing since in Kafka Streams > > users can also pause tasks. What about "onRestoreAborted()" or > > "onRestoreSuspended"? > > > > Best, > > Bruno > > > > > > On 16.09.22 19:33, Guozhang Wang wrote: > > > Hello everyone, > > > > > > I'd like to start a discussion for the following KIP, aiming to improve > > > Kafka Stream's restoration visibility via new metrics and callback > > methods: > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility > > > > > > > > > Thanks! > > > -- Guozhang > > > > > > > > -- > -- Guozhang >