Hi Guozhang,

KIP looks great, I have one suggestion: in addition to
"restore-records-total", it would also be useful to track the number of
records *remaining*, that is, the records that have not yet been restored.
This is actually the metric I was attempting to implement in the
StateRestoreListener that bumped me up against KAFKA-10575 :-)

With both a "restore-records-total" and a "restore-remaining-total" (or
similar) metric, it's possible to derive useful information like the
estimated time remaining for restoration (by dividing the remaining total
by the restoration rate).

Regards,

Nick

On Mon, 19 Sept 2022 at 19:57, Guozhang Wang <wangg...@gmail.com> wrote:

> Hello Bruno,
>
> Thanks for your comments!
>
> 1. Regarding the metrics group name: originally I put
> "stream-state-metrics" as it's related to state store restorations, but
> after a second thought I think I agree with you that this is quite
> confusing and not right. About the metrics groups, right now I have two
> ideas:
>
> a) Still use the metric group name "stream-thread-metrics", but
> differentiate with the processing threads on the thread id. The pros is
> that we do not introduce a new group, the cons is that users who want to
> separate processing from restoration/updating in the future needs to do
> that on the thread id labels.
> b) Introduce a new group name, for example "stream-state-updater-metrics"
> and still have a thread-id label. We would be introducing a new group which
> still have a thread-id, which may sound a bit weird (maybe if we do that we
> could change the existing "stream-thread-metrics" into
> "stream-processor-metrics").
>
> Right now I'm leaning towards b) and maybe in the future rename
> "thread-metrics" to "processor-metrics", LMK what do you think.
>
> 2. Regarding the metric names: today we may also pause a standby tasks, and
> hence I'm trying to differentiate updating standbys from paused standbys.
> Right now I'm thinking we can change "restoring-standby-tasks" to
> "updating-standby-tasks", and change all other metrics' "restore" (except
> the "restoring-active-tasks") to "state-update", a.k.a
> "state-update-ratio", "state-update-records-total",
> "updating-standby-tasks" etc.
>
> 3. Regarding the function name: yeah I think that's a valid concern, I can
> change it to "onRestoreSuspended" since "Aborted" may confuse people that
> previously called "onBatchRestored" are undone as part of the abortion.
>
>
> Guozhang
>
>
>
> On Mon, Sep 19, 2022 at 10:47 AM Bruno Cadonna <cado...@apache.org> wrote:
>
> > Hi Guozhang,
> >
> > Thanks for the KIP! I think this KIP is a really nice addition to better
> > understand what is going on in a Kafka Streams application.
> >
> > 1.
> > The metric names "paused-active-tasks" and "paused-standby-tasks" might
> > be a bit confusing since at least active tasks can be paused also
> > outside of restoration.
> >
> > 2.
> > Why is the type of the metrics "stream-state-metrics"? I would have
> > expected "stream-thread-metrics" as the type.
> >
> > 3.
> > Isn't the value of the metric "restoring-standby-tasks" simply the
> > number of standby tasks since standby tasks are basically always
> > updating (aka restoring)?
> >
> > 4.
> > "idle-ratio", "restore-ratio", and "checkpoint-ratio" seem metrics
> > tailored to the upcoming state updater. They do not make much sense with
> > a stream thread. Would it be better to introduce a new metrics level
> > specifically for the state updater?
> >
> > 5.
> > Personally, I do not like to use the word "restoration" together with
> > standbys since restoration somehow implies that there is an offset for
> > which the active task is considered restored and active processing can
> > start. In other words, restoration is finite. Standby tasks rather
> > update continuously their states. They can be up-to-date or lagging. I
> > see that you could say "restored" instead of "up-to-date" and "not
> > restored" instead of "lagging", but IMO it does not describe well the
> > situation. That is a rather minor point. I just wanted to mention it.
> >
> > 6.
> > The name "onRestorePaused()" might be confusing since in Kafka Streams
> > users can also pause tasks. What about "onRestoreAborted()" or
> > "onRestoreSuspended"?
> >
> > Best,
> > Bruno
> >
> >
> > On 16.09.22 19:33, Guozhang Wang wrote:
> > > Hello everyone,
> > >
> > > I'd like to start a discussion for the following KIP, aiming to improve
> > > Kafka Stream's restoration visibility via new metrics and callback
> > methods:
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
> > >
> > >
> > > Thanks!
> > > -- Guozhang
> > >
> >
>
>
> --
> -- Guozhang
>

Reply via email to