Re: [DISCUSS] KIP-818: Introduce cache-size-bytes-total Task Level Metric

Sagar Fri, 04 Feb 2022 02:13:08 -0800

Thanks Sophie/Guozhang.

Yeah I could have amended the KIP but it slipped my mind when Guozhang
proposed this in the PR. Later on, the PR was merged and KIP was marked as
adopted so I thought I will write a new one. I know the PR had been
reopened now :p . I dont have much preference on a new KIP v/s the original
one so anything is ok with me as well.


I agree with the INFO part. I will make that change.

Regarding task level, from my understanding, since every task's
buffer/cache size might be different so if a certain task might be
overshooting the limits then the task level metric might help people to
infer this. Also, thanks for the explanation Guozhang on why this should be
a task level metric. What are your thoughts on this @Sophie?

Thanks!
Sagar.


On Fri, Feb 4, 2022 at 4:47 AM Guozhang Wang <wangg...@gmail.com> wrote:

> Thanks Sagar for proposing the KIP, and Sophie for sharing your thoughts.
> Here're my 2c:
>
> I think I agree with Sophie for making the two metrics (both the added and
> the newly proposed) on INFO level since we are always calculating them
> anyways. Regarding the level of the cache-size though, I'm thinking a bit
> different with you two: today we do not actually keep that caches on the
> per-store level, but rather on the per-thread level, i.e. when the cache is
> full we would flush not only on the triggering state store but also
> potentially on other state stores as well of the task that thread owns.
> This mechanism, in hindsight, is a bit weird and we have some discussions
> about refactoring that in the future already. Personally I'd like to make
> this new metric to be aligned with whatever our future design will be.
>
> In the long run if we would not have a static assignment from tasks to
> threads, it may not make sense to keep a dedicated cache pool per thread.
> Instead all tasks will be dynamically sharing the globally configured max
> cache size (dynamically here means, we would not just divide the total size
> by the num.tasks and then assign that to each task), and when a cache put
> triggers the flushing because the sum now exceeds the global configured
> value, we would potentially flush all the cached records for that task. If
> this is the end stage, then I think keeping this metric at the task level
> is good.
>
>
>
> Guozhang
>
>
>
>
> On Thu, Feb 3, 2022 at 10:15 AM Sophie Blee-Goldman
> <sop...@confluent.io.invalid> wrote:
>
> > Hey Sagar,  thanks for the KIP!
> >
> > And yes, all metrics are considered part of the public API and thus
> require
> > a KIP to add (or modify, etc...) Although in this particular case, you
> > could probably make a good case for just considering it as an update to
> the
> > original KIP which added the analogous metric `input-buffer-bytes-total`.
> > For  things like this that weren't considered during the KIP proposal but
> > came up during the implementation or review, and are small changes that
> > would have made sense to include in that KIP had they been thought of,
> you
> > can just send an update to the existing KIP's discussion and.or voting
> > thread that explains what you want to add or modify and maybe a brief
> > description why.
> >
> > It's always ok to make a new KIP when in doubt, but there are some cases
> > where an update email is sufficient. If there are any concerns or
> > suggestions that significantly expand the scope of the update, you can
> > always go create a new KIP and move the discussion there.
> >
> > I'd say you can feel free to proceed in whichever way you'd prefer for
> this
> > new proposal -- it just needs to appear in some KIP somewhere, and have
> > given the community thew opportunity to discuss and provide feedback on.
> >
> > On that note, I do have two suggestions:
> >
> > 1)  since we need to measure the size of the cache (and the input
> buffer(s)
> > anyways, we may as well make `cache-size-bytes-total` -- and also the new
> > input-buffer-bytes-total -- an INFO level metric. In general the more
> > metrics the merrier, the only real reason for disabling some are if they
> > have a performance impact or other cost that not everyone will want to
> pay.
> > In this case we're already computing the value of these metrics, so why
> not
> > expose it to the user as an INFO metric
> > 2) I think it would be both more natural and easier to implement if this
> > was a store-level metric. A single task could in theory contain multiple
> > physical state store caches and we would have to roll them up to report
> the
> > size for the task as a whole. It's additional work just to lose some
> > information that the user may want to have
> >
> > Let me know if anything here doesn't make sense or needs clarification.
> And
> > thanks for the quick followup to get this 2nd metric!
> >
> > -Sophie
> >
> > On Sat, Jan 29, 2022 at 4:27 AM Sagar <sagarmeansoc...@gmail.com> wrote:
> >
> > > Hi All,
> > >
> > > I would like to open a discussion thread on the following KIP:
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=186878390
> > >
> > > PS: This is about introducing a new metric and I am assuming that it
> > > requires a KIP. If that isn't the case, I can close it.
> > >
> > > Thanks!
> > > Sagar.
> > >
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-818: Introduce cache-size-bytes-total Task Level Metric

Reply via email to