Re: [DISCUSS] KIP-818: Introduce cache-size-bytes-total Task Level Metric

Guozhang Wang Thu, 10 Feb 2022 22:57:12 -0800

Thanks for the updates Sagar!

On Thu, Feb 10, 2022 at 5:59 PM Sagar <sagarmeansoc...@gmail.com> wrote:


> Hi All,
>
> As discussed above, this KIP would be discarded and the new metric proposed
> here would be added to KIP-770 as the need to add a new metric was
> discovered when working on it.
>
> Thanks!
> Sagar.
>
> On Thu, Feb 10, 2022 at 9:54 AM Sagar <sagarmeansoc...@gmail.com> wrote:
>
> > Hi Guozhang,
> >
> > Sure. I will add it to the KIP.
> >
> > Thanks!
> > Sagar.
> >
> > On Mon, Feb 7, 2022 at 6:22 AM Guozhang Wang <wangg...@gmail.com> wrote:
> >
> >> Since the PR is reopened and we are going to re-merged the fixed PRs,
> what
> >> about just adding that as part of the KIP as the addendum?
> >>
> >> On Fri, Feb 4, 2022 at 2:13 AM Sagar <sagarmeansoc...@gmail.com> wrote:
> >>
> >> > Thanks Sophie/Guozhang.
> >> >
> >> > Yeah I could have amended the KIP but it slipped my mind when Guozhang
> >> > proposed this in the PR. Later on, the PR was merged and KIP was
> marked
> >> as
> >> > adopted so I thought I will write a new one. I know the PR had been
> >> > reopened now :p . I dont have much preference on a new KIP v/s the
> >> original
> >> > one so anything is ok with me as well.
> >> >
> >> > I agree with the INFO part. I will make that change.
> >> >
> >> > Regarding task level, from my understanding, since every task's
> >> > buffer/cache size might be different so if a certain task might be
> >> > overshooting the limits then the task level metric might help people
> to
> >> > infer this. Also, thanks for the explanation Guozhang on why this
> >> should be
> >> > a task level metric. What are your thoughts on this @Sophie?
> >> >
> >> > Thanks!
> >> > Sagar.
> >> >
> >> >
> >> > On Fri, Feb 4, 2022 at 4:47 AM Guozhang Wang <wangg...@gmail.com>
> >> wrote:
> >> >
> >> > > Thanks Sagar for proposing the KIP, and Sophie for sharing your
> >> thoughts.
> >> > > Here're my 2c:
> >> > >
> >> > > I think I agree with Sophie for making the two metrics (both the
> added
> >> > and
> >> > > the newly proposed) on INFO level since we are always calculating
> them
> >> > > anyways. Regarding the level of the cache-size though, I'm thinking
> a
> >> bit
> >> > > different with you two: today we do not actually keep that caches on
> >> the
> >> > > per-store level, but rather on the per-thread level, i.e. when the
> >> cache
> >> > is
> >> > > full we would flush not only on the triggering state store but also
> >> > > potentially on other state stores as well of the task that thread
> >> owns.
> >> > > This mechanism, in hindsight, is a bit weird and we have some
> >> discussions
> >> > > about refactoring that in the future already. Personally I'd like to
> >> make
> >> > > this new metric to be aligned with whatever our future design will
> be.
> >> > >
> >> > > In the long run if we would not have a static assignment from tasks
> to
> >> > > threads, it may not make sense to keep a dedicated cache pool per
> >> thread.
> >> > > Instead all tasks will be dynamically sharing the globally
> configured
> >> max
> >> > > cache size (dynamically here means, we would not just divide the
> total
> >> > size
> >> > > by the num.tasks and then assign that to each task), and when a
> cache
> >> put
> >> > > triggers the flushing because the sum now exceeds the global
> >> configured
> >> > > value, we would potentially flush all the cached records for that
> >> task.
> >> > If
> >> > > this is the end stage, then I think keeping this metric at the task
> >> level
> >> > > is good.
> >> > >
> >> > >
> >> > >
> >> > > Guozhang
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > On Thu, Feb 3, 2022 at 10:15 AM Sophie Blee-Goldman
> >> > > <sop...@confluent.io.invalid> wrote:
> >> > >
> >> > > > Hey Sagar,  thanks for the KIP!
> >> > > >
> >> > > > And yes, all metrics are considered part of the public API and
> thus
> >> > > require
> >> > > > a KIP to add (or modify, etc...) Although in this particular case,
> >> you
> >> > > > could probably make a good case for just considering it as an
> >> update to
> >> > > the
> >> > > > original KIP which added the analogous metric
> >> > `input-buffer-bytes-total`.
> >> > > > For  things like this that weren't considered during the KIP
> >> proposal
> >> > but
> >> > > > came up during the implementation or review, and are small changes
> >> that
> >> > > > would have made sense to include in that KIP had they been thought
> >> of,
> >> > > you
> >> > > > can just send an update to the existing KIP's discussion and.or
> >> voting
> >> > > > thread that explains what you want to add or modify and maybe a
> >> brief
> >> > > > description why.
> >> > > >
> >> > > > It's always ok to make a new KIP when in doubt, but there are some
> >> > cases
> >> > > > where an update email is sufficient. If there are any concerns or
> >> > > > suggestions that significantly expand the scope of the update, you
> >> can
> >> > > > always go create a new KIP and move the discussion there.
> >> > > >
> >> > > > I'd say you can feel free to proceed in whichever way you'd prefer
> >> for
> >> > > this
> >> > > > new proposal -- it just needs to appear in some KIP somewhere, and
> >> have
> >> > > > given the community thew opportunity to discuss and provide
> feedback
> >> > on.
> >> > > >
> >> > > > On that note, I do have two suggestions:
> >> > > >
> >> > > > 1)  since we need to measure the size of the cache (and the input
> >> > > buffer(s)
> >> > > > anyways, we may as well make `cache-size-bytes-total` -- and also
> >> the
> >> > new
> >> > > > input-buffer-bytes-total -- an INFO level metric. In general the
> >> more
> >> > > > metrics the merrier, the only real reason for disabling some are
> if
> >> > they
> >> > > > have a performance impact or other cost that not everyone will
> want
> >> to
> >> > > pay.
> >> > > > In this case we're already computing the value of these metrics,
> so
> >> why
> >> > > not
> >> > > > expose it to the user as an INFO metric
> >> > > > 2) I think it would be both more natural and easier to implement
> if
> >> > this
> >> > > > was a store-level metric. A single task could in theory contain
> >> > multiple
> >> > > > physical state store caches and we would have to roll them up to
> >> report
> >> > > the
> >> > > > size for the task as a whole. It's additional work just to lose
> some
> >> > > > information that the user may want to have
> >> > > >
> >> > > > Let me know if anything here doesn't make sense or needs
> >> clarification.
> >> > > And
> >> > > > thanks for the quick followup to get this 2nd metric!
> >> > > >
> >> > > > -Sophie
> >> > > >
> >> > > > On Sat, Jan 29, 2022 at 4:27 AM Sagar <sagarmeansoc...@gmail.com>
> >> > wrote:
> >> > > >
> >> > > > > Hi All,
> >> > > > >
> >> > > > > I would like to open a discussion thread on the following KIP:
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=186878390
> >> > > > >
> >> > > > > PS: This is about introducing a new metric and I am assuming
> that
> >> it
> >> > > > > requires a KIP. If that isn't the case, I can close it.
> >> > > > >
> >> > > > > Thanks!
> >> > > > > Sagar.
> >> > > > >
> >> > > >
> >> > >
> >> > >
> >> > > --
> >> > > -- Guozhang
> >> > >
> >> >
> >>
> >>
> >> --
> >> -- Guozhang
> >>
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-818: Introduce cache-size-bytes-total Task Level Metric

Reply via email to