Re: [DISCUSS] KIP-963: Upload and delete lag metrics in Tiered Storage

Jorge Esteban Quilcate Otoya Wed, 01 Nov 2023 06:37:11 -0700

Thanks, Christo!

1. Agree. Having a further look into how many latency metrics are included
on the broker side there are only a few of them (e.g. request lifecycle) —
but seems mostly delegated to clients, or plugin in this case, to measure
this.


3.2. Personally, I find the record-based lag less useful as records can't
be relied as a stable unit of measure. So, if we can keep bytes- and
segment-based lag, LGTM.
3.4.  Agree, these metrics should be on the broker side. Though if plugin
decides to take deletion as a background process, then it should have it's
own metrics. That's why I was thinking the calculation should be fairly
similar to the CopyLag: "these segments are available for deletion but
haven't been deleted yet"
3.5. For lag metrics: could we add an explanation on how each lag will be
calculated, e.g. using which values, from which components, under which
circumstances do we expect these values to increase/decrease, etc. This
will clarify 3.4. and make it easier to agree and eventually test.

4. Sorry I wasn't clear. I meant similar to `RemoteCopyBytesPerSec` and
`RemoteFetchBytesPerSec`, we could consider to include
`RemoteDeleteBytesPerSec`.

5. and 6. Thanks for the explanation! It surely benefits to have these as
part of the set of metrics.

Cheers,
Jorge.

On Mon, 30 Oct 2023 at 16:07, Christo Lolov <christolo...@gmail.com> wrote:

> Heya Jorge,
>
> Thank you for the insightful comments!
>
> 1. I see a value in such latency metrics but in my opinion the correct
> location for such metrics is in the plugins providing the underlying
> functionality. What are your thoughts on the matter?
>
> 2. Okay, I will look for and adjust the formatting today/tomorrow!
>
> 3.1 Done.
> 3.2 Sure, I will add this to the KIP later today, the suggestion makes
> sense to me. However, my question is, would you still find value in
> emitting metrics for all three i.e. RemoteCopyLagRecords,
> RemoteCopyLagBytes and RemoteCopyLagSegments or would you only keep
> RemoteCopyLagBytes and RemoteCopyLagSegments?
> 3.3. Yes, RemoteDeleteLagRecords was supposed to be an equivalent of
> RemoteCopyLagRecords. Once I have your opinion on 3.2 I will make the
> respective changes.
> 3.4. I envision these metrics to be added to Kafka rather than the plugins.
> Today Kafka sends deletes to remote storage but does not know whether those
> segments have been deleted immediately when the request has been sent or
> have been given to a background process to carry out the actual reclamation
> of space. The purpose of this metric is to give an estimate in time which
> says "hey, we have called this many segments or bytes to be deleted".
>
> 4. I believe this goes down the same line of thinking as what you mentioned
> in 3.3 - have I misunderstood something?
>
> 5. I have on a number of occasions found I do not have a metric to quickly
> point me to what part of tiered storage functionality is experiencing an
> issue, in some scenarios a follower failing to build an auxiliary state. An
> increase in the number of BuildRemoteLogAuxState requests per second can
> uncover problems for specific topics warranting a further investigation,
> which I tend to find difficult to judge purely based on parsing log
> statements. An increase in the number of errors can quickly zone in on
> followers failing as part of tiered storage and point me to look in the
> logs specifically for that component.
>
> 6. I find it useful start my investigations with respect to tiering
> problems by checking the rough size distribution of topics in remote. From
> then on I try to correlate whether a historically high-volume topic started
> experiencing a decrease in volume due to a decrease in produce traffic to
> that topic or due to an increase in lag on local storage due to the broker
> slowing down for whatever reason. Besides correlation I would use such a
> metric to also confirm whether my rate calculations are correct i.e. if
> topic A receives X MB/s and rolls a segment every Y seconds with an upload
> rate of Z MB/s do I see that much data actually being written in remote
> storage. Do these two scenarios demonstrate the usefulness I would have
> from such a metric and do the benefits make sense to you?
>
> 7. I agree. I have changed TotalRemoteLogSizeComputationTime,
> TotalRemoteLogSizeBytes, and TotalRemoteLogMetadataCount to
> RemoteLogSizeComputationTime, RemoteLogSizeBytes and RemoteLogMetadataCount
> respectively.
>
> On Fri, 27 Oct 2023 at 15:24, Jorge Esteban Quilcate Otoya <
> quilcate.jo...@gmail.com> wrote:
>
> > Hi Christo,
> >
> > Thanks for proposing KIP, this metrics will certainly be useful to
> operate
> > Kafka Tiered Storage as it becomes production-ready.
> >
> > 1. Given that the scope of the KIPs has grown to cover more metrics, what
> > do you think about introducing latency metrics for RSM operations?
> > Copy and delete time metrics are quite obvious/simple on what they
> > represent; but fetch latency metrics would be helpful as remote fetching
> > clients directly. e.g. having a "time to first byte" metric could help to
> > know how much time is introduced by the remote tier to start serving
> > results to the consumer, or measuring how long it takes to return a
> > response to consumers.
> >
> > 2. Couple of requests/nits on the metrics table, could you:
> > - highlight the names (or have them on a separate column, as you prefer)
> to
> > make it easier to read? If you choose to have another column, maybe sort
> > them as "Name, Description, MBean" and adjust the width.
> > - group the related metrics in separate groups, e.g. Lag, Remote Delete,
> > Remote Log Aux State, Remote Log Size; so we can elaborate on why these
> set
> > of metrics are needed. Maybe adding some examples on usage and how
> > actionable they are as the ones shared in previous emails would be useful
> > to keep as part of the KIP.
> >
> > 3. On Lag metrics:
> > 3.1 I would suggest the following renames:
> > - TotalRemoteRecordsLag -> RemoteCopyLagRecords
> > - TotalRemoteBytesLag -> RemoteCopyLagBytes
> > - DeleteRemoteLag -> RemoteDeleteLagRecords
> > 3.2. I agree with Kamal that having a lag based on the number of segments
> > would be useful to include. Segments could give a faster proxy to
> > understand whether the lag is meaningful or not. e.g. if the number of
> > records and bytes are high, but the segment lag is only small (e.g. 1),
> it
> > may be ok; but if the number of segments is high, then it can be more
> > relevant to operators.
> > 3.3. Could we consider having the same metrics for Delete Lag as there
> are
> > for Copy Lag? i.e. including RemoteDeleteLagBytes, and (potentially)
> > RemoteDeleteLag for segments.
> > 3.4. The description of delete lag is unclear to me: I though it was
> about
> > the remote segments to be deleted (because of total retention) but not
> > deleted yet; however from description it seems that it's related to local
> > segments that are marked for deletion. Is this correct?
> >
> > 4. On Remote Delete metrics:
> > - Could we also include bytes-based metric as with Copy and Fetch? t
> would
> > be useful to know how many bytes are being deleted. If aggregated and
> > compared with copied bytes, we can get a sense of the amount of data
> stored
> > remotely, even if not exact (only considers segment size, not indexes)
> >
> > 5. On RemoteLogAuxState metrics: could you elaborate a bit more on the
> > purpose of this component and why the metrics proposed are needed?
> >
> > 6. On Total Remote Log Size metrics: similarly, could you elaborate on
> why
> > this metric is needed? I'm missing what makes this operation as relevant
> > (compared to other internals) to have some metrics attached -- maybe if
> you
> > could shared scenarios where this metrics would be useful would be
> helpful.
> >
> > 7. On the metrics naming: not sure the `Total*` prefix is really needed
> or
> > adds meaning. When I found it useful is when there are related metric
> that
> > are a subset, then the total prefix helps: e.g.
> > `TotalProduceRequestsPerSec` and `FailedProduceRequestsPerSec`
> >
> > Cheers,
> > Jorge.
> >
> >
> > On Tue, 24 Oct 2023 at 12:24, Christo Lolov <christolo...@gmail.com>
> > wrote:
> >
> > > Hello all,
> > >
> > > Now that 3.6 has been released, I would like to bring back attention to
> > the
> > > following KIP for adding metrics to tiered storage targeting 3.7 -
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-963%3A+Add+more+metrics+to+Tiered+Storage
> > > .
> > >
> > > Let me know your thoughts about the list of metrics and their
> > granularity!
> > >
> > > Best,
> > > Christo
> > >
> > > On Fri, 13 Oct 2023 at 10:14, Christo Lolov <christolo...@gmail.com>
> > > wrote:
> > >
> > > > Heya Gantigmaa,
> > > >
> > > > Apologies for the (very) late reply!
> > > >
> > > > Now that 3.6 has been released and reviewers have a bit more time I
> > will
> > > > be picking up this KIP again. I am more than happy to add useful new
> > > > metrics to the KIP, I would just ask for a couple of days to review
> > your
> > > > pull request and I will come back to you.
> > > >
> > > > Best,
> > > > Christo
> > > >
> > > > On Mon, 25 Sept 2023 at 10:49, Gantigmaa Selenge <
> gsele...@redhat.com>
> > > > wrote:
> > > >
> > > >> Hi Christo,
> > > >>
> > > >> Thank you for writing the KIP.
> > > >>
> > > >> I recently raised a PR to add metrics for tracking remote segment
> > > >> deletions
> > > >> (https://github.com/apache/kafka/pull/14375) but realised those
> > metrics
> > > >> were not mentioned in the original KIP-405 or KIP-930. Do you think
> > > these
> > > >> would make sense to be added to this KIP and get included in the
> > > >> discussion?
> > > >>
> > > >> Regards,
> > > >> Gantigmaa
> > > >>
> > > >> On Wed, Aug 9, 2023 at 1:53 PM Christo Lolov <
> christolo...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > Heya Kamal,
> > > >> >
> > > >> > Thank you for going through the KIP and for the question!
> > > >> >
> > > >> > I have been thinking about this and as an operator I might find it
> > the
> > > >> most
> > > >> > useful to know all three of them actually.
> > > >> >
> > > >> > I would find knowing the size in bytes useful to determine how
> much
> > > >> disk I
> > > >> > might need to add temporarily to compensate for the slowdown.
> > > >> > I would find knowing the number of records useful, because using
> the
> > > >> > MessagesInPerSec metric I would be able to determine how old the
> > > records
> > > >> > which are facing problems are.
> > > >> > I would find knowing the number of segments useful because I would
> > be
> > > >> able
> > > >> > to correlate this with whether I need to change
> > > >> > *remote.log.manager.task.interval.ms
> > > >> > <http://remote.log.manager.task.interval.ms> *to a lower or
> higher
> > > >> value.
> > > >> >
> > > >> > What are your thoughts on the above? Would you find some of them
> > more
> > > >> > useful than others?
> > > >> >
> > > >> > Best,
> > > >> > Christo
> > > >> >
> > > >> > On Tue, 8 Aug 2023 at 16:43, Kamal Chandraprakash <
> > > >> > kamal.chandraprak...@gmail.com> wrote:
> > > >> >
> > > >> > > Hi Christo,
> > > >> > >
> > > >> > > Thanks for the KIP!
> > > >> > >
> > > >> > > The proposed tiered storage metrics are useful. The unit
> mentioned
> > > in
> > > >> the
> > > >> > > KIP is the number of records.
> > > >> > > Each topic can have varying amounts of records in a segment
> > > depending
> > > >> on
> > > >> > > the record size.
> > > >> > >
> > > >> > > Do you think having the tier-lag by number of segments (or) size
> > of
> > > >> > > segments in bytes will be useful
> > > >> > > to the operator?
> > > >> > >
> > > >> > > Thanks,
> > > >> > > Kamal
> > > >> > >
> > > >> > > On Tue, Aug 8, 2023 at 8:56 PM Christo Lolov <
> > > christolo...@gmail.com>
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Hello all!
> > > >> > > >
> > > >> > > > I would like to start a discussion for KIP-963: Upload and
> > delete
> > > >> lag
> > > >> > > > metrics in Tiered Storage (
> > > >> > https://cwiki.apache.org/confluence/x/sZGzDw
> > > >> > > ).
> > > >> > > >
> > > >> > > > The purpose of this KIP is to introduce a couple of metrics to
> > > track
> > > >> > lag
> > > >> > > > with respect to remote storage from the point of view of
> Kafka.
> > > >> > > >
> > > >> > > > Thanks in advance for leaving a review!
> > > >> > > >
> > > >> > > > Best,
> > > >> > > > Christo
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-963: Upload and delete lag metrics in Tiered Storage

Reply via email to