Thanks, Christo! 1. Agree. Having a further look into how many latency metrics are included on the broker side there are only a few of them (e.g. request lifecycle) — but seems mostly delegated to clients, or plugin in this case, to measure this.
3.2. Personally, I find the record-based lag less useful as records can't be relied as a stable unit of measure. So, if we can keep bytes- and segment-based lag, LGTM. 3.4. Agree, these metrics should be on the broker side. Though if plugin decides to take deletion as a background process, then it should have it's own metrics. That's why I was thinking the calculation should be fairly similar to the CopyLag: "these segments are available for deletion but haven't been deleted yet" 3.5. For lag metrics: could we add an explanation on how each lag will be calculated, e.g. using which values, from which components, under which circumstances do we expect these values to increase/decrease, etc. This will clarify 3.4. and make it easier to agree and eventually test. 4. Sorry I wasn't clear. I meant similar to `RemoteCopyBytesPerSec` and `RemoteFetchBytesPerSec`, we could consider to include `RemoteDeleteBytesPerSec`. 5. and 6. Thanks for the explanation! It surely benefits to have these as part of the set of metrics. Cheers, Jorge. On Mon, 30 Oct 2023 at 16:07, Christo Lolov <christolo...@gmail.com> wrote: > Heya Jorge, > > Thank you for the insightful comments! > > 1. I see a value in such latency metrics but in my opinion the correct > location for such metrics is in the plugins providing the underlying > functionality. What are your thoughts on the matter? > > 2. Okay, I will look for and adjust the formatting today/tomorrow! > > 3.1 Done. > 3.2 Sure, I will add this to the KIP later today, the suggestion makes > sense to me. However, my question is, would you still find value in > emitting metrics for all three i.e. RemoteCopyLagRecords, > RemoteCopyLagBytes and RemoteCopyLagSegments or would you only keep > RemoteCopyLagBytes and RemoteCopyLagSegments? > 3.3. Yes, RemoteDeleteLagRecords was supposed to be an equivalent of > RemoteCopyLagRecords. Once I have your opinion on 3.2 I will make the > respective changes. > 3.4. I envision these metrics to be added to Kafka rather than the plugins. > Today Kafka sends deletes to remote storage but does not know whether those > segments have been deleted immediately when the request has been sent or > have been given to a background process to carry out the actual reclamation > of space. The purpose of this metric is to give an estimate in time which > says "hey, we have called this many segments or bytes to be deleted". > > 4. I believe this goes down the same line of thinking as what you mentioned > in 3.3 - have I misunderstood something? > > 5. I have on a number of occasions found I do not have a metric to quickly > point me to what part of tiered storage functionality is experiencing an > issue, in some scenarios a follower failing to build an auxiliary state. An > increase in the number of BuildRemoteLogAuxState requests per second can > uncover problems for specific topics warranting a further investigation, > which I tend to find difficult to judge purely based on parsing log > statements. An increase in the number of errors can quickly zone in on > followers failing as part of tiered storage and point me to look in the > logs specifically for that component. > > 6. I find it useful start my investigations with respect to tiering > problems by checking the rough size distribution of topics in remote. From > then on I try to correlate whether a historically high-volume topic started > experiencing a decrease in volume due to a decrease in produce traffic to > that topic or due to an increase in lag on local storage due to the broker > slowing down for whatever reason. Besides correlation I would use such a > metric to also confirm whether my rate calculations are correct i.e. if > topic A receives X MB/s and rolls a segment every Y seconds with an upload > rate of Z MB/s do I see that much data actually being written in remote > storage. Do these two scenarios demonstrate the usefulness I would have > from such a metric and do the benefits make sense to you? > > 7. I agree. I have changed TotalRemoteLogSizeComputationTime, > TotalRemoteLogSizeBytes, and TotalRemoteLogMetadataCount to > RemoteLogSizeComputationTime, RemoteLogSizeBytes and RemoteLogMetadataCount > respectively. > > On Fri, 27 Oct 2023 at 15:24, Jorge Esteban Quilcate Otoya < > quilcate.jo...@gmail.com> wrote: > > > Hi Christo, > > > > Thanks for proposing KIP, this metrics will certainly be useful to > operate > > Kafka Tiered Storage as it becomes production-ready. > > > > 1. Given that the scope of the KIPs has grown to cover more metrics, what > > do you think about introducing latency metrics for RSM operations? > > Copy and delete time metrics are quite obvious/simple on what they > > represent; but fetch latency metrics would be helpful as remote fetching > > clients directly. e.g. having a "time to first byte" metric could help to > > know how much time is introduced by the remote tier to start serving > > results to the consumer, or measuring how long it takes to return a > > response to consumers. > > > > 2. Couple of requests/nits on the metrics table, could you: > > - highlight the names (or have them on a separate column, as you prefer) > to > > make it easier to read? If you choose to have another column, maybe sort > > them as "Name, Description, MBean" and adjust the width. > > - group the related metrics in separate groups, e.g. Lag, Remote Delete, > > Remote Log Aux State, Remote Log Size; so we can elaborate on why these > set > > of metrics are needed. Maybe adding some examples on usage and how > > actionable they are as the ones shared in previous emails would be useful > > to keep as part of the KIP. > > > > 3. On Lag metrics: > > 3.1 I would suggest the following renames: > > - TotalRemoteRecordsLag -> RemoteCopyLagRecords > > - TotalRemoteBytesLag -> RemoteCopyLagBytes > > - DeleteRemoteLag -> RemoteDeleteLagRecords > > 3.2. I agree with Kamal that having a lag based on the number of segments > > would be useful to include. Segments could give a faster proxy to > > understand whether the lag is meaningful or not. e.g. if the number of > > records and bytes are high, but the segment lag is only small (e.g. 1), > it > > may be ok; but if the number of segments is high, then it can be more > > relevant to operators. > > 3.3. Could we consider having the same metrics for Delete Lag as there > are > > for Copy Lag? i.e. including RemoteDeleteLagBytes, and (potentially) > > RemoteDeleteLag for segments. > > 3.4. The description of delete lag is unclear to me: I though it was > about > > the remote segments to be deleted (because of total retention) but not > > deleted yet; however from description it seems that it's related to local > > segments that are marked for deletion. Is this correct? > > > > 4. On Remote Delete metrics: > > - Could we also include bytes-based metric as with Copy and Fetch? t > would > > be useful to know how many bytes are being deleted. If aggregated and > > compared with copied bytes, we can get a sense of the amount of data > stored > > remotely, even if not exact (only considers segment size, not indexes) > > > > 5. On RemoteLogAuxState metrics: could you elaborate a bit more on the > > purpose of this component and why the metrics proposed are needed? > > > > 6. On Total Remote Log Size metrics: similarly, could you elaborate on > why > > this metric is needed? I'm missing what makes this operation as relevant > > (compared to other internals) to have some metrics attached -- maybe if > you > > could shared scenarios where this metrics would be useful would be > helpful. > > > > 7. On the metrics naming: not sure the `Total*` prefix is really needed > or > > adds meaning. When I found it useful is when there are related metric > that > > are a subset, then the total prefix helps: e.g. > > `TotalProduceRequestsPerSec` and `FailedProduceRequestsPerSec` > > > > Cheers, > > Jorge. > > > > > > On Tue, 24 Oct 2023 at 12:24, Christo Lolov <christolo...@gmail.com> > > wrote: > > > > > Hello all, > > > > > > Now that 3.6 has been released, I would like to bring back attention to > > the > > > following KIP for adding metrics to tiered storage targeting 3.7 - > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-963%3A+Add+more+metrics+to+Tiered+Storage > > > . > > > > > > Let me know your thoughts about the list of metrics and their > > granularity! > > > > > > Best, > > > Christo > > > > > > On Fri, 13 Oct 2023 at 10:14, Christo Lolov <christolo...@gmail.com> > > > wrote: > > > > > > > Heya Gantigmaa, > > > > > > > > Apologies for the (very) late reply! > > > > > > > > Now that 3.6 has been released and reviewers have a bit more time I > > will > > > > be picking up this KIP again. I am more than happy to add useful new > > > > metrics to the KIP, I would just ask for a couple of days to review > > your > > > > pull request and I will come back to you. > > > > > > > > Best, > > > > Christo > > > > > > > > On Mon, 25 Sept 2023 at 10:49, Gantigmaa Selenge < > gsele...@redhat.com> > > > > wrote: > > > > > > > >> Hi Christo, > > > >> > > > >> Thank you for writing the KIP. > > > >> > > > >> I recently raised a PR to add metrics for tracking remote segment > > > >> deletions > > > >> (https://github.com/apache/kafka/pull/14375) but realised those > > metrics > > > >> were not mentioned in the original KIP-405 or KIP-930. Do you think > > > these > > > >> would make sense to be added to this KIP and get included in the > > > >> discussion? > > > >> > > > >> Regards, > > > >> Gantigmaa > > > >> > > > >> On Wed, Aug 9, 2023 at 1:53 PM Christo Lolov < > christolo...@gmail.com> > > > >> wrote: > > > >> > > > >> > Heya Kamal, > > > >> > > > > >> > Thank you for going through the KIP and for the question! > > > >> > > > > >> > I have been thinking about this and as an operator I might find it > > the > > > >> most > > > >> > useful to know all three of them actually. > > > >> > > > > >> > I would find knowing the size in bytes useful to determine how > much > > > >> disk I > > > >> > might need to add temporarily to compensate for the slowdown. > > > >> > I would find knowing the number of records useful, because using > the > > > >> > MessagesInPerSec metric I would be able to determine how old the > > > records > > > >> > which are facing problems are. > > > >> > I would find knowing the number of segments useful because I would > > be > > > >> able > > > >> > to correlate this with whether I need to change > > > >> > *remote.log.manager.task.interval.ms > > > >> > <http://remote.log.manager.task.interval.ms> *to a lower or > higher > > > >> value. > > > >> > > > > >> > What are your thoughts on the above? Would you find some of them > > more > > > >> > useful than others? > > > >> > > > > >> > Best, > > > >> > Christo > > > >> > > > > >> > On Tue, 8 Aug 2023 at 16:43, Kamal Chandraprakash < > > > >> > kamal.chandraprak...@gmail.com> wrote: > > > >> > > > > >> > > Hi Christo, > > > >> > > > > > >> > > Thanks for the KIP! > > > >> > > > > > >> > > The proposed tiered storage metrics are useful. The unit > mentioned > > > in > > > >> the > > > >> > > KIP is the number of records. > > > >> > > Each topic can have varying amounts of records in a segment > > > depending > > > >> on > > > >> > > the record size. > > > >> > > > > > >> > > Do you think having the tier-lag by number of segments (or) size > > of > > > >> > > segments in bytes will be useful > > > >> > > to the operator? > > > >> > > > > > >> > > Thanks, > > > >> > > Kamal > > > >> > > > > > >> > > On Tue, Aug 8, 2023 at 8:56 PM Christo Lolov < > > > christolo...@gmail.com> > > > >> > > wrote: > > > >> > > > > > >> > > > Hello all! > > > >> > > > > > > >> > > > I would like to start a discussion for KIP-963: Upload and > > delete > > > >> lag > > > >> > > > metrics in Tiered Storage ( > > > >> > https://cwiki.apache.org/confluence/x/sZGzDw > > > >> > > ). > > > >> > > > > > > >> > > > The purpose of this KIP is to introduce a couple of metrics to > > > track > > > >> > lag > > > >> > > > with respect to remote storage from the point of view of > Kafka. > > > >> > > > > > > >> > > > Thanks in advance for leaving a review! > > > >> > > > > > > >> > > > Best, > > > >> > > > Christo > > > >> > > > > > > >> > > > > > >> > > > > >> > > > > > > > > > >