[ https://issues.apache.org/jira/browse/KAFKA-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17748079#comment-17748079 ]
Divij Vaidya commented on KAFKA-15147:
--------------------------------------

I missed this Jira earlier.

Hey [~jeqo]

Yes, both of these are important metrics. We have implemented a tier-lag metric here - [https://github.com/satishd/kafka/commit/c96d3af4d02bf515a4355b14f33793211c5b3745]

Since this is a new metric, it will require a KIP. One of my teammates will create a KIP and we can start the conversation there. We will also add delete lag in that KIP.

Meanwhile, assigning this Jira to myself.

> Measure pending and outstanding Remote Segment operations
> ---------------------------------------------------------
>
>                 Key: KAFKA-15147
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15147
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>            Reporter: Jorge Esteban Quilcate Otoya
>            Priority: Major
>              Labels: tiered-storage
>
> Remote Log Segment operations (copy/delete) are executed by the Remote
> Storage Manager, and recorded by the Remote Log Metadata Manager (e.g. the
> default TopicBasedRLMM writes state changes on remote log segments to an
> internal Kafka topic).
> As executions run, fail, and retry, it will be important to know how many
> operations are pending and outstanding over time in order to alert operators.
> Pending operations alone are not enough to alert on, as values can oscillate
> close to zero. An additional condition needs to apply (running time >
> threshold) to consider an operation outstanding.
> Proposal:
> RemoteLogManager could be extended with 2 concurrent maps
> (pendingSegmentCopies, pendingSegmentDeletes) `Map[Uuid, Long]` to record,
> per segmentId, the time when the operation started, and based on this expose
> 2 metrics per operation:
> * pendingSegmentCopies: gauge of pendingSegmentCopies map
> * outstandingSegmentCopies: loop over pending ops, and if now - startedTime >
> timeout, then outstanding++ (maybe on debug level?)
> Is this a valuable metric to add to Tiered Storage? Or is it better solved in
> a custom RLMM implementation?
> Also, does it require a KIP?
> Thanks!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
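
For illustration only, the bookkeeping described in the quoted proposal (a concurrent map from segment id to start time, a pending gauge, and an outstanding count based on a timeout) could be sketched roughly as below. This is not code from Kafka or from the linked commit: the class name `PendingOperationTracker` and its methods are hypothetical, `java.util.UUID` stands in for Kafka's `Uuid` so the snippet is self-contained, and the caller passes the clock value in explicitly.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of Kafka: records when each in-flight
// segment operation (copy or delete) started.
final class PendingOperationTracker {
    // segment id -> start time in milliseconds
    private final Map<UUID, Long> startedAtMs = new ConcurrentHashMap<>();

    // Record the start of an operation. putIfAbsent keeps the first start
    // time, so a retry of the same segment does not reset its age.
    void start(UUID segmentId, long nowMs) {
        startedAtMs.putIfAbsent(segmentId, nowMs);
    }

    // Remove the entry once the operation has finished.
    void complete(UUID segmentId) {
        startedAtMs.remove(segmentId);
    }

    // "Pending" gauge: number of operations currently in flight.
    int pendingCount() {
        return startedAtMs.size();
    }

    // "Outstanding" gauge: pending operations where now - startedTime > timeout.
    long outstandingCount(long nowMs, long timeoutMs) {
        return startedAtMs.values().stream()
                .filter(started -> nowMs - started > timeoutMs)
                .count();
    }
}
```

One instance per operation type (copies, deletes) would correspond to the two maps in the proposal. Whether the start time should survive retries, as assumed here, is a design choice the proposal leaves open.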