[ 
https://issues.apache.org/jira/browse/KAFKA-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17748079#comment-17748079
 ] 

Divij Vaidya commented on KAFKA-15147:
--------------------------------------

I missed this Jira earlier. 

Hey [~jeqo] 
Yes, both of these are important metrics. We have implemented a tier-lag metric 
here: 
[https://github.com/satishd/kafka/commit/c96d3af4d02bf515a4355b14f33793211c5b3745]
 
Since this is a new metric, it will require a KIP. One of my teammates will 
create a KIP and we can start the conversation there.

We will also add delete lag to that KIP.

Meanwhile, assigning this Jira to myself.

> Measure pending and outstanding Remote Segment operations
> ---------------------------------------------------------
>
>                 Key: KAFKA-15147
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15147
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>            Reporter: Jorge Esteban Quilcate Otoya
>            Priority: Major
>              Labels: tiered-storage
>
> Remote Log Segment operations (copy/delete) are executed by the Remote 
> Storage Manager, and recorded by Remote Log Metadata Manager (e.g. default 
> TopicBasedRLMM writes to the internal Kafka topic state changes on remote log 
> segments).
> As executions run, fail, and retry; it will be important to know how many 
> operations are pending and outstanding over time to alert operators.
> Pending operations are not enough to alert, as values can oscillate closer to 
> zero. An additional condition needs to apply (running time > threshold) to 
> consider an operation outstanding.
> Proposal:
> RemoteLogManager could be extended with 2 concurrent maps 
> (pendingSegmentCopies, pendingSegmentDeletes) `Map[Uuid, Long]` to measure 
> segmentId time when operation started, and based on this expose 2 metrics per 
> operation:
>  * pendingSegmentCopies: gauge of pendingSegmentCopies map
>  * outstandingSegmentCopies: loop over pending ops, and if now - startedTime 
> > timeout, then outstanding++ (maybe on debug level?)
> Is this a valuable metric to add to Tiered Storage? or better to solve on a 
> custom RLMM implementation?
> Also, does it require a KIP?
> Thanks!
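
For illustration only, the bookkeeping described in the quoted proposal could be 
sketched roughly as follows. This is not the tier-lag implementation linked 
above and not existing Kafka code: the class name PendingSegmentOps, its method 
names, and the use of java.util.UUID in place of Kafka's Uuid are all 
assumptions made to keep the sketch self-contained.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of Kafka. java.util.UUID stands in for Kafka's Uuid.
final class PendingSegmentOps {
    // Segment id -> timestamp (ms) at which the operation was first started.
    private final Map<UUID, Long> startedAtMs = new ConcurrentHashMap<>();
    // An operation pending for longer than this is counted as outstanding.
    private final long timeoutMs;

    PendingSegmentOps(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Record the start of an operation; a retry keeps the original start time.
    void started(UUID segmentId, long nowMs) {
        startedAtMs.putIfAbsent(segmentId, nowMs);
    }

    // Remove the operation once it has completed.
    void finished(UUID segmentId) {
        startedAtMs.remove(segmentId);
    }

    // Value for a "pending" gauge: number of operations currently in flight.
    int pendingCount() {
        return startedAtMs.size();
    }

    // Value for an "outstanding" gauge: pending operations with now - startedTime > timeout.
    long outstandingCount(long nowMs) {
        return startedAtMs.values().stream()
                .filter(startMs -> nowMs - startMs > timeoutMs)
                .count();
    }
}
```

Under these assumptions, one instance would be kept for copies and another for 
deletes, with both counts exposed as gauges. Whether a retry should keep the 
original start time or reset it is a design choice the proposal leaves open.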



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
