[ https://issues.apache.org/jira/browse/KAFKA-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17796808#comment-17796808 ]
Satish Duggana commented on KAFKA-15147:
----------------------------------------

[~enether] These are minor improvements; we can target them to 3.7.0. Christo, Luke, et al. are working on PRs, and we plan to review and merge them.

> Measure pending and outstanding Remote Segment operations
> ---------------------------------------------------------
>
>                 Key: KAFKA-15147
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15147
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>            Reporter: Jorge Esteban Quilcate Otoya
>            Assignee: Christo Lolov
>            Priority: Major
>              Labels: tiered-storage
>             Fix For: 3.7.0
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-963%3A+Upload+and+delete+lag+metrics+in+Tiered+Storage
>
> KAFKA-15833: RemoteCopyLagBytes
> KAFKA-16002: RemoteCopyLagSegments, RemoteDeleteLagBytes,
> RemoteDeleteLagSegments
> KAFKA-16013: ExpiresPerSec
> KAFKA-16014: RemoteLogSizeComputationTime, RemoteLogSizeBytes,
> RemoteLogMetadataCount
> KAFKA-15158: RemoteDeleteRequestsPerSec, RemoteDeleteErrorsPerSec,
> BuildRemoteLogAuxStateRequestsPerSec, BuildRemoteLogAuxStateErrorsPerSec
> ====
> Remote Log Segment operations (copy/delete) are executed by the Remote
> Storage Manager and recorded by the Remote Log Metadata Manager (e.g. the
> default TopicBasedRLMM writes state changes on remote log segments to an
> internal Kafka topic).
> As executions run, fail, and retry, it will be important to know how many
> operations are pending and outstanding over time, so that operators can be
> alerted.
> The number of pending operations alone is not enough to alert on, as the
> value can oscillate close to zero. An additional condition (running time >
> threshold) needs to apply for an operation to be considered outstanding.
> Proposal:
> RemoteLogManager could be extended with 2 concurrent maps
> (pendingSegmentCopies, pendingSegmentDeletes) of type `Map[Uuid, Long]`,
> mapping each segmentId to the time its operation started, and based on these
> expose 2 metrics per operation:
>  * pendingSegmentCopies: gauge of the size of the pendingSegmentCopies map
>  * outstandingSegmentCopies: loop over pending ops, and if now - startedTime
> > timeout, then outstanding++ (maybe on debug level?)
> Is this a valuable metric to add to Tiered Storage, or is it better solved in
> a custom RLMM implementation?
> Also, does it require a KIP?
> Thanks!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
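
The bookkeeping described in the proposal could look roughly like the sketch below. It is only an illustration under stated assumptions, not code from RemoteLogManager or from any of the linked PRs: the class name PendingSegmentOps and its methods are hypothetical, java.util.UUID stands in for Kafka's Uuid type, and keeping the first start time across retries is a choice made here rather than something the issue specifies.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical helper, not part of Kafka: remembers when each remote segment
 * operation (copy or delete) started, keyed by segment id.
 * java.util.UUID is used here in place of Kafka's own Uuid type.
 */
final class PendingSegmentOps {
    // segment id -> start time of the in-flight operation, in milliseconds
    private final Map<UUID, Long> startedAtMs = new ConcurrentHashMap<>();

    /** Record that an operation on the segment started at nowMs. */
    void started(UUID segmentId, long nowMs) {
        // Assumption: keep the first start time so a retry does not reset the clock.
        startedAtMs.putIfAbsent(segmentId, nowMs);
    }

    /** Record that the operation on the segment is no longer in flight. */
    void finished(UUID segmentId) {
        startedAtMs.remove(segmentId);
    }

    /** "Pending" gauge: operations that started and have not finished. */
    int pending() {
        return startedAtMs.size();
    }

    /** "Outstanding" gauge: pending operations with now - startedTime > timeout. */
    long outstanding(long nowMs, long timeoutMs) {
        return startedAtMs.values().stream()
                .filter(start -> nowMs - start > timeoutMs)
                .count();
    }

    public static void main(String[] args) {
        PendingSegmentOps copies = new PendingSegmentOps();
        UUID slow = UUID.randomUUID();
        UUID fast = UUID.randomUUID();
        copies.started(slow, 0L);      // started 10 s before "now"
        copies.started(fast, 9_000L);  // started 1 s before "now"
        System.out.println("pending=" + copies.pending());
        System.out.println("outstanding=" + copies.outstanding(10_000L, 5_000L));
    }
}
```

One instance per operation type (copies, deletes) would correspond to the two maps in the proposal. Timestamps are passed in explicitly so the outstanding check can be exercised without a real clock.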