[ 
https://issues.apache.org/jira/browse/KAFKA-19484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18003877#comment-18003877
 ] 

George Wu commented on KAFKA-19484:
-----------------------------------

https://github.com/apache/kafka/pull/20129

> Tiered Storage Quota Metrics can stop reporting
> -----------------------------------------------
>
>                 Key: KAFKA-19484
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19484
>             Project: Kafka
>          Issue Type: Bug
>          Components: Tiered-Storage
>    Affects Versions: 3.9.0, 4.0.0
>         Environment: Ubuntu 22, Amazon Corretto Java 17
>            Reporter: George Wu
>            Priority: Minor
>
> It is possible for tiered storage throttle metrics (introduced as a part of 
> [KIP-956|https://cwiki.apache.org/confluence/display/KAFKA/KIP-956+Tiered+Storage+Quotas])
>  to stop reporting if the relevant tiered storage operation (copy/fetch) goes 
> idle for longer than the sensor expiry timeout of one hour.
>  
> RemoteLogManager maintains a static reference to the sensors used for metric 
> reporting. This is a problem because the default sensor expiry time is one 
> hour and there is nothing responsible for handling expired sensors. If the 
> sensors expire, RemoteLogManager will continue producing metrics through its 
> static references to sensor objects that have already been cleaned up by the 
> ExpireSensorTask.
>  
> This issue tends to affect fetch metrics far more than copy metrics because 
> the copy sensors don't go idle unless the topics stop being produced to. In 
> contrast, backfilling from the earliest offset using tiered storage is a 
> fairly common use case, and remote fetch traffic is idle between backfills.
>  
> *Reproduction*
>  * Generate some amount of tiered storage fetch traffic on a topic. Confirm 
> the remote-fetch-throttle-time-avg/max metrics are being reported.
>  * Remove the consumer workload that triggers the tiered storage fetch 
> traffic. Wait for one hour (the sensor expiration period).
>  * Generate some more tiered storage fetch traffic. The metrics are no 
> longer reported.
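
The failure mode described above can be sketched in plain Java without any
Kafka dependency. Everything here is a hypothetical stand-in: ToySensor and
ToyRegistry are not Kafka's Sensor or Metrics classes, and the "look up on
every record" variant is only one possible mitigation, not necessarily what
the linked pull request does. The sketch assumes a one-hour expiry, and it
assumes that the registry silently drops idle sensors while callers may still
hold references to them.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class StaleSensorDemo {
    static final long HOUR_MS = 3_600_000L;
    static final String NAME = "remote-fetch-throttle-time";

    /** Hypothetical stand-in for a metrics sensor (not Kafka's Sensor). */
    static final class ToySensor {
        long lastRecordMs;
        int recorded;

        ToySensor(long nowMs) { this.lastRecordMs = nowMs; }

        void record(long nowMs) { recorded++; lastRecordMs = nowMs; }
    }

    /** Hypothetical registry that drops sensors idle for longer than expiryMs. */
    static final class ToyRegistry {
        private final Map<String, ToySensor> sensors = new ConcurrentHashMap<>();
        private final long expiryMs;

        ToyRegistry(long expiryMs) { this.expiryMs = expiryMs; }

        ToySensor getOrCreate(String name, long nowMs) {
            return sensors.computeIfAbsent(name, n -> new ToySensor(nowMs));
        }

        /** Plays the role of the periodic expiry task. */
        void expireIdle(long nowMs) {
            sensors.values().removeIf(s -> nowMs - s.lastRecordMs > expiryMs);
        }

        boolean isReported(String name) { return sensors.containsKey(name); }
    }

    /** Caches the sensor once, then records through it after it has expired. */
    static boolean cachedReferenceStillReports() {
        ToyRegistry registry = new ToyRegistry(HOUR_MS);
        ToySensor cached = registry.getOrCreate(NAME, 0L);
        cached.record(0L);                 // initial fetch traffic
        registry.expireIdle(HOUR_MS + 1);  // idle for more than one hour
        cached.record(HOUR_MS + 2);        // lands on an unregistered object
        return registry.isReported(NAME);
    }

    /** Looks the sensor up on every record, so an expired one is recreated. */
    static boolean lookupPerRecordStillReports() {
        ToyRegistry registry = new ToyRegistry(HOUR_MS);
        registry.getOrCreate(NAME, 0L).record(0L);
        registry.expireIdle(HOUR_MS + 1);
        registry.getOrCreate(NAME, HOUR_MS + 2).record(HOUR_MS + 2);
        return registry.isReported(NAME);
    }

    public static void main(String[] args) {
        System.out.println("cached reference reports: " + cachedReferenceStillReports());
        System.out.println("lookup per record reports: " + lookupPerRecordStillReports());
    }
}
```

In this toy model the cached reference keeps accepting values, but the
registry no longer knows about the sensor, which mirrors the three
reproduction steps: traffic, an idle hour, then traffic again with nothing
reported.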



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
