George Wu created KAFKA-19484:
---------------------------------

             Summary: Tiered Storage Quota Metrics can stop reporting
                 Key: KAFKA-19484
                 URL: https://issues.apache.org/jira/browse/KAFKA-19484
             Project: Kafka
          Issue Type: Bug
          Components: Tiered-Storage
    Affects Versions: 4.0.0, 3.9.0
         Environment: Ubuntu 22, Amazon Corretto Java 17
            Reporter: George Wu


It is possible for tiered storage throttle metrics (introduced as a part of 
[KIP-956|https://cwiki.apache.org/confluence/display/KAFKA/KIP-956+Tiered+Storage+Quotas])
 to stop reporting if the relevant tiered storage operation (copy/fetch) goes 
idle for longer than the sensor expiry timeout of one hour.


RemoteLogManager maintains a static reference to the sensors used for metric 
reporting. This is a problem because the default sensor expiry time is one hour 
and nothing recreates or re-registers expired sensors. If the sensors 
expire, RemoteLogManager continues recording through its static 
references to sensor objects that the ExpireSensorTask has already cleaned up, 
so the recorded values are no longer reported.
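
The failure mode can be illustrated with a minimal, self-contained sketch. It 
does not use Kafka's classes: ToySensor, ToyRegistry and the sensor name 
"remote-fetch-throttle-time" are hypothetical stand-ins for the sensor, the 
metrics registry and the expiry task, and the one-hour expiry is hard-coded. It 
only shows why a cached reference goes silent after expiry while a lookup on 
each record does not; it is not a proposed patch.

```java
import java.util.HashMap;
import java.util.Map;

public class SensorExpirySketch {

    static final long ONE_HOUR_MS = 60L * 60L * 1000L;
    static final String NAME = "remote-fetch-throttle-time"; // hypothetical sensor name

    /** Toy sensor (hypothetical): counts records and remembers its last use. */
    static final class ToySensor {
        long lastRecordMs;
        int count;

        ToySensor(long nowMs) {
            this.lastRecordMs = nowMs;
        }

        void record(long nowMs) {
            count++;
            lastRecordMs = nowMs;
        }
    }

    /** Toy registry (hypothetical): only sensors still in the map are reported. */
    static final class ToyRegistry {
        private final Map<String, ToySensor> sensors = new HashMap<>();

        /** Returns the registered sensor, creating it if it is missing. */
        ToySensor sensor(String name, long nowMs) {
            return sensors.computeIfAbsent(name, n -> new ToySensor(nowMs));
        }

        /** Stand-in for the expiry task: drops sensors idle for over one hour. */
        void expireIdle(long nowMs) {
            sensors.values().removeIf(s -> nowMs - s.lastRecordMs > ONE_HOUR_MS);
        }

        boolean isReported(String name) {
            return sensors.containsKey(name);
        }
    }

    /** Holds one reference forever, as the issue describes. */
    static boolean cachedReferenceStillReports() {
        ToyRegistry registry = new ToyRegistry();
        ToySensor cached = registry.sensor(NAME, 0L);
        cached.record(0L);

        long later = 2 * ONE_HOUR_MS;   // fetch traffic goes idle for two hours
        registry.expireIdle(later);     // the sensor is removed from the registry
        cached.record(later);           // recorded on an orphaned object
        return registry.isReported(NAME);
    }

    /** Looks the sensor up (and re-creates it if needed) on every record. */
    static boolean lookupPerRecordStillReports() {
        ToyRegistry registry = new ToyRegistry();
        registry.sensor(NAME, 0L).record(0L);

        long later = 2 * ONE_HOUR_MS;
        registry.expireIdle(later);
        registry.sensor(NAME, later).record(later); // re-registered here
        return registry.isReported(NAME);
    }

    public static void main(String[] args) {
        System.out.println("cached reference reports after idle: "
                + cachedReferenceStillReports());
        System.out.println("lookup per record reports after idle: "
                + lookupPerRecordStillReports());
    }
}
```

In this toy model the cached reference keeps accepting records, but nothing 
reads them once the registry has dropped the sensor. Whether the actual fix 
should re-create sensors on demand, disable expiry for these sensors, or do 
something else is left open here.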


This issue tends to affect fetch metrics far more than copy metrics because 
the copy sensors don't go idle unless the topics stop being produced to. In 
contrast, tiered storage fetches are often intermittent: backfilling from the 
earliest offset is a common use case.


*Reproduction*
 * Generate some amount of tiered storage fetch traffic on a topic. Confirm the 
remote-fetch-throttle-time-avg/max metrics are being reported.
 * Remove the consumer workload that triggers the tiered storage fetch traffic. 
Wait for more than one hour (the sensor expiration period).
 * Generate some more tiered storage fetch traffic. The metrics are no longer 
reported.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
