George Wu created KAFKA-19484:
---------------------------------

             Summary: Tiered Storage Quota Metrics can stop reporting
                 Key: KAFKA-19484
                 URL: https://issues.apache.org/jira/browse/KAFKA-19484
             Project: Kafka
          Issue Type: Bug
          Components: Tiered-Storage
    Affects Versions: 4.0.0, 3.9.0
         Environment: Ubuntu 22, Amazon Corretto Java 17
            Reporter: George Wu

The tiered storage throttle metrics (introduced as part of [KIP-956|https://cwiki.apache.org/confluence/display/KAFKA/KIP-956+Tiered+Storage+Quotas]) can stop reporting if the relevant tiered storage operation (copy or fetch) stays idle for longer than the sensor expiry timeout of one hour.

RemoteLogManager maintains a static reference to the sensors used for metric reporting. This is a problem because the default sensor expiry time is one hour and nothing is responsible for handling expired sensors. Once the sensors expire, RemoteLogManager keeps recording through its static references to sensor objects that have already been cleaned up by the ExpireSensorTask, so the recorded values are never reported.

This issue affects fetch metrics far more often than copy metrics. The copy sensors only go idle when the topics stop being produced to, whereas remote fetches are often intermittent: backfilling from the earliest offset through tiered storage is a common use case, and the fetch sensors sit idle between such backfills.

*Reproduction*
* Generate some tiered storage fetch traffic on a topic. Confirm that the remote-fetch-throttle-time-avg/max metrics are being reported.
* Remove the consumer workload that triggers the tiered storage fetch traffic. Wait for one hour (the sensor expiration period).
* Generate some more tiered storage fetch traffic. The metrics are no longer reported.
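
The failure mode described above can be illustrated with a small, self-contained toy model. This is only a sketch: ToySensor, ToyRegistry and the sensor name used here are hypothetical stand-ins, not Kafka's Metrics or Sensor classes, and the expiry step is called by hand instead of running as a background task. It compares a caller that caches the sensor reference (as the report says RemoteLogManager does) with a caller that looks the sensor up by name before every record. The second approach is shown as one possible mitigation, not necessarily the fix that will be adopted.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: these classes are hypothetical and are NOT Kafka's Metrics/Sensor API.
final class StaleSensorDemo {

    // One hour, matching the sensor expiry period mentioned in the report.
    static final long EXPIRY_MS = 60L * 60L * 1000L;

    // Hypothetical sensor name, used only as a registry key in this sketch.
    static final String NAME = "remote-fetch-throttle-time";

    /** Minimal sensor: tracks a max value and the time of the last record. */
    static final class ToySensor {
        long lastRecordMs;
        double max;

        ToySensor(long nowMs) {
            this.lastRecordMs = nowMs;
        }

        void record(double value, long nowMs) {
            max = Math.max(max, value);
            lastRecordMs = nowMs;
        }
    }

    /** Minimal registry: only sensors present in the map count as "reported". */
    static final class ToyRegistry {
        private final Map<String, ToySensor> sensors = new HashMap<>();

        // Get-or-create by name.
        ToySensor sensor(String name, long nowMs) {
            return sensors.computeIfAbsent(name, n -> new ToySensor(nowMs));
        }

        // Stand-in for a periodic expiry task: drops sensors idle longer than EXPIRY_MS.
        void expireIdle(long nowMs) {
            sensors.values().removeIf(s -> nowMs - s.lastRecordMs > EXPIRY_MS);
        }

        boolean isReported(String name) {
            return sensors.containsKey(name);
        }
    }

    /** Caller keeps a long-lived reference to the sensor across the idle period. */
    static boolean reportedWithCachedReference() {
        ToyRegistry registry = new ToyRegistry();
        ToySensor cached = registry.sensor(NAME, 0L);
        cached.record(5.0, 0L);

        long later = EXPIRY_MS + 1;   // idle for just over one hour
        registry.expireIdle(later);   // the registry drops the idle sensor
        cached.record(7.0, later);    // updates an object the registry no longer knows about
        return registry.isReported(NAME);
    }

    /** Caller looks the sensor up by name before each record. */
    static boolean reportedWithLookupPerRecord() {
        ToyRegistry registry = new ToyRegistry();
        registry.sensor(NAME, 0L).record(5.0, 0L);

        long later = EXPIRY_MS + 1;
        registry.expireIdle(later);
        registry.sensor(NAME, later).record(7.0, later); // sensor is re-created on demand
        return registry.isReported(NAME);
    }

    public static void main(String[] args) {
        System.out.println("cached reference reported: " + reportedWithCachedReference());
        System.out.println("lookup per record reported: " + reportedWithLookupPerRecord());
    }
}
```

In this toy model the cached-reference path ends with the sensor missing from the registry (the first line prints false), while the lookup-per-record path re-registers it (the second line prints true). Whether a get-or-create on every record is acceptable on the real hot path, or whether the sensors should instead be created without an expiry, is a design choice for the actual fix.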