[jira] [Created] (KAFKA-15969) Align RemoteStorageThreadPool metrics name with KIP-405

2023-12-04 Thread Lixin Yao (Jira)
Lixin Yao created KAFKA-15969:
-

 Summary: Align RemoteStorageThreadPool metrics name with KIP-405
 Key: KAFKA-15969
 URL: https://issues.apache.org/jira/browse/KAFKA-15969
 Project: Kafka
  Issue Type: Bug
  Components: metrics
Affects Versions: 3.6.0
Reporter: Lixin Yao
 Fix For: 3.7.0


KIP-405 defines the following two metrics:
kafka.log.remote:type=RemoteStorageThreadPool, name=RemoteLogReaderTaskQueueSize
and
kafka.log.remote:type=RemoteStorageThreadPool, name=RemoteLogReaderAvgIdlePercent

However, in the Kafka 3.6 release, the metrics actually exposed are:
org.apache.kafka.storage.internals.log:name=RemoteLogReaderAvgIdlePercent,type=RemoteStorageThreadPool
org.apache.kafka.storage.internals.log:name=RemoteLogReaderTaskQueueSize,type=RemoteStorageThreadPool

The problem is that the bean domain name has changed from kafka.log.remote to 
org.apache.kafka.storage.internals.log. The type name has also changed.

We should either update the metrics path in the KIP or fix the path in the code.
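
To make the mismatch easy to check, here is a minimal sketch that splits two 
MBean names into domain and key properties and reports which parts differ. The 
helper names (parse_mbean_name, diff_mbean_names) are hypothetical and not part 
of Kafka, and the parser is deliberately simplified: it assumes no quoted 
values or commas inside values.

```python
def parse_mbean_name(name):
    """Split 'domain:key=value,key=value' into (domain, {key: value})."""
    domain, _, props = name.partition(":")
    pairs = (item.split("=", 1) for item in props.split(","))
    # Strip whitespace so "type=X, name=Y" and "type=X,name=Y" parse alike.
    return domain.strip(), {k.strip(): v.strip() for k, v in pairs}


def diff_mbean_names(expected, actual):
    """Return the components ('domain' or a property key) that differ."""
    exp_domain, exp_props = parse_mbean_name(expected)
    act_domain, act_props = parse_mbean_name(actual)
    diffs = []
    if exp_domain != act_domain:
        diffs.append("domain")
    for key in sorted(set(exp_props) | set(act_props)):
        if exp_props.get(key) != act_props.get(key):
            diffs.append(key)
    return diffs


# Names as quoted in this issue.
KIP_405_NAME = (
    "kafka.log.remote:type=RemoteStorageThreadPool, "
    "name=RemoteLogReaderTaskQueueSize"
)
KAFKA_3_6_NAME = (
    "org.apache.kafka.storage.internals.log:"
    "name=RemoteLogReaderTaskQueueSize,type=RemoteStorageThreadPool"
)

print(diff_mbean_names(KIP_405_NAME, KAFKA_3_6_NAME))
```

The comparison ignores the order of the key properties, since property order 
does not affect the identity of a JMX ObjectName.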



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15214) Add metrics for OffsetOutOfRangeException when tiered storage is enabled

2023-07-18 Thread Lixin Yao (Jira)
Lixin Yao created KAFKA-15214:
-

 Summary: Add metrics for OffsetOutOfRangeException when tiered 
storage is enabled
 Key: KAFKA-15214
 URL: https://issues.apache.org/jira/browse/KAFKA-15214
 Project: Kafka
  Issue Type: Improvement
  Components: metrics
Affects Versions: 3.6.0
Reporter: Lixin Yao
 Fix For: 3.6.0


The current RemoteReadErrorsPerSec metric does not include the exception type 
OffsetOutOfRangeException.


In our testing with the tiered storage feature, we noticed several cases where 
remote downloads were affected and got stuck because of repeated 
OffsetOutOfRangeException errors on particular brokers or topic partitions. The 
root cause can vary, but without a metric it is currently very hard to catch 
this issue and debug it in a timely fashion. We understand that the exception 
itself may not be the root cause, but a metric for it would be a good signal 
for us to alert on and investigate.
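
As an illustration of the request, here is a minimal sketch of an error meter 
that counts every exception type on the remote read path, including the 
out-of-range case. It is not Kafka code: OffsetOutOfRangeError, 
RemoteReadErrorMeter and read_remote are hypothetical stand-ins chosen for this 
example.

```python
from collections import Counter


class OffsetOutOfRangeError(Exception):
    """Hypothetical stand-in for Kafka's OffsetOutOfRangeException."""


class RemoteReadErrorMeter:
    """Hypothetical meter that counts remote read errors by exception type."""

    def __init__(self):
        self.by_type = Counter()

    def mark(self, exc):
        self.by_type[type(exc).__name__] += 1


def read_remote(fetch, meter):
    """Run a remote fetch, recording any exception before re-raising it."""
    try:
        return fetch()
    except Exception as exc:
        # Count every exception type, so out-of-range errors are visible too.
        meter.mark(exc)
        raise
```

With a counter like this, an alert could fire when the count for the 
out-of-range type keeps growing for a given broker or topic partition.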

Related discussion:
https://github.com/apache/kafka/pull/13944#discussion_r1266243006

I am happy to contribute this if the request is accepted.


