[ 
https://issues.apache.org/jira/browse/KAFKA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744347#comment-17744347
 ] 

Lixin Yao commented on KAFKA-15214:
-----------------------------------

Here is one of the scenarios that motivated me to create this request. While 
testing the tiered storage feature, I noticed an unbalanced bytes-out traffic 
rate across brokers. The available metrics did not explain it, because all 
topic partitions should be balanced across the cluster without any skew. I 
then checked the existing error metric RemoteReadErrorsPerSec, which reported 
only 0. That gave me the impression that there was no problem with remote 
downloads, so I had to dig into the logs for more information. In the end, I 
found that one broker was unable to fetch remote segments because an 
OffsetOutOfRangeException kept occurring. 
An example of the error:
{code:java}
2023-07-15 00:12:42,471 kafkaLogLevel="INFO" [RemoteLogReader-2]: 
OffsetOutOfRangeException occurred while reading the remote data for 
mytopic-247: org.apache.kafka.common.errors.OffsetOutOfRangeException: Received 
request for offset 0 for leader epoch 0 and partition mytopic-247 which does 
not exist in remote tier. Try again later. 
kafkaLoggerClass="kafka.log.remote.RemoteLogReader" 
kafkaLoggerThread="RemoteLogReader-2"   {code}
As I said, this partition could be requesting offset 0 repeatedly for some 
other reason, e.g. corrupted metadata or other issues, but if this error were 
included in the RemoteReadErrorsPerSec metric, it would help me a lot in 
identifying the root cause and setting up alerting.
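
Until such a metric exists, the log dive described above can be partly automated. The following is only a rough sketch, not anything from Kafka itself: the class name RemoteReadLogScan and the method countByPartition are hypothetical, and it assumes each log record sits on a single line and uses the message wording from the example above.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Kafka: counts OffsetOutOfRangeException
// log records per topic partition.
final class RemoteReadLogScan {
    // Assumes the message wording of the RemoteLogReader line quoted above;
    // group 1 captures the partition name that precedes the colon.
    private static final Pattern OUT_OF_RANGE = Pattern.compile(
        "OffsetOutOfRangeException occurred while reading the remote data for\\s+(\\S+?):");

    // Returns partition name -> number of matching log records.
    static Map<String, Integer> countByPartition(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = OUT_OF_RANGE.matcher(line);
            while (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "[RemoteLogReader-2]: OffsetOutOfRangeException occurred while reading "
                + "the remote data for mytopic-247: org.apache.kafka.common.errors."
                + "OffsetOutOfRangeException: Received request for offset 0",
            "[RemoteLogReader-1]: some unrelated log record");
        System.out.println(countByPartition(sample));
    }
}
```

A partition whose count keeps growing between scans is a candidate for the stuck remote fetch described above.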

Hope this makes sense to you. I am OK with including this as a tag on the 
existing metric. As long as I have a way to quickly identify and alert on the 
abnormal behavior, I am fine with it.
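
To make the "tag on the existing metric" idea concrete, here is a minimal sketch of an error counter keyed by exception type. It is not Kafka's metrics code: the class RemoteReadErrorCounters and its methods are hypothetical, and the nested OffsetOutOfRangeException is only a stand-in for org.apache.kafka.common.errors.OffsetOutOfRangeException so the example runs without Kafka on the classpath.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch, not Kafka's implementation: one overall error count
// plus a per-exception-type breakdown that could be exposed as a metric tag.
final class RemoteReadErrorCounters {
    // Stand-in for org.apache.kafka.common.errors.OffsetOutOfRangeException.
    static final class OffsetOutOfRangeException extends RuntimeException {
        OffsetOutOfRangeException(String message) {
            super(message);
        }
    }

    private final LongAdder total = new LongAdder();
    private final ConcurrentHashMap<String, LongAdder> byType = new ConcurrentHashMap<>();

    // Records one failed remote read, including out-of-range errors.
    void record(Throwable error) {
        total.increment();
        byType.computeIfAbsent(error.getClass().getSimpleName(), k -> new LongAdder())
              .increment();
    }

    long total() {
        return total.sum();
    }

    // Count for one exception type (the "tag" value); 0 if never seen.
    long count(String exceptionType) {
        LongAdder adder = byType.get(exceptionType);
        return adder == null ? 0L : adder.sum();
    }

    public static void main(String[] args) {
        RemoteReadErrorCounters counters = new RemoteReadErrorCounters();
        counters.record(new OffsetOutOfRangeException("offset 0 not in remote tier"));
        counters.record(new IllegalStateException("some other remote read failure"));
        System.out.println("total=" + counters.total()
            + " outOfRange=" + counters.count("OffsetOutOfRangeException"));
    }
}
```

With a breakdown like this, an alert can fire on the out-of-range count alone, while the overall error rate still reflects every failed remote read.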

> Add metrics for OffsetOutOfRangeException when tiered storage is enabled
> ------------------------------------------------------------------------
>
>                 Key: KAFKA-15214
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15214
>             Project: Kafka
>          Issue Type: Improvement
>          Components: metrics
>    Affects Versions: 3.6.0
>            Reporter: Lixin Yao
>            Priority: Minor
>              Labels: KIP-405
>             Fix For: 3.6.0
>
>
> The current metric RemoteReadErrorsPerSec does not include the exception 
> type OffsetOutOfRangeException.
> In our testing of the tiered storage feature (at Apple), we noticed several 
> cases where remote download was affected and stuck due to repeated 
> OffsetOutOfRangeException errors on particular brokers or topic partitions. 
> The root cause can vary, but currently, without a metric, it's very hard to 
> catch this issue and debug it in a timely fashion. It's understandable that 
> the exception itself may not be the root cause, but a metric for it would be 
> a good signal for us to alert on and investigate.
> Related discussion
> [https://github.com/apache/kafka/pull/13944#discussion_r1266243006]
> I am happy to contribute to this if the request is agreed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
