[ 
https://issues.apache.org/jira/browse/KAFKA-4441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804732#comment-15804732
 ] 

Edoardo Comar commented on KAFKA-4441:
--------------------------------------

Raising the severity because we've seen these spurious metric values many times 
in the systems we monitor, to the point that the metrics were no longer trusted 
unless the values persisted for some time.


> Kafka Monitoring is incorrect during rapid topic creation and deletion
> ----------------------------------------------------------------------
>
>                 Key: KAFKA-4441
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4441
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.10.0.0, 0.10.0.1
>            Reporter: Tom Crayford
>            Assignee: Edoardo Comar
>
> Kafka reports several metrics based on the state of partitions:
> UnderReplicatedPartitions
> PreferredReplicaImbalanceCount
> OfflinePartitionsCount
> All of these metrics fire when topics are rapidly created and deleted in a 
> tight loop, even though the only partitions responsible belong to topics 
> that are undergoing creation or deletion, and the cluster is otherwise stable.
> Looking through the source code, topic deletion goes through an asynchronous 
> state machine: 
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/controller/TopicDeletionManager.scala#L35.
> However, the metrics do not know about the progress of this state machine: 
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/controller/KafkaController.scala#L185
>  
> I believe the fix for this is relatively simple - we need to make the metrics 
> aware that a topic is currently undergoing deletion or creation, and only 
> include topics that are "stable".
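
The proposed fix can be illustrated with a small model. This is only a sketch 
of the idea, not Kafka's controller code: the Partition class, the 
topics_in_transition set and the function names are all hypothetical. It shows 
a partition-state metric counted with and without partitions whose topic is 
being created or deleted.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Partition:
    """Hypothetical model of a partition's state (not a Kafka class)."""
    topic: str
    replicas: Tuple[int, ...]   # assigned broker ids
    in_sync: Tuple[int, ...]    # broker ids currently in sync
    leader: Optional[int]       # None models a partition with no leader


def stable(partitions, topics_in_transition):
    """Keep only partitions whose topic is not being created or deleted."""
    return [p for p in partitions if p.topic not in topics_in_transition]


def under_replicated_count(partitions, topics_in_transition=frozenset()):
    """Count partitions with fewer in-sync replicas than assigned replicas."""
    return sum(
        1
        for p in stable(partitions, topics_in_transition)
        if len(p.in_sync) < len(p.replicas)
    )


def offline_count(partitions, topics_in_transition=frozenset()):
    """Count partitions that currently have no leader."""
    return sum(
        1
        for p in stable(partitions, topics_in_transition)
        if p.leader is None
    )


# Assumed example state: "orders" is healthy, "tmp" is midway through deletion.
partitions = [
    Partition("orders", replicas=(1, 2, 3), in_sync=(1, 2, 3), leader=1),
    Partition("tmp", replicas=(1, 2, 3), in_sync=(1,), leader=None),
]
in_transition = {"tmp"}

naive_urp = under_replicated_count(partitions)
filtered_urp = under_replicated_count(partitions, in_transition)
naive_offline = offline_count(partitions)
filtered_offline = offline_count(partitions, in_transition)
```

In this model the naive counts report one under-replicated and one offline 
partition even though the only affected topic is being deleted, while the 
filtered counts report zero for both.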



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
