[ https://issues.apache.org/jira/browse/KAFKA-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588186#comment-16588186 ]
Lucas Wang commented on KAFKA-6753:
-----------------------------------

[~junrao] Personally I'm in favor of removing this metric rather than reporting a possibly incorrect or stale value. I can start the KIP to collect more feedback.
Meanwhile I'll keep this ticket open for tracking progress of the remaining work.

> Speed up event processing on the controller
> --------------------------------------------
>
>                 Key: KAFKA-6753
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6753
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Lucas Wang
>            Assignee: Lucas Wang
>            Priority: Minor
>             Fix For: 2.1.0
>
>         Attachments: Screen Shot 2018-04-04 at 7.08.55 PM.png
>
>
> The existing controller code updates metrics after processing every event.
> This can slow down event processing on the controller tremendously. In one
> profiling run we saw that updating metrics took nearly 100% of the CPU for
> the controller event processing thread. Specifically, the slowness can be
> attributed to two factors:
> 1. Each invocation to update the metrics is expensive. Calculating the
> offline partitions count requires iterating through all the partitions in
> the cluster to check whether each partition is offline, and calculating the
> preferred replica imbalance count requires iterating through all the
> partitions in the cluster to check whether each partition has a leader other
> than the preferred leader. In a large cluster, the number of partitions can
> be quite large, and all of them are seen by the controller. Even if the time
> spent checking a single partition is small, the cumulative effect of so many
> partitions can make one invocation to update the metrics quite expensive.
> One might argue that the logic for processing a single partition is simply
> not optimized. However, we checked the CPU percentage of the leaf nodes in
> the profiling result and found that inside the loops over collection
> objects, e.g. the set of all partitions, no single function dominates the
> processing. Hence the large number of partitions in a cluster is the main
> contributor to the slowness of one invocation to update the metrics.
> 2. The invocation to update metrics is called many times when there is a
> high number of events to be processed by the controller: one invocation
> after processing each event.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
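
The two factors in the quoted description multiply, which a small model makes concrete. The following is only a rough sketch in Python, not the controller's actual Scala code: the Partition class, the use of -1 to mark an offline partition, the choice to exclude offline partitions from the imbalance count, and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Partition:
    # Hypothetical encoding: leader == -1 means the partition is offline.
    leader: int
    preferred_leader: int


def update_metrics(partitions: List[Partition]) -> Tuple[int, int, int]:
    """Recompute both metrics with a full scan of all partitions.

    Returns (offline_count, imbalance_count, partition_checks).
    """
    # Factor 1: each metric needs its own pass over every partition.
    offline = sum(1 for p in partitions if p.leader == -1)
    imbalanced = sum(
        1 for p in partitions
        if p.leader != -1 and p.leader != p.preferred_leader
    )
    return offline, imbalanced, 2 * len(partitions)


def process_events(num_events: int, partitions: List[Partition]) -> int:
    """Process events, updating metrics after each one.

    Returns the total number of partition checks spent on metrics.
    """
    total_checks = 0
    for _ in range(num_events):
        # (Event handling itself is omitted in this sketch.)
        # Factor 2: the full scan is repeated after every single event.
        _, _, checks = update_metrics(partitions)
        total_checks += checks
    return total_checks


if __name__ == "__main__":
    parts = [Partition(0, 0), Partition(1, 0), Partition(-1, 2)]
    print(update_metrics(parts))
    print(process_events(5, parts))
```

In this model the metric work grows as (number of events) x (number of partitions), so a burst of events on a cluster with many partitions spends most of its time in the scans rather than in the events themselves.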