[ https://issues.apache.org/jira/browse/KAFKA-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jun Rao resolved KAFKA-6753.
----------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.0

Merged the PR to trunk.

> Speed up event processing on the controller
> --------------------------------------------
>
>                 Key: KAFKA-6753
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6753
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Lucas Wang
>            Assignee: Lucas Wang
>            Priority: Minor
>             Fix For: 2.1.0
>
>         Attachments: Screen Shot 2018-04-04 at 7.08.55 PM.png
>
>
> The existing controller code updates metrics after processing every event.
> This can slow down event processing on the controller tremendously. In one
> profiling run we saw that updating metrics took nearly 100% of the CPU time
> of the controller event processing thread. The slowness can be attributed
> to two factors:
>
> 1. Each invocation to update the metrics is expensive. Calculating the
> offline partitions count requires iterating through all the partitions in
> the cluster to check whether each partition is offline, and calculating the
> preferred replica imbalance count requires iterating through all the
> partitions in the cluster to check whether each partition has a leader
> other than the preferred leader. In a large cluster the number of
> partitions can be quite large, and the controller sees all of them. Even if
> the time spent checking a single partition is small, the cumulative cost
> over so many partitions makes one invocation to update metrics quite
> expensive. One might argue that the logic for processing a single partition
> is simply not optimized. However, we checked the CPU percentage of the leaf
> nodes in the profiling result and found that inside the loops over
> collection objects, e.g. the set of all partitions, no single function
> dominates the processing. Hence the large number of partitions in a cluster
> is the main contributor to the slowness of one invocation to update the
> metrics.
>
> 2. The invocation to update metrics is called many times when there is a
> high number of events to be processed by the controller: one invocation
> after processing each event.
>
> The patch that will be submitted tries to fix bullet 2 above, i.e. it
> reduces the number of invocations to update metrics. Instead of updating
> the metrics after processing every event, we only check periodically, i.e.
> once every second, whether the metrics need to be updated.
> * If, after the previous invocation to update metrics, there have been
> other types of events that changed the controller’s state, then one second
> later the metrics will be updated.
> * If, after the previous invocation, there have been no other types of
> events, then the call to update metrics can be bypassed.
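
The approach described in the issue (record that state changed when an event is processed, then recompute the metrics at most once per interval and only if something changed) can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the code from the merged PR: the class `ThrottledMetricsUpdater`, its method names, and the simplified partition map are hypothetical, and the current time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of periodic, change-driven metric updates. */
final class ThrottledMetricsUpdater {
    private final long intervalMs;
    // Simplified stand-in for controller state: partition name -> offline?
    private final Map<String, Boolean> partitionOffline = new HashMap<>();
    private boolean stateChanged = false;
    private long lastCheckMs;
    private int offlinePartitionsCount = 0;
    private int fullScans = 0;

    ThrottledMetricsUpdater(long intervalMs, long startMs) {
        this.intervalMs = intervalMs;
        this.lastCheckMs = startMs;
    }

    /** Called for each state-changing event: cheap, no scan over partitions. */
    void onPartitionStateChange(String partition, boolean offline) {
        partitionOffline.put(partition, offline);
        stateChanged = true;
    }

    /**
     * Called from a periodic tick. Returns true only if the metrics were
     * actually recomputed, i.e. the interval elapsed and state has changed.
     */
    boolean maybeUpdateMetrics(long nowMs) {
        if (nowMs - lastCheckMs < intervalMs) {
            return false; // interval not yet elapsed
        }
        lastCheckMs = nowMs;
        if (!stateChanged) {
            return false; // nothing changed, bypass the expensive scan
        }
        int offline = 0;
        for (boolean isOffline : partitionOffline.values()) {
            if (isOffline) {
                offline++;
            }
        }
        offlinePartitionsCount = offline;
        fullScans++;
        stateChanged = false;
        return true;
    }

    int offlinePartitionsCount() {
        return offlinePartitionsCount;
    }

    int fullScans() {
        return fullScans;
    }
}
```

With this shape, a burst of N events costs N cheap flag updates plus at most one full scan per interval, rather than N full scans. The trade-off is that the reported metric values can lag the controller state by up to one interval.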