[ 
https://issues.apache.org/jira/browse/KAFKA-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Wang updated KAFKA-6753:
------------------------------
    Description: 
The existing controller code updates metrics after processing every event. This 
can slow down event processing on the controller tremendously. In one profiling 
we see that updating metrics takes nearly 100% of the CPU for the controller 
event processing thread. Specifically the slowness can be attributed to two 
factors:
1. Each invocation to update the metrics is expensive. Specifically trying to 
calculate the offline partitions count requires iterating through all the 
partitions in the cluster to check if the partition is offline; and calculating 
the preferred replica imbalance count requires iterating through all the 
partitions in the cluster to check if a partition has a leader other than the 
preferred leader. In a large cluster, the number of partitions can be quite 
large, all seen by the controller. Even if the time spent to check a single 
partition is small, the accumulation effect of so many partitions in the 
cluster can make the invocation to update metrics quite expensive. One might 
argue that maybe the logic for processing each single partition is not 
optimized, we checked the CPU percentage of leaf nodes in the profiling result, 
and found that inside the loops of collection objects, e.g. the set of all 
partitions, no single function dominates the processing. Hence the large number 
of the partitions in a cluster is the main contributor to the slowness of one 
invocation to update the metrics.
2. The invocation to update metrics is called many times when the is a high 
number of events to be processed by the controller, one invocation after 
processing any event.

This ticket is used to track how we change the logic for the 
OfflinePartitionCount metric.
And KAFKA-7350 will be used to track progress for the 
PreferredReplicaImbalanceCount metric.



  was:
The existing controller code updates metrics after processing every event. This 
can slow down event processing on the controller tremendously. In one profiling 
we see that updating metrics takes nearly 100% of the CPU for the controller 
event processing thread. Specifically the slowness can be attributed to two 
factors:
1. Each invocation to update the metrics is expensive. Specifically trying to 
calculate the offline partitions count requires iterating through all the 
partitions in the cluster to check if the partition is offline; and calculating 
the preferred replica imbalance count requires iterating through all the 
partitions in the cluster to check if a partition has a leader other than the 
preferred leader. In a large cluster, the number of partitions can be quite 
large, all seen by the controller. Even if the time spent to check a single 
partition is small, the accumulation effect of so many partitions in the 
cluster can make the invocation to update metrics quite expensive. One might 
argue that maybe the logic for processing each single partition is not 
optimized, we checked the CPU percentage of leaf nodes in the profiling result, 
and found that inside the loops of collection objects, e.g. the set of all 
partitions, no single function dominates the processing. Hence the large number 
of the partitions in a cluster is the main contributor to the slowness of one 
invocation to update the metrics.
2. The invocation to update metrics is called many times when the is a high 
number of events to be processed by the controller, one invocation after 
processing any event.

This ticket is used to track how we change the logic for the 
OfflinePartitionCount metric.




> Speed up event processing on the controller 
> --------------------------------------------
>
>                 Key: KAFKA-6753
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6753
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Lucas Wang
>            Assignee: Lucas Wang
>            Priority: Minor
>             Fix For: 2.1.0
>
>         Attachments: Screen Shot 2018-04-04 at 7.08.55 PM.png
>
>
> The existing controller code updates metrics after processing every event. 
> This can slow down event processing on the controller tremendously. In one 
> profiling we see that updating metrics takes nearly 100% of the CPU for the 
> controller event processing thread. Specifically the slowness can be 
> attributed to two factors:
> 1. Each invocation to update the metrics is expensive. Specifically trying to 
> calculate the offline partitions count requires iterating through all the 
> partitions in the cluster to check if the partition is offline; and 
> calculating the preferred replica imbalance count requires iterating through 
> all the partitions in the cluster to check if a partition has a leader other 
> than the preferred leader. In a large cluster, the number of partitions can 
> be quite large, all seen by the controller. Even if the time spent to check a 
> single partition is small, the accumulation effect of so many partitions in 
> the cluster can make the invocation to update metrics quite expensive. One 
> might argue that maybe the logic for processing each single partition is not 
> optimized, we checked the CPU percentage of leaf nodes in the profiling 
> result, and found that inside the loops of collection objects, e.g. the set 
> of all partitions, no single function dominates the processing. Hence the 
> large number of the partitions in a cluster is the main contributor to the 
> slowness of one invocation to update the metrics.
> 2. The invocation to update metrics is called many times when the is a high 
> number of events to be processed by the controller, one invocation after 
> processing any event.
> This ticket is used to track how we change the logic for the 
> OfflinePartitionCount metric.
> And KAFKA-7350 will be used to track progress for the 
> PreferredReplicaImbalanceCount metric.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to