[ https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884787#comment-16884787 ]
GEORGE LI commented on KAFKA-8638:
----------------------------------

Here is the KIP: [KIP-491|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982]

Preferred Leader Blacklist (deprioritized list)
-----------------------------------------------

                Key: KAFKA-8638
                URL: https://issues.apache.org/jira/browse/KAFKA-8638
            Project: Kafka
         Issue Type: Improvement
         Components: config, controller, core
   Affects Versions: 1.1.1, 2.3.0, 2.2.1
           Reporter: GEORGE LI
           Assignee: GEORGE LI
           Priority: Major

Currently, Kafka's preferred leader election picks the leader from a topic/partition's replica assignment in priority order, as long as that broker is in the ISR. The preferred leader is the broker id in the first position of the replica list. There are use cases where, even though the first broker in the replica assignment is in the ISR, it should be moved to the end of the ordering (lowest priority) when deciding leadership during preferred leader election.

Let's use a topic/partition with replica assignment (1, 2, 3) as an example; 1 is the preferred leader. When preferred leader election runs, it picks 1 as the leader if 1 is online and in the ISR; if not, it picks 2; if 2 is not in the ISR, it picks 3. There are use cases where, even though 1 is in the ISR, we would like it to be moved to the end of the ordering (lowest priority) when deciding leadership during preferred leader election (a small sketch of this selection rule follows the list of use cases below). Below is a list of use cases:

* If broker_id 1 is a swapped failed host brought up with only the last segments or the latest offset and without historical data (there is another effort on this), it's better for it not to serve leadership until it has caught up.
* A cross-data-center cluster has AWS instances with less computing power than the on-prem bare-metal machines. We could put the AWS broker_ids in the preferred leader blacklist, so the on-prem brokers can be elected leaders, without changing the ordering of the replica assignments.
* If broker_id 1 keeps losing leadership after some time ("flapping"), we would want to exclude 1 from being leader unless all other brokers of this topic/partition are offline. The flapping effect was seen in the past when 2 or more brokers were bad: as they lost leadership constantly/quickly, the sets of partition replicas they belong to saw leadership changing constantly. The ultimate solution is to swap out these bad hosts, but for quick mitigation we can also put them in the preferred leader blacklist to move their priority for being elected leader to the lowest.
* If the controller is busy serving an extra load of metadata requests and other tasks, we would like to move the controller's leaderships to other brokers to lower its CPU load. Currently, bouncing the broker to shed leadership does not work for the controller, because after the bounce the controller fails over to another broker.
* Avoiding a broker bounce just to make it lose its leadership: it would be good to have a way to specify which broker should be excluded from serving traffic/leadership (without changing the replica assignment ordering via reassignments, even though that's quick) and then run preferred leader election. Bouncing a broker causes temporary URPs and sometimes other issues. Also, a bounced broker (e.g. broker_id 1) temporarily loses all its leadership, but if another broker (e.g. broker_id 2) fails or gets bounced, some of its leaderships will likely fail over to broker_id 1 on partitions with 3 replicas. If broker_id 1 is in the blacklist, then in such a scenario, even with broker_id 2 offline, the third broker can take leadership.
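To make the ordering rule above concrete, here is a minimal, self-contained sketch of the selection logic being discussed. It is not the actual controller code, and the class, method, and blacklist names are hypothetical; it only illustrates how a deprioritized list would push a broker to the back of the preferred-leader ordering while still allowing it to lead as a last resort.

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class PreferredLeaderSketch {

    /**
     * Pick the leader as described above: walk the replica assignment in order
     * and take the first broker that is in the ISR.  The only proposed change is
     * the sort step, which pushes blacklisted brokers to the end of the ordering.
     */
    static Optional<Integer> electPreferredLeader(List<Integer> assignment,
                                                  Set<Integer> isr,
                                                  Set<Integer> blacklist) {
        List<Integer> order = new ArrayList<>(assignment);
        // Stable sort: assignment order is preserved, blacklisted brokers go last.
        order.sort(Comparator.comparing(blacklist::contains));
        for (int broker : order) {
            if (isr.contains(broker)) {
                return Optional.of(broker);
            }
        }
        return Optional.empty();   // no eligible broker; partition stays leaderless
    }

    public static void main(String[] args) {
        List<Integer> assignment = Arrays.asList(1, 2, 3);
        Set<Integer> blacklist = new HashSet<>(Arrays.asList(1));

        // All replicas in ISR: broker 2 is chosen because 1 is deprioritized.
        System.out.println(electPreferredLeader(assignment,
                new HashSet<>(Arrays.asList(1, 2, 3)), blacklist));   // Optional[2]

        // Only broker 1 left in ISR: it still becomes leader as the last resort.
        System.out.println(electPreferredLeader(assignment,
                new HashSet<>(Arrays.asList(1)), blacklist));         // Optional[1]
    }
}
{code}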
The current work-around for the above is to change the topic/partition's replica assignment to move broker_id 1 from the first position to the last position, e.g. (1, 2, 3) => (2, 3, 1), and then run preferred leader election. This changes the replica assignment, and we need to keep track of the original one and restore it when things change (e.g. the controller fails over to another broker, or the swapped empty broker has caught up). That's a rather tedious task.
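For completeness, a minimal sketch of the second half of that work-around, i.e. triggering preferred leader election once the reassignment (1, 2, 3) => (2, 3, 1) has completed. It assumes the Java AdminClient's electPreferredLeaders call from the 2.2/2.3 line affected here; the topic name, partition, and bootstrap address are placeholders, and the replica reordering itself would still be submitted separately (e.g. with the kafka-reassign-partitions tool).

{code:java}
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.TopicPartition;

public class TriggerPreferredElection {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        // Partition whose replica list was already reordered to (2, 3, 1).
        TopicPartition tp = new TopicPartition("my-topic", 0); // placeholder topic/partition

        try (AdminClient admin = AdminClient.create(props)) {
            // Ask the controller to move leadership to the (new) first replica,
            // i.e. broker 2 in the example above.
            admin.electPreferredLeaders(Collections.singleton(tp))
                 .partitionResult(tp)
                 .get();
        }
    }
}
{code}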