[ 
https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

GEORGE LI updated KAFKA-8638:
-----------------------------
    Description: 
Currently, Kafka preferred leader election walks the topic/partition replica 
assignment in priority order and picks the first broker_id that is in ISR. The 
preferred leader is the broker id in the first position of the replica list. 
There are use cases where, even though the first broker in the replica 
assignment is in ISR, it needs to be moved to the end of the ordering (lowest 
priority) when deciding leadership during preferred leader election. 

Let’s use topic/partition replica (1,2,3) as an example. 1 is the preferred 
leader.  When preferred leader election is run, it will pick 1 as the leader if 
it's in ISR; if 1 is not online and in ISR, it picks 2; if 2 is not in ISR, it 
picks 3 as the leader. There are use cases where, even though 1 is in ISR, we 
would like it to be moved to the end of the ordering (lowest priority) when 
deciding leadership during preferred leader election.  Below is a list of use 
cases:

* If broker_id 1 is a swapped failed host that was brought up with only the 
last segments or latest offset, without historical data (there is another 
effort on this), it's better for it not to serve leadership until it has 
caught up.

* A cross-data-center cluster has AWS instances which have less computing 
power than the on-prem bare metal machines.  We could put the AWS broker_ids in 
the Preferred Leader Blacklist, so on-prem brokers can be elected leaders, 
without changing the ordering of the replica assignment. 

* If broker_id 1 is constantly losing leadership after some time 
("flapping"), we would want to exclude 1 from being a leader unless all other 
brokers of this topic/partition are offline.  The “flapping” effect was seen in 
the past when 2 or more brokers were bad: when they lost leadership 
constantly/quickly, the sets of partition replicas they belonged to saw 
leadership constantly changing.  The ultimate solution is to swap these bad 
hosts.  But for quick mitigation, we can also put the bad hosts in the 
Preferred Leader Blacklist to give them the lowest priority for being elected 
leader. 

* If the controller is busy serving an extra load of metadata requests and 
other tasks, we would like to move the controller's leaderships to other 
brokers to lower its CPU load. Currently, bouncing the broker to lose 
leadership would not work for the controller, because after the bounce the 
controller fails over to another broker.

* Avoid bouncing a broker in order to lose its leadership: it would be good if 
we had a way to specify which broker should be excluded from serving 
traffic/leadership (without changing the replica assignment ordering by 
reassignments, even though that's quick), and then run preferred leader 
election.  A bouncing broker will cause temporary URP (under-replicated 
partitions), and sometimes other issues.  Also, bouncing a broker (e.g. 
broker_id 1) makes it temporarily lose all its leadership, but if another 
broker (e.g. broker_id 2) then fails or gets bounced, some of its leaderships 
will likely fail over to broker_id 1 on a partition with 3 replicas.  If 
broker_id 1 is in the blacklist, then in such a scenario, even with broker_id 2 
offline, the 3rd broker can take leadership. 
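
To make the intended ordering concrete, here is a minimal Python sketch of the 
selection rule described above. It is only an illustrative model, not the 
controller's actual code: the function name elect_preferred_leader and the 
deprioritized parameter are hypothetical, and ISR membership is passed in as a 
plain set.

```python
from typing import Iterable, Optional, Sequence, Set


def elect_preferred_leader(
    replicas: Sequence[int],
    isr: Set[int],
    deprioritized: Iterable[int] = (),
) -> Optional[int]:
    """Return the first broker in ISR, trying deprioritized brokers last.

    Hypothetical model of the proposal; names are assumptions.
    """
    deprioritized = set(deprioritized)
    # Stable partition: assignment order is preserved within each group,
    # and the stored replica assignment itself is never modified.
    ordering = [b for b in replicas if b not in deprioritized]
    ordering += [b for b in replicas if b in deprioritized]
    for broker in ordering:
        if broker in isr:
            return broker
    return None  # no replica of this partition is in ISR


# Current behaviour: broker 1 wins whenever it is in ISR.
print(elect_preferred_leader([1, 2, 3], isr={1, 2, 3}))                     # 1
# With broker 1 deprioritized, broker 2 is picked instead.
print(elect_preferred_leader([1, 2, 3], isr={1, 2, 3}, deprioritized=[1]))  # 2
# Broker 2 offline as well: the 3rd broker takes leadership.
print(elect_preferred_leader([1, 2, 3], isr={1, 3}, deprioritized=[1]))     # 3
# Broker 1 is still eligible when every other replica is unavailable.
print(elect_preferred_leader([1, 2, 3], isr={1}, deprioritized=[1]))        # 1
```

Note that a deprioritized broker is moved to the lowest priority rather than 
excluded, which matches the "unless all other brokers are offline" requirement 
in the flapping use case.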


The current workaround for the above is to change the topic/partition's replica 
assignment to move broker_id 1 from the first position to the last position 
and run preferred leader election, e.g. (1, 2, 3) => (2, 3, 1). This changes 
the replica assignment, and we need to keep track of the original one and 
restore it if things change (e.g. the controller fails over to another broker, 
or the swapped empty broker has caught up). That’s a rather tedious task.
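
As a rough illustration of why this workaround is tedious, the Python sketch 
below computes the rotated assignments and emits them in the version-1 JSON 
layout accepted by the kafka-reassign-partitions tool. The helper names 
(demote_broker, to_reassignment_json) and the topic name "events" are 
made up for the example; the original assignment has to be saved separately so 
it can be restored later.

```python
import json
from typing import Dict, List, Tuple

# (topic, partition) -> ordered replica list
Assignment = Dict[Tuple[str, int], List[int]]


def demote_broker(assignment: Assignment, broker_id: int) -> Assignment:
    """Return only the partitions whose replica list must change so that
    broker_id sits in the last position, e.g. (1, 2, 3) => (2, 3, 1)."""
    changed: Assignment = {}
    for tp, replicas in assignment.items():
        if broker_id in replicas and replicas[-1] != broker_id:
            changed[tp] = [b for b in replicas if b != broker_id] + [broker_id]
    return changed


def to_reassignment_json(assignment: Assignment) -> str:
    """Serialize an assignment in the version-1 reassignment JSON layout."""
    partitions = [
        {"topic": topic, "partition": partition, "replicas": replicas}
        for (topic, partition), replicas in sorted(assignment.items())
    ]
    return json.dumps({"version": 1, "partitions": partitions})


# Hypothetical cluster state; must be kept around to undo the change later.
original: Assignment = {
    ("events", 0): [1, 2, 3],
    ("events", 1): [2, 3, 1],
    ("events", 2): [2, 1, 3],
}
changed = demote_broker(original, broker_id=1)
print(to_reassignment_json(changed))
# Restoring means re-applying the original lists for the same partitions.
restore = {tp: original[tp] for tp in changed}
print(to_reassignment_json(restore))
```

Every affected partition needs two reassignments (apply and restore) plus 
bookkeeping of the original ordering, which is the overhead a Preferred Leader 
Blacklist would avoid.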
 

> Preferred Leader Blacklist (deprioritized list)
> -----------------------------------------------
>
>                 Key: KAFKA-8638
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8638
>             Project: Kafka
>          Issue Type: Improvement
>          Components: config, controller, core
>    Affects Versions: 1.1.1, 2.3.0, 2.2.1
>            Reporter: GEORGE LI
>            Assignee: GEORGE LI
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
