[ 
https://issues.apache.org/jira/browse/KAFKA-12252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Randall Hauch resolved KAFKA-12252.
-----------------------------------
    Fix Version/s: 2.8.1
                   3.0.0
       Resolution: Fixed

I'm still working on backporting this to the 2.7 and 2.6 branches. When I'm 
able to do that, I'll update the fix versions on this issue.

> Distributed herder tick thread loops rapidly when worker loses leadership
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-12252
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12252
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>            Reporter: Chris Egerton
>            Assignee: Chris Egerton
>            Priority: Major
>             Fix For: 3.0.0, 2.8.1
>
>
> When a new session key is read from the config topic, if the worker is the 
> leader, it [schedules a new key 
> rotation|https://github.com/apache/kafka/blob/5cf9cfcaba67cffa2435b07ade58365449c60bd9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L1579-L1581].
>  The time between key rotations is configurable but defaults to an hour.
> The herder then continues its tick loop, which usually ends with a long poll 
> for rebalance activity. However, when a key rotation is scheduled, it will 
> [limit the time spent 
> polling|https://github.com/apache/kafka/blob/5cf9cfcaba67cffa2435b07ade58365449c60bd9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L384-L388]
>  at the end of the tick loop in order to be able to perform the rotation.
> Once woken up, the worker checks to see if a key rotation is necessary and, 
> if so, [sets the expected key rotation time to 
> Long.MAX_VALUE|https://github.com/apache/kafka/blob/bf4afae8f53471ab6403cbbfcd2c4e427bdd4568/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L344],
>  then [writes a new session key to the config 
> topic|https://github.com/apache/kafka/blob/bf4afae8f53471ab6403cbbfcd2c4e427bdd4568/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L345-L348].
>  The problem is, [the worker only ever decides a key rotation is necessary if 
> it is still the 
> leader|https://github.com/apache/kafka/blob/5cf9cfcaba67cffa2435b07ade58365449c60bd9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L456-L469].
>  If the worker is no longer the leader at the time of the key rotation 
> (likely due to falling out of the cluster after losing contact with the group 
> coordinator), its key expiration time won’t be reset, and the long poll for 
> rebalance activity at the end of the tick loop will be given a timeout of 0 
> ms and result in the tick loop being immediately restarted. Even if the 
> worker reads a new session key from the config topic, it’ll continue looping 
> like this since its scheduled key rotation won’t be updated. At this point, 
> the worker can only get back into a healthy state if it is made the leader 
> of the cluster again.
> One possible fix could be to add a conditional check in the tick thread to 
> only limit the time spent on rebalance polling if the worker is currently the 
> leader.
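
The failure mode and the proposed leader check can be sketched roughly as
follows. This is an illustrative Java sketch only, not the DistributedHerder
code: the class name, method names, and parameters are hypothetical, and it
does not necessarily reflect the change that was eventually merged.

```java
public class TickTimeoutSketch {

    // Behavior as described in the report: the poll timeout is capped by the
    // scheduled key rotation whether or not this worker is still the leader.
    // (Hypothetical helper; parameter names are assumptions.)
    static long pollTimeoutOriginal(long nowMs, long keyExpirationMs, long defaultTimeoutMs) {
        if (keyExpirationMs < Long.MAX_VALUE) {
            // Once the rotation time has passed, this evaluates to 0.
            return Math.max(Math.min(defaultTimeoutMs, keyExpirationMs - nowMs), 0L);
        }
        return defaultTimeoutMs;
    }

    // Proposed fix from the report: only limit the poll time when the worker
    // is currently the leader, since only the leader performs the rotation.
    static long pollTimeoutLeaderOnly(boolean isLeader, long nowMs,
                                      long keyExpirationMs, long defaultTimeoutMs) {
        if (isLeader && keyExpirationMs < Long.MAX_VALUE) {
            return Math.max(Math.min(defaultTimeoutMs, keyExpirationMs - nowMs), 0L);
        }
        return defaultTimeoutMs;
    }

    public static void main(String[] args) {
        long nowMs = 10_000L;
        long keyExpirationMs = 5_000L;   // rotation time already in the past
        long defaultTimeoutMs = 60_000L; // stand-in for the usual long poll

        // A worker that lost leadership never resets keyExpirationMs,
        // so every tick polls with a timeout of 0 ms.
        System.out.println(pollTimeoutOriginal(nowMs, keyExpirationMs, defaultTimeoutMs));

        // With the leader check, the same worker falls back to the long poll.
        System.out.println(pollTimeoutLeaderOnly(false, nowMs, keyExpirationMs, defaultTimeoutMs));
    }
}
```

For a non-leader whose rotation time has passed, the first method returns 0
on every tick, which is the tight loop described above. The second method
returns the normal timeout in that case and still caps the poll when the
worker is the leader.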



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
