Chris Egerton created KAFKA-12252:
-------------------------------------

             Summary: Distributed herder tick thread loops rapidly when worker loses leadership
                 Key: KAFKA-12252
                 URL: https://issues.apache.org/jira/browse/KAFKA-12252
             Project: Kafka
          Issue Type: Bug
          Components: KafkaConnect
            Reporter: Chris Egerton
            Assignee: Chris Egerton


When a new session key is read from the config topic, if the worker is the 
leader, it [schedules a new key 
rotation|https://github.com/apache/kafka/blob/5cf9cfcaba67cffa2435b07ade58365449c60bd9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L1579-L1581].
 The time between key rotations is configurable (via the inter.worker.key.ttl.ms 
worker property) but defaults to an hour.

The herder then continues its tick loop, which usually ends with a long poll 
for rebalance activity. However, when a key rotation is scheduled, it will 
[limit the time spent 
polling|https://github.com/apache/kafka/blob/5cf9cfcaba67cffa2435b07ade58365449c60bd9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L384-L388]
 at the end of the tick loop in order to be able to perform the rotation.

Once woken up, the worker checks to see if a key rotation is necessary and, if 
so, [sets the expected key rotation time to 
Long.MAX_VALUE|https://github.com/apache/kafka/blob/bf4afae8f53471ab6403cbbfcd2c4e427bdd4568/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L344],
 then [writes a new session key to the config 
topic|https://github.com/apache/kafka/blob/bf4afae8f53471ab6403cbbfcd2c4e427bdd4568/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L345-L348].
 The catch is that [the worker only ever decides a key rotation is necessary if 
it is still the 
leader|https://github.com/apache/kafka/blob/5cf9cfcaba67cffa2435b07ade58365449c60bd9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L456-L469].
 If the worker is no longer the leader when the key rotation comes due (likely 
because it fell out of the cluster after losing contact with the group 
coordinator), its key expiration time is never reset. The long poll for 
rebalance activity at the end of the tick loop is then given a timeout of 0 ms, 
so the tick loop restarts immediately. Even if the worker reads a new session 
key from the config topic, it keeps looping like this, because a non-leader 
never updates its scheduled key rotation. At that point, the only thing that 
returns the worker to a healthy state is being made the leader of the cluster 
again.
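
The sequence above can be reproduced with a small stand-alone model. This is 
only a sketch of the behavior described in this ticket, not the actual 
DistributedHerder code: the class TickModel, its fields, and its method names 
are hypothetical, and the real tick loop does far more work per iteration.

```java
/** Simplified, hypothetical model of the herder's key-rotation bookkeeping. */
class TickModel {
    // Long.MAX_VALUE stands for "no key rotation scheduled".
    long keyExpiration = Long.MAX_VALUE;
    boolean leader;

    TickModel(boolean leader) {
        this.leader = leader;
    }

    /** Models reading a new session key from the config topic. */
    void onSessionKeyRead(long now, long rotationIntervalMs) {
        // Only the leader schedules the next rotation.
        if (leader) {
            keyExpiration = now + rotationIntervalMs;
        }
    }

    /** One pass of the tick loop; returns the timeout given to the rebalance poll. */
    long tick(long now) {
        // A rotation is only considered necessary while still the leader,
        // so this is the only place the expiration is ever cleared.
        if (leader && keyExpiration <= now) {
            keyExpiration = Long.MAX_VALUE;
            // The real herder would write a new session key here.
        }
        long timeout = Long.MAX_VALUE;
        if (keyExpiration != Long.MAX_VALUE) {
            // Cap the poll so the worker wakes up in time for the rotation.
            timeout = Math.max(0L, keyExpiration - now);
        }
        return timeout;
    }
}
```

In this model, a worker that schedules a rotation as leader and then loses 
leadership gets a 0 ms timeout from every tick() once the expiration time has 
passed, and onSessionKeyRead() does nothing to change that. The timeout only 
recovers once leader is set back to true and the next tick clears the stale 
expiration.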

One possible fix is to have the tick thread limit the time spent on rebalance 
polling only if the worker is currently the leader.
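
A minimal sketch of that check follows. The class LeaderGatedTimeout and the 
method pollTimeoutMs are hypothetical names chosen for illustration, and the 
sketch considers only the key rotation deadline; it is not the patch itself, 
and the real timeout calculation may have to account for other deadlines too.

```java
/** Hypothetical sketch of the proposed fix; not taken from the Kafka code base. */
class LeaderGatedTimeout {
    /**
     * Returns the timeout for the rebalance poll at the end of a tick.
     * Long.MAX_VALUE as keyExpiration stands for "no rotation scheduled".
     */
    static long pollTimeoutMs(boolean isLeader, long keyExpiration, long now) {
        long timeout = Long.MAX_VALUE;
        // Only the leader performs rotations, so only the leader needs to
        // wake up early for one. A stale expiration on a non-leader is ignored.
        if (isLeader && keyExpiration != Long.MAX_VALUE) {
            timeout = Math.max(0L, keyExpiration - now);
        }
        return timeout;
    }
}
```

With this gate, a worker that has lost leadership polls without a rotation cap 
even if its stored expiration time is in the past, which avoids the rapid loop. 
A leader with a pending rotation behaves as before.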



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
