Chris Egerton created KAFKA-17155:
-------------------------------------

             Summary: Redundant rebalances triggered after connector 
creation/deletion and task config updates
                 Key: KAFKA-17155
                 URL: https://issues.apache.org/jira/browse/KAFKA-17155
             Project: Kafka
          Issue Type: Bug
          Components: connect
    Affects Versions: 3.8.0, 3.9.0
            Reporter: Chris Egerton


KAFKA-17105 describes a scenario in which a connector may be 
unnecessarily restarted soon after it has been created.

Similarly, several events set the 
[DistributedHerder.needsReconfigRebalance 
flag|https://github.com/apache/kafka/blob/a66a59f427b30611175fd029d86832d00aa5aabd/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L215]
 to true. At the time of writing, these are the detection of a new connector, 
the removal of an existing connector, and the detection of new task 
configurations (regardless of whether task configurations already existed for 
the connector). When one of these events occurs, a rebalance may already have 
started because another worker detected the same change. In that case, 
{{needsReconfigRebalance}} will still be set to {{true}} after that 
rebalance has completed, and the worker will force an unnecessary second 
rebalance.

We might consider changing the "needs reconfig rebalance" field into a 
"reconfig rebalance threshold" field, which contains the latest offset of a 
record consumed from the config topic that warrants a rebalance. When possibly 
performing rebalances based on this field, the worker can check if the offset 
in the assignment given out by the leader during the most recent rebalance is 
greater than or equal to this threshold, and if so, choose not to force a 
rebalance.
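
As a rough illustration of the threshold idea above, here is a minimal sketch. The class and method names ({{ReconfigRebalanceThreshold}}, {{onRebalanceWorthyRecord}}, {{shouldForceRebalance}}) are hypothetical and do not exist in Kafka Connect; the sketch also assumes that config topic offsets are non-negative, so -1 can serve as a "nothing pending" sentinel. It is not a proposed patch to DistributedHerder.

```java
/**
 * Hypothetical sketch of a "reconfig rebalance threshold" that could stand in
 * for a boolean "needs reconfig rebalance" flag. Not actual Kafka Connect code.
 */
final class ReconfigRebalanceThreshold {
    // Assumption: config topic offsets are >= 0, so -1 means "no rebalance pending".
    private static final long NONE = -1L;

    // Latest config topic offset of a record that warrants a rebalance.
    private long threshold = NONE;

    /** Called when a consumed config topic record warrants a rebalance. */
    synchronized void onRebalanceWorthyRecord(long configOffset) {
        threshold = Math.max(threshold, configOffset);
    }

    /**
     * Decides whether the worker should force a rebalance.
     *
     * @param assignmentOffset the config offset in the assignment given out by
     *                         the leader during the most recent rebalance
     */
    synchronized boolean shouldForceRebalance(long assignmentOffset) {
        if (threshold == NONE) {
            return false; // nothing pending
        }
        if (assignmentOffset >= threshold) {
            // The most recent rebalance already covered the change; clear the threshold.
            threshold = NONE;
            return false;
        }
        return true; // the last assignment predates the change
    }
}
```

With this shape, a worker that records offset 42 and then receives an assignment at offset 42 or later would skip the forced rebalance, while an assignment at offset 41 would still trigger one.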

 

This has caused issues in some tests, but may be a benign race condition 
that has no practical consequences in the real world. We may not want to 
address this (especially with an approach that increases the complexity of the 
code base and carries a risk of regression) until/unless someone reports that 
it has affected them outside of Kafka Connect unit tests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
