Chris Egerton created KAFKA-17155:
-------------------------------------

             Summary: Redundant rebalances triggered after connector creation/deletion and task config updates
                 Key: KAFKA-17155
                 URL: https://issues.apache.org/jira/browse/KAFKA-17155
             Project: Kafka
          Issue Type: Bug
          Components: connect
    Affects Versions: 3.8.0, 3.9.0
            Reporter: Chris Egerton

In KAFKA-17105, a scenario is described where a connector may be unnecessarily restarted soon after it has been created. Similarly, when any event occurs that sets the [DistributedHerder.needsReconfigRebalance flag|https://github.com/apache/kafka/blob/a66a59f427b30611175fd029d86832d00aa5aabd/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L215] to {{true}} (at the time of writing, these are the detection of a new connector, the removal of an existing connector, or the detection of new task configurations, regardless of whether configurations already existed for the connector), it is possible that a rebalance has already started because another worker has detected the same change. In that case, {{needsReconfigRebalance}} will still be set to {{true}} even after that rebalance has taken place, and the worker will force an unnecessary second rebalance.

We might consider changing the "needs reconfig rebalance" field into a "reconfig rebalance threshold" field, which contains the latest offset of a record consumed from the config topic that warrants a rebalance. When deciding whether to perform a rebalance based on this field, the worker can check whether the offset in the assignment given out by the leader during the most recent rebalance is greater than or equal to this threshold, and if so, choose not to force a rebalance.

This has caused issues in some tests, but may be a benign race condition that does not have practical consequences in the real world. We may not want to address this (especially with an approach that increases the complexity of the code base and comes with risk of regression) until/unless someone states that it has affected them outside of Kafka Connect unit tests.
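
A minimal sketch of the "reconfig rebalance threshold" idea described above, to make the comparison concrete. This is not code from {{DistributedHerder}}: the class name {{ReconfigRebalanceThreshold}}, its methods, and the {{-1}} sentinel are all hypothetical, and the sketch assumes that the caller can supply the config topic offset carried by the most recent assignment.

```java
/**
 * Hypothetical stand-in for a boolean "needs reconfig rebalance" flag.
 * Tracks the highest config topic offset that warrants a rebalance instead
 * of only remembering that some change happened.
 */
final class ReconfigRebalanceThreshold {
    // Assumed sentinel: no config change is currently waiting on a rebalance.
    private static final long NONE = -1L;

    // Latest config topic offset seen so far that warrants a rebalance.
    private long thresholdOffset = NONE;

    /** Record that the config topic record at this offset warrants a rebalance. */
    synchronized void recordChange(long configOffset) {
        thresholdOffset = Math.max(thresholdOffset, configOffset);
    }

    /**
     * Decide whether to force a rebalance.
     *
     * @param assignmentOffset config topic offset included in the assignment
     *                         from the most recent rebalance
     */
    synchronized boolean shouldForceRebalance(long assignmentOffset) {
        if (thresholdOffset == NONE) {
            return false; // nothing pending
        }
        if (assignmentOffset >= thresholdOffset) {
            // The most recent rebalance already accounted for the change
            // (for example, another worker triggered it), so clear the
            // threshold and skip the redundant rebalance.
            thresholdOffset = NONE;
            return false;
        }
        // The last assignment predates the change; a rebalance is still needed.
        return true;
    }
}
```

For example, if a worker records a change at offset 42 and the latest assignment already carries offset 42 or higher, {{shouldForceRebalance}} returns {{false}} and the pending state is cleared. If the assignment carries a lower offset, it returns {{true}} and the threshold stays in place until a later assignment catches up.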