[jira] [Updated] (KAFKA-17155) Redundant rebalances triggered after connector creation/deletion and task config updates

Chris Egerton (Jira) Wed, 17 Jul 2024 10:10:03 -0700


     [ https://issues.apache.org/jira/browse/KAFKA-17155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Chris Egerton updated KAFKA-17155:
----------------------------------
    Description: 
With KAFKA-17105, a scenario is described where a connector may be 
unnecessarily restarted soon after it has been created.

Similarly, when any events occur that set the 
[DistributedHerder.needsReconfigRebalance 
flag|https://github.com/apache/kafka/blob/a66a59f427b30611175fd029d86832d00aa5aabd/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L215]
 to true (at the time of writing these are the detection of a new connector, 
the removal of an existing connector, or the detection of new task 
configurations regardless of whether existing configurations existed for the 
connector), it is possible that a rebalance has already started because another 
worker has detected this change as well. In that case, 
{{needsReconfigRebalance}} will still be set to {{true}} even after that 
rebalance has taken place, and the worker will force an unnecessary second 
rebalance.

We might consider changing the "needs reconfig rebalance" field into a 
"reconfig rebalance threshold" field, which contains the latest offset of a 
record consumed from the config topic that warrants a rebalance. When possibly 
performing rebalances based on this field, the worker can check if the offset 
in the assignment given out by the leader during the most recent rebalance is 
greater than or equal to this threshold, and if so, choose not to force a 
rebalance.
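
As a rough illustration of the threshold idea above, here is a minimal Java sketch. It is not code from DistributedHerder: the class name, method names, and the {{NONE}} sentinel are all hypothetical, and it assumes config topic offsets are non-negative and that the worker can read the config offset carried by its most recent assignment.

```java
// Hypothetical sketch only: none of these names exist in DistributedHerder.
final class ReconfigRebalanceThreshold {
    // Sentinel meaning "no config topic record seen so far warrants a rebalance".
    // Assumes real config topic offsets are never negative.
    static final long NONE = -1L;

    // Latest offset of a config topic record that warrants a rebalance.
    private long thresholdOffset = NONE;

    // Called when the worker reads a config topic record that warrants a
    // rebalance (new connector, removed connector, or new task configs).
    void recordReadWarrantingRebalance(long configOffset) {
        thresholdOffset = Math.max(thresholdOffset, configOffset);
    }

    // Returns true only if the assignment from the most recent rebalance was
    // computed from a config offset older than the threshold. If the leader
    // had already read up to (or past) the threshold, another worker's
    // rebalance has covered this change and a second one would be redundant.
    boolean shouldForceRebalance(long assignmentConfigOffset) {
        return assignmentConfigOffset < thresholdOffset;
    }
}
```

With a plain boolean flag, a worker that sees a change at offset 42 would force a rebalance even when its latest assignment was already computed at offset 42. In this sketch, {{shouldForceRebalance(42)}} returns false in that case and true only for an assignment computed at an earlier offset.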

 

This has caused issues in some tests, but may be a benign race condition that 
does not have practical consequences in the real world. We may not want to 
address this (especially with an approach that increases the complexity of the 
code base and comes with risk of regression) until/unless someone states that 
it's affected them outside of Kafka Connect unit tests.

  was:
With KAFKA-17105, a scenario is described where a connector may be 
unnecessarily restarted soon after it has been created.

Similarly, when any events occur that set the 
[DistributedHerder.needsReconfigRebalance 
flag|https://github.com/apache/kafka/blob/a66a59f427b30611175fd029d86832d00aa5aabd/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L215]
 to true (at the time of writing these are the detection of a new connector, 
the removal of an existing connector, or the detection of new task 
configurations regardless of whether existing configurations existed for the 
connector), it is possible that a rebalance has already started because another 
worker has detected this change as well. In that case, 
{{needsReconfigRebalance}} will still be set to {{true}} even after that 
rebalance has taken place, and the worker will force an unnecessary second 
rebalance.

We might consider changing the "needs reconfig rebalance" field into a 
"reconfig rebalance threshold" field, which contains the latest offset of a 
record consumed from the config topic that warrants a rebalance. When possibly 
performing rebalances based on this field, the worker can check if the offset 
in the assignment given out by the leader during the most recent rebalance is 
greater than or equal to this threshold, and if so, choose not to force a 
rebalance.

 

This has been caused issues in some tests, but may be a benign race condition 
that does not have practical consequences in the real world. We may not want to 
address this (especially with an approach that increases the complexity of the 
code base and comes with risk of regression) until/unless someone states that 
it's affected them outside of Kafka Connect unit tests.


> Redundant rebalances triggered after connector creation/deletion and task 
> config updates
> ----------------------------------------------------------------------------------------
>
>                 Key: KAFKA-17155
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17155
>             Project: Kafka
>          Issue Type: Bug
>          Components: connect
>    Affects Versions: 3.8.0, 3.9.0
>            Reporter: Chris Egerton
>            Priority: Minor
>
> With KAFKA-17105, a scenario is described where a connector may be 
> unnecessarily restarted soon after it has been created.
> Similarly, when any events occur that set the 
> [DistributedHerder.needsReconfigRebalance 
> flag|https://github.com/apache/kafka/blob/a66a59f427b30611175fd029d86832d00aa5aabd/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L215]
>  to true (at the time of writing these are the detection of a new connector, 
> the removal of an existing connector, or the detection of new task 
> configurations regardless of whether existing configurations existed for the 
> connector), it is possible that a rebalance has already started because 
> another worker has detected this change as well. In that case, 
> {{needsReconfigRebalance}} will still be set to {{true}} even after that 
> rebalance has taken place, and the worker will force an unnecessary second 
> rebalance.
> We might consider changing the "needs reconfig rebalance" field into a 
> "reconfig rebalance threshold" field, which contains the latest offset of a 
> record consumed from the config topic that warrants a rebalance. When 
> possibly performing rebalances based on this field, the worker can check if 
> the offset in the assignment given out by the leader during the most recent 
> rebalance is greater than or equal to this threshold, and if so, choose not 
> to force a rebalance.
>  
> This has caused issues in some tests, but may be a benign race condition that 
> does not have practical consequences in the real world. We may not want to 
> address this (especially with an approach that increases the complexity of 
> the code base and comes with risk of regression) until/unless someone states 
> that it's affected them outside of Kafka Connect unit tests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
