[jira] [Commented] (KAFKA-15059) Exactly-once source tasks fail to start during pending rebalances

Chris Egerton (Jira) Tue, 06 Jun 2023 05:05:06 -0700


    [ https://issues.apache.org/jira/browse/KAFKA-15059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729709#comment-17729709 ]


Chris Egerton commented on KAFKA-15059:
---------------------------------------

On second thought, it may be unnecessary to check for a pending rebalance at 
all.

If the worker that we forward the zombie fencing request to is a zombie leader 
(i.e., a worker that believes it is the leader but in reality is not), it will 
fail to finish the round of zombie fencing because it won't be able to write to 
the config topic with a transactional producer.

If the connector has just been deleted, we'll still fail the request since we 
force a read-to-end of the config topic and refresh our snapshot of its 
contents before checking to see if the connector exists.

And regardless, the worker that owns the task will still do a read-to-end of 
the config topic and verify that (1) no new task configs have been generated 
for the connector and (2) the worker is still assigned the connector, before 
allowing the task to process any data.
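
As a rough illustration of that last safeguard, the two checks performed after the read-to-end could be modeled as below. This is a minimal sketch, not the actual Connect runtime code: `ConfigSnapshot`, `TaskPreflight`, `taskConfigGeneration`, and `assignedConnectors` are hypothetical names invented for the example, and the "generation" counter is an assumed stand-in for however the worker detects that new task configs were written.

```java
import java.util.Set;

// Hypothetical, simplified view of what a worker knows after a read-to-end of
// the config topic. These are NOT the real Kafka Connect classes.
final class ConfigSnapshot {
    // Assumption: a counter bumped each time new task configs are generated for the connector.
    final int taskConfigGeneration;
    // Assumption: the connectors currently assigned to this worker.
    final Set<String> assignedConnectors;

    ConfigSnapshot(int taskConfigGeneration, Set<String> assignedConnectors) {
        this.taskConfigGeneration = taskConfigGeneration;
        this.assignedConnectors = assignedConnectors;
    }
}

final class TaskPreflight {
    /**
     * Returns true only if a task created against generationAtTaskStart may
     * begin processing data, given a freshly refreshed snapshot.
     */
    static boolean mayProcessData(String connector, int generationAtTaskStart, ConfigSnapshot fresh) {
        // (1) no new task configs have been generated for the connector
        boolean noNewTaskConfigs = fresh.taskConfigGeneration == generationAtTaskStart;
        // (2) the worker is still assigned the connector
        boolean stillAssigned = fresh.assignedConnectors.contains(connector);
        return noNewTaskConfigs && stillAssigned;
    }
}
```

Because both conditions are evaluated against a snapshot taken after the read-to-end, a stale worker cannot pass the check on the basis of outdated state, which is why the separate pending-rebalance check looks redundant.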

> Exactly-once source tasks fail to start during pending rebalances
> -----------------------------------------------------------------
>
>                 Key: KAFKA-15059
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15059
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect, mirrormaker
>    Affects Versions: 3.3.0, 3.4.0, 3.3.1, 3.3.2, 3.5.0, 3.4.1
>            Reporter: Chris Egerton
>            Assignee: Chris Egerton
>            Priority: Major
>
> When asked to perform a round of zombie fencing, the distributed herder will 
> [reject the 
> request|https://github.com/apache/kafka/blob/17fd30e6b457f097f6a524b516eca1a6a74a9144/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L1249-L1250]
>  if a rebalance is pending, which can happen if (among other things) a config 
> for a new connector or a new set of task configs has been recently read from 
> the config topic.
> Normally this can be alleviated with a simple task restart, which isn't great 
> but isn't terrible.
> However, when running MirrorMaker 2 in dedicated mode, there is no API to 
> restart failed tasks, and it can be more common to see this kind of failure 
> on a fresh cluster because three connector configurations are written in 
> rapid succession to the config topic.
>  
> In order to provide a better experience for users of both vanilla Kafka 
> Connect and dedicated MirrorMaker 2 clusters, we can retry (likely with the 
> same exponential backoff introduced with KAFKA-14732) zombie fencing attempts 
> that fail due to a pending rebalance.
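
The retry proposed in the description above could look roughly like the following. This is only a sketch under stated assumptions: `FencingRetry`, `backoffMs`, and `retryFencing` are hypothetical names, the doubling-with-cap schedule is a generic exponential backoff rather than the specific one from KAFKA-14732, and the `BooleanSupplier` stands in for a fencing attempt that returns false when it is rejected because of a pending rebalance.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper; not part of Kafka Connect.
final class FencingRetry {

    // Delay before the next attempt: initialMs doubled once per prior attempt,
    // capped at maxMs. Assumes 0 < initialMs <= maxMs with modest values.
    static long backoffMs(int attempt, long initialMs, long maxMs) {
        long delay = initialMs;
        for (int i = 0; i < attempt && delay < maxMs; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMs);
    }

    // Runs attemptFencing up to maxAttempts times, sleeping between failures.
    // Returns true as soon as one attempt succeeds, false otherwise.
    static boolean retryFencing(BooleanSupplier attemptFencing, int maxAttempts,
                                long initialMs, long maxMs) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (attemptFencing.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts - 1) {
                try {
                    Thread.sleep(backoffMs(attempt, initialMs, maxMs));
                } catch (InterruptedException e) {
                    // Preserve the interrupt and give up rather than keep retrying.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

Capping both the delay and the number of attempts keeps a task from waiting indefinitely if the rebalance never completes; the task would still fail in that case, as it does today.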



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
