[ https://issues.apache.org/jira/browse/KAFKA-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854113#comment-17854113 ]
Chris Egerton commented on KAFKA-16931:
---------------------------------------

Spoke too soon, have one more thought!

[~ecomar] it sounds like you've been using MM2 with exactly-once support enabled. In dedicated mode there's currently no way to restart failed tasks without restarting an entire process, which can be a PITA. Maybe we could add more permissive retry logic for dedicated MM2 clusters in this case, and keep the retry logic for vanilla Kafka Connect clusters either as-is, or at least more conservative?

> Transient REST failures to forward fenceZombie requests leave Connect Tasks in FAILED state
> -------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-16931
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16931
>             Project: Kafka
>          Issue Type: Bug
>          Components: connect
>            Reporter: Edoardo Comar
>            Priority: Major
>
> When Kafka Connect runs in exactly_once mode, a task restart will fence possible zombie tasks.
> This is achieved by forwarding the request to the leader worker using the REST protocol.
> At scale, in distributed mode, an HTTPS request may occasionally fail because of a networking glitch, reconfiguration, etc.
> Currently there is no attempt to retry the REST request; the task is left in a FAILED state and requires an external restart (with the REST API).
> Would this issue require a small KIP to introduce configuration entries to limit the number of retries, backoff times, etc.?
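
To make the retry question in the description concrete, here is a minimal sketch of a bounded retry with capped exponential backoff around a generic call. It is an illustration only, not Kafka Connect code: the class RetrySketch, the method callWithRetries, and its parameters (maxAttempts, initialBackoffMs, maxBackoffMs) are hypothetical names standing in for whatever configuration entries a KIP might define. The sketch retries on any exception, whereas a real change would presumably retry only on failures known to be transient.

```java
import java.util.concurrent.Callable;

public final class RetrySketch {
    private RetrySketch() {}

    /**
     * Calls the action up to maxAttempts times (hypothetical setting).
     * After each failed attempt except the last, sleeps for the current
     * backoff, then doubles it, capped at maxBackoffMs.
     * Rethrows the last failure once all attempts are used up.
     */
    public static <T> T callWithRetries(Callable<T> action,
                                        int maxAttempts,
                                        long initialBackoffMs,
                                        long maxBackoffMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long backoffMs = initialBackoffMs;
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Never retry after an interrupt; let the caller handle it.
                throw e;
            } catch (Exception e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs);
                    backoffMs = Math.min(backoffMs * 2, maxBackoffMs);
                }
            }
        }
        throw lastFailure;
    }
}
```

With maxAttempts set to 1 this degenerates to the current no-retry behavior, which is one way a conservative default for vanilla Connect clusters and a more permissive one for dedicated MM2 clusters could share the same code path.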