[ 
https://issues.apache.org/jira/browse/KAFKA-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854113#comment-17854113
 ] 

Chris Egerton commented on KAFKA-16931:
---------------------------------------

Spoke too soon, have one more thought! [~ecomar] it sounds like you've been 
using MM2 with exactly-once support enabled. In dedicated mode there's 
currently no way to restart failed tasks without restarting an entire process, 
which can be a PITA. Maybe we could add more permissive retry logic for 
dedicated MM2 clusters in this case, and keep the retry logic for vanilla Kafka 
Connect clusters either as-is, or at least more conservative?
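
To make the idea concrete, here is a rough sketch of bounded retries with capped exponential backoff, using one policy per deployment type. Everything in it is hypothetical: the class and method names, the two profiles, and the numbers are illustrative assumptions, not existing Connect code or configuration, and it assumes only IOException-style transport failures are worth retrying.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical sketch; none of these names exist in Kafka Connect.
final class RetrySketch {

    /** Illustrative retry settings, not real Connect configuration entries. */
    public static final class RetryPolicy {
        final int maxAttempts;
        final long initialBackoffMs;
        final long maxBackoffMs;

        public RetryPolicy(int maxAttempts, long initialBackoffMs, long maxBackoffMs) {
            if (maxAttempts < 1) {
                throw new IllegalArgumentException("maxAttempts must be at least 1");
            }
            this.maxAttempts = maxAttempts;
            this.initialBackoffMs = initialBackoffMs;
            this.maxBackoffMs = maxBackoffMs;
        }

        /** Backoff before the next try; doubles per attempt (1-based), capped at maxBackoffMs. */
        public long backoffMs(int attempt) {
            long backoff = initialBackoffMs << Math.min(attempt - 1, 20);
            return Math.min(backoff, maxBackoffMs);
        }
    }

    // Assumed profiles: conservative for vanilla Connect, more permissive for dedicated MM2.
    public static final RetryPolicy VANILLA_CONNECT = new RetryPolicy(3, 250, 2_000);
    public static final RetryPolicy DEDICATED_MM2 = new RetryPolicy(10, 250, 30_000);

    /**
     * Runs the request, retrying only on IOException (treated here as a transient
     * transport failure). Any other exception propagates immediately. Once the
     * attempts are exhausted, the last IOException is rethrown.
     */
    public static <T> T callWithRetry(Callable<T> request, RetryPolicy policy) throws Exception {
        IOException last = null;
        for (int attempt = 1; attempt <= policy.maxAttempts; attempt++) {
            try {
                return request.call();
            } catch (IOException e) {
                last = e;
                if (attempt < policy.maxAttempts) {
                    Thread.sleep(policy.backoffMs(attempt));
                }
            }
        }
        throw last;
    }
}
```

A caller would wrap the forwarded fencing request in callWithRetry and pick the policy based on whether the worker runs as part of a dedicated MM2 cluster. Whether the limits end up hard-coded or exposed as configuration is the part that would probably need a KIP.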

> Transient REST failures to forward fenceZombie requests leave Connect Tasks 
> in FAILED state
> -------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-16931
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16931
>             Project: Kafka
>          Issue Type: Bug
>          Components: connect
>            Reporter: Edoardo Comar
>            Priority: Major
>
> When Kafka Connect runs in exactly_once mode, a task restart will fence 
> possible zombie tasks.
> This is achieved by forwarding the request to the leader worker using the REST 
> protocol.
> At scale, in distributed mode, an HTTPS request may occasionally fail because 
> of a networking glitch, reconfiguration, etc.
> Currently there is no attempt to retry the REST request; the task is left in 
> a FAILED state and requires an external restart (with the REST API).
> Would this issue require a small KIP to introduce configuration entries to 
> limit the number of retries, backoff times, etc.?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
