[jira] [Commented] (KAFKA-16931) Transient REST failures to forward fenceZombie requests leave Connect Tasks in FAILED state

2024-06-11 Thread Chris Egerton (Jira)



[ 
https://issues.apache.org/jira/browse/KAFKA-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854113#comment-17854113
 ] 

Chris Egerton commented on KAFKA-16931:
---

Spoke too soon, have one more thought! [~ecomar] it sounds like you've been 
using MM2 with exactly-once support enabled. In dedicated mode there's 
currently no way to restart failed tasks without restarting an entire process, 
which can be a PITA. Maybe we could add more permissive retry logic for 
dedicated MM2 clusters in this case, and keep the retry logic for vanilla Kafka 
Connect clusters either as-is, or at least more conservative?

> Transient REST failures to forward fenceZombie requests leave Connect Tasks 
> in FAILED state
> ---
>
> Key: KAFKA-16931
> URL: https://issues.apache.org/jira/browse/KAFKA-16931
> Project: Kafka
> Issue Type: Bug
> Components: connect
> Reporter: Edoardo Comar
> Priority: Major
>
> When Kafka Connect runs in exactly_once mode, a task restart will fence 
> possible zombie tasks.
> This is achieved by forwarding the request to the leader worker using the 
> REST protocol.
> At scale, in distributed mode, an HTTPS request may occasionally fail because 
> of a networking glitch, reconfiguration, etc.
> Currently there is no attempt to retry the REST request; the task is left in 
> a FAILED state and requires an external restart (with the REST API).
> Would this issue require a small KIP to introduce configuration entries to 
> limit the number of retries, backoff times, etc.?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (KAFKA-16931) Transient REST failures to forward fenceZombie requests leave Connect Tasks in FAILED state

2024-06-11 Thread Chris Egerton (Jira)



[ 
https://issues.apache.org/jira/browse/KAFKA-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854091#comment-17854091
 ] 

Chris Egerton commented on KAFKA-16931:
---

First, one small clarification: task restarts do not result in zombie fencings 
unless no successful zombie fencing has taken place yet for the current 
generation of task configs. They do require an unconditional REST request to 
the leader to check on whether that fencing has taken place yet, and to perform 
one if it hasn't.
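
To make the clarification above concrete, here is a minimal sketch of the leader-side check, assuming a simple in-memory map. The class and method names (LeaderFencingSketch, ensureFenced, fenceZombies) are hypothetical and are not the actual Connect internals; the sketch only illustrates "fence at most once per generation of task configs".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; these names do not exist in Kafka Connect.
final class LeaderFencingSketch {
    // Connector name -> task-config generation for which fencing last succeeded.
    private final Map<String, Integer> fencedGeneration = new HashMap<>();
    private int fencingsPerformed = 0;

    /**
     * Handles the unconditional request a starting task sends to the leader.
     * Returns true if a fencing round ran, false if the current generation
     * had already been fenced and nothing needed to be done.
     */
    synchronized boolean ensureFenced(String connector, int currentGeneration) {
        Integer fenced = fencedGeneration.get(connector);
        if (fenced != null && fenced == currentGeneration) {
            return false; // already fenced for this generation of task configs
        }
        fenceZombies(connector);
        fencedGeneration.put(connector, currentGeneration);
        fencingsPerformed++;
        return true;
    }

    // Placeholder for the real fencing work, which is out of scope here.
    private void fenceZombies(String connector) {
    }

    int fencingsPerformed() {
        return fencingsPerformed;
    }
}
```

In this sketch a second restart within the same generation still reaches the leader, but the call returns without fencing again.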

With that out of the way, a KIP would definitely be required if we wanted to 
add new configurations related to retries. We could add some hard-coded retry 
logic for now, which IMO wouldn't require a KIP. The tricky part either way 
would be striking a balance between resiliency to transient failures (which the 
current design certainly lacks) and surfacing non-retriable errors to users in 
an easily-accessible manner (which, despite its shortcomings, the current 
design does fairly well).
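
As a rough illustration of what hard-coded retry logic could look like, here is a sketch that retries only failures assumed to be transient and lets everything else surface immediately. All names (FencingRetrySketch, Request, NonRetriableException, sendWithRetries) are hypothetical, and treating IOException as "transient" is an assumption made for the example, not how Connect classifies errors today.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch; none of these names exist in Kafka Connect.
final class FencingRetrySketch {

    /** A request to the leader that may fail with an I/O error. */
    @FunctionalInterface
    interface Request<T> {
        T send() throws IOException;
    }

    /** Marks failures that retrying cannot fix, so they reach the user at once. */
    static final class NonRetriableException extends RuntimeException {
        NonRetriableException(String message) {
            super(message);
        }
    }

    /**
     * Sends the request up to maxAttempts times, doubling the backoff after
     * each IOException (assumed transient). Any other exception, including
     * NonRetriableException, propagates on the first occurrence.
     */
    static <T> T sendWithRetries(Request<T> request, int maxAttempts, long initialBackoffMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long backoffMs = initialBackoffMs;
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.send();
            } catch (IOException e) {
                lastFailure = e; // assumed transient; fall through to backoff
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", ie);
                }
                backoffMs *= 2;
            }
        }
        throw new UncheckedIOException("Gave up after " + maxAttempts + " attempts", lastFailure);
    }
}
```

The attempt count and backoff are plain parameters here; whether such values should be hard-coded or exposed as configuration is exactly the KIP question discussed above.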




