[ 
https://issues.apache.org/jira/browse/KAFKA-15238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Mayya updated KAFKA-15238:
-------------------------------
    Summary: Connect workers can be disabled by DLQ-related blocking admin 
client calls  (was: Connect workers can be disabled by DLQ related stuck admin 
client calls)

> Connect workers can be disabled by DLQ-related blocking admin client calls
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-15238
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15238
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>            Reporter: Yash Mayya
>            Assignee: Yash Mayya
>            Priority: Major
>
> When Kafka Connect runs in distributed mode and a sink connector's task is 
> restarted (via a worker's REST API), the following sequence of steps occurs 
> on the DistributedHerder's thread:
>  
>  # The existing sink task will be stopped 
> ([ref|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L1367])
>  # A new sink task will be started 
> ([ref|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L1867C40-L1867C40])
>  # As a part of the above step, a new {{WorkerSinkTask}} will be instantiated 
> ([ref|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java#L656-L663])
>  # The DLQ reporter (see 
> [KIP-298|https://cwiki.apache.org/confluence/display/KAFKA/KIP-298%3A+Error+Handling+in+Connect])
>  for the sink task is also instantiated and configured as a part of this 
> ([ref|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java#L1800])
>  # The DLQ reporter setup makes two synchronous admin client calls: one to 
> list topics and one to create the DLQ topic if it doesn't already exist 
> ([ref|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/errors/DeadLetterQueueReporter.java#L84-L87])
>  
> All of these steps run synchronously on the herder's tick thread, in the 
> portion 
> [here|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L457-L469]
>  where external requests are run. If an admin client call in the DLQ 
> reporter setup step blocks for some time (for example, due to auth failures 
> and retries, or network issues), the Connect worker can become 
> non-functional (REST API requests will time out) and can even fall out of 
> the Connect cluster and become a zombie, since the tick thread also drives 
> group membership functions.
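
The effect described above can be illustrated with a small JDK-only
simulation. This is a rough sketch under stated assumptions, not Kafka Connect
code: TickThreadDemo, maxHeartbeatGapMillis and the 10 ms idle tick are
hypothetical names and values, and the DLQ reporter's synchronous admin client
call is modeled as a plain sleep. It only shows that when heartbeats and
external requests share one thread, the heartbeat gap grows by however long
the request blocks.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.TimeUnit;

// Simplified stand-in for a herder whose single tick thread both performs
// group membership work and runs queued external requests.
// All names here are hypothetical; this is not Kafka Connect code.
class TickThreadDemo {

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /**
     * Runs a fixed number of ticks on the calling thread and returns the
     * longest observed gap, in milliseconds, between two "heartbeats".
     * One queued request simulates a task restart whose DLQ setup blocks
     * for blockingCallMillis (assumption: modeled with Thread.sleep).
     */
    static long maxHeartbeatGapMillis(long blockingCallMillis, int ticks) {
        Queue<Runnable> externalRequests = new ArrayDeque<>();
        externalRequests.add(() -> sleepQuietly(blockingCallMillis));

        long maxGapNanos = 0;
        long lastHeartbeat = System.nanoTime();
        for (int i = 0; i < ticks; i++) {
            // Stand-in for group membership work at the start of each tick.
            long now = System.nanoTime();
            maxGapNanos = Math.max(maxGapNanos, now - lastHeartbeat);
            lastHeartbeat = now;

            // External requests run on the same thread, so a blocking
            // request delays the next heartbeat by its full duration.
            Runnable request = externalRequests.poll();
            if (request != null) {
                request.run();
            } else {
                sleepQuietly(10); // idle tick interval (arbitrary value)
            }
        }
        return TimeUnit.NANOSECONDS.toMillis(maxGapNanos);
    }

    public static void main(String[] args) {
        System.out.println("max heartbeat gap, fast admin call:    "
                + maxHeartbeatGapMillis(0, 5) + " ms");
        System.out.println("max heartbeat gap, blocked admin call: "
                + maxHeartbeatGapMillis(300, 5) + " ms");
    }
}
```

In this model the second figure is roughly the simulated blocking time. In a
real worker the equivalent gap would be measured against the group's session
and rebalance timeouts, which is what leads to the worker falling out of the
cluster.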



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
