Yash Mayya created KAFKA-15238:
----------------------------------

Summary: Connect workers can be disabled by DLQ related stuck admin client calls
Key: KAFKA-15238
URL: https://issues.apache.org/jira/browse/KAFKA-15238
Project: Kafka
Issue Type: Bug
Components: KafkaConnect
Reporter: Yash Mayya
Assignee: Yash Mayya

When Kafka Connect is run in distributed mode - if a sink connector's task is restarted (via a worker's REST API), the following sequence of steps will occur (on the DistributedHerder's thread):

# The existing sink task will be stopped ([ref|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L1367])
# A new sink task will be started ([ref|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L1867C40-L1867C40])
# As a part of the above step, a new {{WorkerSinkTask}} will be instantiated ([ref|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java#L656-L663])
# The DLQ reporter (see [KIP-298|https://cwiki.apache.org/confluence/display/KAFKA/KIP-298%3A+Error+Handling+in+Connect]) for the sink task is also instantiated and configured as a part of this ([ref|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java#L1800])
# The DLQ reporter setup involves two synchronous admin client calls to list topics and create the DLQ topic if it isn't already created ([ref|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/errors/DeadLetterQueueReporter.java#L84-L87])

All of these are occurring synchronously on the herder's tick thread - in this portion [here|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L457-L469] where external requests are run.
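
The single-threaded nature of the steps above can be illustrated with a minimal, self-contained sketch. It does not use any Connect or admin client classes: {{TickThreadSketch}}, {{blockingSetup}} and {{heartbeatDelayMillis}} are hypothetical names, a single-thread executor stands in for the herder's tick thread, and {{Thread.sleep}} stands in for a synchronous admin client call. It only shows that work queued behind a blocking call on the same thread cannot run until that call returns.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class TickThreadSketch {

    // Hypothetical stand-in for a synchronous call made during DLQ reporter setup.
    // It simply blocks the calling thread for the given time.
    static void blockingSetup(long blockMillis) {
        try {
            Thread.sleep(blockMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Queues a "task restart" followed by a "heartbeat" on one thread and
    // returns how long (in ms) the heartbeat waited before it actually ran.
    static long heartbeatDelayMillis(long blockMillis) throws Exception {
        // Assumption: one executor thread models the herder's tick thread.
        ExecutorService tickThread = Executors.newSingleThreadExecutor();
        try {
            long submitted = System.nanoTime();

            // External request: the restart path ends in the blocking setup call.
            tickThread.submit(() -> blockingSetup(blockMillis));

            // Group membership work: records the time at which it finally runs.
            Callable<Long> heartbeatTask = System::nanoTime;
            Future<Long> heartbeat = tickThread.submit(heartbeatTask);

            long ranAt = heartbeat.get(blockMillis + 5_000, TimeUnit.MILLISECONDS);
            return TimeUnit.NANOSECONDS.toMillis(ranAt - submitted);
        } finally {
            tickThread.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("heartbeat delayed by ~" + heartbeatDelayMillis(500) + " ms");
    }
}
```

In this sketch the heartbeat is delayed by roughly the time the setup call blocks, because both are served by the same thread.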

If the admin client call in the DLQ reporter setup step blocks for some time (due to auth failures and retries, network issues, or any other reason), this can cause the Connect worker to become non-functional (REST API requests will time out) and even fall out of the Connect cluster and become a zombie (since the tick thread also drives group membership functions).