Yash Mayya created KAFKA-15238:
----------------------------------

             Summary: Connect workers can be disabled by DLQ related stuck 
admin client calls
                 Key: KAFKA-15238
                 URL: https://issues.apache.org/jira/browse/KAFKA-15238
             Project: Kafka
          Issue Type: Bug
          Components: KafkaConnect
            Reporter: Yash Mayya
            Assignee: Yash Mayya


When Kafka Connect runs in distributed mode and a sink connector's task is 
restarted (via a worker's REST API), the following sequence of steps occurs 
on the DistributedHerder's thread:

 
 # The existing sink task will be stopped 
([ref|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L1367])
 # A new sink task will be started 
([ref|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L1867C40-L1867C40])
 # As a part of the above step, a new {{WorkerSinkTask}} will be instantiated 
([ref|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java#L656-L663])
 # The DLQ reporter (see 
[KIP-298|https://cwiki.apache.org/confluence/display/KAFKA/KIP-298%3A+Error+Handling+in+Connect])
 for the sink task is also instantiated and configured as a part of this 
([ref|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java#L1800])
 # The DLQ reporter setup involves two synchronous admin client calls: one to 
list topics and one to create the DLQ topic if it does not already exist 
([ref|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/errors/DeadLetterQueueReporter.java#L84-L87])
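The last step can be sketched roughly as follows. This is a minimal illustration, not the actual {{DeadLetterQueueReporter}} code: the {{TopicAdmin}} interface, the {{InMemoryAdmin}} fake and the {{ensureDlqTopic}} helper are hypothetical names that stand in for the real admin client so that the example runs with only the JDK. The point is that both calls block the calling thread until the future completes.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

public class DlqSetupSketch {
    /** Hypothetical stand-in for the admin client; this is not a Kafka interface. */
    interface TopicAdmin {
        CompletableFuture<Set<String>> listTopicNames();
        CompletableFuture<Void> createTopic(String name);
    }

    /** In-memory fake whose futures complete immediately (assumption for the demo). */
    static class InMemoryAdmin implements TopicAdmin {
        final Set<String> topics = new HashSet<>();
        int createCalls = 0;

        @Override
        public CompletableFuture<Set<String>> listTopicNames() {
            return CompletableFuture.completedFuture(new HashSet<>(topics));
        }

        @Override
        public CompletableFuture<Void> createTopic(String name) {
            createCalls++;
            topics.add(name);
            return CompletableFuture.completedFuture(null);
        }
    }

    /**
     * Lists topics, then creates the DLQ topic only if it is missing.
     * Returns true if the topic had to be created.
     */
    static boolean ensureDlqTopic(TopicAdmin admin, String dlqTopic) {
        // Blocking call 1: waits, with no timeout, until the listing completes.
        Set<String> existing = admin.listTopicNames().join();
        if (existing.contains(dlqTopic)) {
            return false;
        }
        // Blocking call 2: waits, with no timeout, until the creation completes.
        admin.createTopic(dlqTopic).join();
        return true;
    }
}
```

With the in-memory fake both calls return immediately; against a real cluster, either call could wait for as long as the underlying request takes to complete or fail.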

 

All of these steps occur synchronously on the herder's tick thread, in the 
section linked 
[here|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L457-L469]
 where external requests are run. If an admin client call in the DLQ reporter 
setup step blocks for some time (due to auth failures and retries, network 
issues, or any other reason), the Connect worker can become non-functional 
(REST API requests will time out) and can even fall out of the Connect cluster 
and become a zombie (since the tick thread also drives group membership 
functions).
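
The effect on other requests can be reproduced in isolation with a JDK-only sketch. It assumes a simplified model in which a single-threaded executor plays the role of the tick thread and a latch plays the role of the stuck admin client call; the class and method names are hypothetical and nothing here is Connect code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TickThreadStallSketch {
    /**
     * Queues a "task restart" followed by a "REST request" on one thread and
     * returns true if the REST request times out while waiting its turn.
     */
    static boolean restRequestTimesOut(boolean adminCallStuck) {
        // One thread serves every queued request, like the herder's tick thread.
        ExecutorService tickThread = Executors.newSingleThreadExecutor();
        // Released when the simulated admin client call returns.
        CountDownLatch adminCallReturns = new CountDownLatch(1);
        if (!adminCallStuck) {
            adminCallReturns.countDown();
        }
        try {
            // Task restart whose DLQ setup waits on the admin client call.
            tickThread.submit(() -> {
                try {
                    adminCallReturns.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // Unrelated REST request queued behind the restart.
            Future<String> status = tickThread.submit(() -> "connector status");
            try {
                status.get(500, TimeUnit.MILLISECONDS);
                return false;
            } catch (TimeoutException e) {
                return true;
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        } finally {
            adminCallReturns.countDown();
            tickThread.shutdownNow();
        }
    }
}
```

In this model the second request never depends on the DLQ setup, yet it cannot run until the first one returns, which mirrors the REST API timeouts described above.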



