[ 
https://issues.apache.org/jira/browse/KAFKA-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040490#comment-16040490
 ] 

Rajini Sivaram commented on KAFKA-5395:
---------------------------------------

[~ijuma] Will take a look at this today, thanks.

> Distributed Herder Deadlocks on Shutdown
> ----------------------------------------
>
>                 Key: KAFKA-5395
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5395
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 0.10.2.1
>            Reporter: Michael Jaschob
>            Assignee: Rajini Sivaram
>            Priority: Critical
>             Fix For: 0.11.0.0
>
>         Attachments: connect_01021_shutdown_deadlock.txt
>
>
> We're trying to upgrade Kafka Connect to 0.10.2.1 and see that the process 
> does not shut down cleanly. It hangs instead. From what I can tell 
> [KAFKA-4786|https://github.com/apache/kafka/commit/ba4eafa7874988374abcd9f48fbab96abb2032a4]
>  introduced this deadlock.
> [close|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L664]
>  on the AbstractCoordinator is marked as synchronized and acquires the 
> coordinator's monitor. The first thing it tries to do is 
> [join|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L323]
>  the heartbeat thread.
> Meanwhile, the heartbeat thread is [synchronized on the same 
> monitor|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L891],
>  which it relinquishes when it 
> [waits|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L926].
>  But for the wait to return (and the run method of the heartbeat to 
> terminate) it needs to reacquire that monitor.
> The heartbeat thread cannot reacquire the monitor because the distributed 
> herder thread holds it, and the herder thread will never release it because 
> it is blocked waiting for the heartbeat thread to join.
> I am attaching a thread dump illustrating the situation. Take note in 
> particular of threads #178 (the heartbeat thread) and #159 (the herder 
> thread). The former is BLOCKED trying to reacquire 0x00000007406cc0c0, and 
> the latter is WAITING on the heartbeat thread to join, having itself acquired 
> 0x00000007406cc0c0.
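
The lock ordering described above can be reproduced in isolation. The following is a minimal sketch, not the Kafka code: the class HeartbeatDeadlockSketch and all of its members are hypothetical stand-ins for AbstractCoordinator and its heartbeat thread, and the join is bounded by a timeout so the demonstration terminates instead of hanging. The second close method shows one generic way to avoid the pattern (join only after leaving the synchronized block); it is an assumption for illustration and not necessarily the fix adopted for this issue.

```java
// Hypothetical stand-in for the coordinator/heartbeat pair; not Kafka code.
class HeartbeatDeadlockSketch {
    // Plays the role of the coordinator's monitor.
    private final Object lock = new Object();
    private boolean closed = false;
    private final Thread heartbeat = new Thread(this::runHeartbeat, "heartbeat");

    private void runHeartbeat() {
        synchronized (lock) {
            while (!closed) {
                try {
                    // wait() releases the monitor, but must reacquire it before returning.
                    lock.wait();
                } catch (InterruptedException e) {
                    return;
                }
            }
        }
    }

    void start() {
        heartbeat.start();
    }

    // Mirrors the reported pattern: join the heartbeat thread while still
    // holding the monitor. The bounded join only exists so the sketch returns.
    // Returns true if the heartbeat thread terminated within the timeout.
    boolean closeHoldingMonitor(long timeoutMs) throws InterruptedException {
        synchronized (lock) {
            closed = true;
            lock.notifyAll();
            heartbeat.join(timeoutMs); // heartbeat cannot reacquire 'lock' meanwhile
            return !heartbeat.isAlive();
        }
    }

    // Illustrative alternative: signal under the monitor, join after releasing it.
    boolean closeAfterReleasingMonitor(long timeoutMs) throws InterruptedException {
        synchronized (lock) {
            closed = true;
            lock.notifyAll();
        }
        heartbeat.join(timeoutMs);
        return !heartbeat.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        HeartbeatDeadlockSketch stuck = new HeartbeatDeadlockSketch();
        stuck.start();
        System.out.println("joined while holding monitor: " + stuck.closeHoldingMonitor(200));

        HeartbeatDeadlockSketch clean = new HeartbeatDeadlockSketch();
        clean.start();
        System.out.println("joined after releasing monitor: " + clean.closeAfterReleasingMonitor(2000));
    }
}
```

In the first call the heartbeat thread stays alive for the whole join, because it needs the monitor that the closing thread holds; with an unbounded join, as in the report, neither thread would ever make progress.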



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
