[ https://issues.apache.org/jira/browse/KAFKA-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378901#comment-16378901 ]

Jason Gustafson commented on KAFKA-6593:
----------------------------------------

I've attached the logs in case anyone is interested. What I was struggling to 
explain is the following gap in the logs:
{code:java}
[2018-02-15 07:36:20,902] DEBUG Updated cluster metadata version 18 to Cluster(id = CTkloGopRuCUxoicTrG1WA, nodes = [worker6:9092 (id: 2 rack: null), worker5:9092 (id: 1 rack: null)], partitions = [Partition(topic = test_topic, partition = 0, leader = 1, replicas = [2,1], isr = [1], offlineReplicas = [2])]) (org.apache.kafka.clients.Metadata)
[2018-02-15 07:36:33,881] DEBUG [Consumer clientId=consumer-1, groupId=test_group_id] Connection with worker5/172.31.44.80 disconnected (org.apache.kafka.common.network.Selector)
java.io.EOFException
        at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:124)
        at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:93){code}
Basically we have a situation where a coordinator disconnect is observed by the 
heartbeat thread while a call to commitSync() is pending.

I think we might be hitting a livelock situation of sorts. The root of the 
problem is basically that the threads are contending for an unfair lock. Here 
is the sequence of events:

1. We begin a call to ConsumerCoordinator.commitOffsetsSync() which sends the 
offset commit and then begins polling for the response.
2. At some point, the heartbeat thread wakes up, observes the coordinator 
disconnect, and attempts to complete the callback for the offset commit.
3. In order to complete the callback, it has to acquire the lock in 
ConsumerNetworkClient because it has some logic in coordinatorDead() to 
disconnect the coordinator. Until this lock is obtained, the future for the 
offset commit response will be seen as uncompleted. 
4. The poll loop in commitOffsetsSync() also contends for the same lock and is 
continually able to acquire it before the heartbeat thread. It may seem 
far-fetched that it could win the race for so long, but it holds the lock for 
up to 3 seconds at a time since that is the poll timeout, so it does not take 
many consecutive acquisitions before the session timeout expires (see the 
sketch after this list).
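
To make step 4 concrete, here is a toy sketch of the contention pattern in 
plain Java. It is not Kafka code: the class and thread names are made up, the 
simulated poll holds the lock for 100 ms rather than 3 seconds, and the outcome 
is timing-dependent. The point is only that with the default non-fair 
ReentrantLock, the thread that just released the lock can barge back in ahead 
of the parked heartbeat thread, so only a handful of consecutive wins are 
needed to run past the session timeout.
{code:java}
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

// Toy model only (not Kafka code): a "poller" thread repeatedly holds a
// non-fair lock for the duration of a simulated poll and re-acquires it
// immediately after releasing, while a "heartbeat" thread needs the same
// lock before it can complete the commit future.
public class UnfairLockContention {
    public static void main(String[] args) throws Exception {
        ReentrantLock lock = new ReentrantLock(); // non-fair by default; compare new ReentrantLock(true)
        AtomicBoolean futureCompleted = new AtomicBoolean(false);

        Thread poller = new Thread(() -> {
            // Bounded so the demo always terminates even if the heartbeat starves.
            for (int i = 0; i < 50 && !futureCompleted.get(); i++) {
                lock.lock();
                try {
                    Thread.sleep(100); // stands in for the poll that holds the lock
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                } finally {
                    lock.unlock();
                }
                // No pause here: the poller re-acquires right away, often ahead
                // of the parked heartbeat thread because barging is allowed.
            }
        }, "poller");

        poller.start();
        Thread.sleep(50); // let the poller take the lock first

        long start = System.nanoTime();
        Thread heartbeat = new Thread(() -> {
            lock.lock(); // must be held before the commit future can be completed
            try {
                futureCompleted.set(true);
            } finally {
                lock.unlock();
            }
        }, "heartbeat");
        heartbeat.start();
        heartbeat.join();

        System.out.printf("heartbeat waited %d ms for the lock%n",
                (System.nanoTime() - start) / 1_000_000);
        poller.join();
    }
}{code}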

Switching to a fair lock in ConsumerNetworkClient should mitigate the problem, 
but it may not be a 100% solution since it cannot actually guarantee fairness 
due to thread scheduling uncertainty. In this particular case, the lock 
acquisition from the heartbeat thread in the callback is actually unnecessary 
since we know the coordinator was already disconnected, so we can probably 
improve this as well. We'll have to look through the code though to see whether 
there are other similar cases.
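
For reference, the fairness switch itself is just the boolean flag on the 
ReentrantLock constructor; a sketch (the class and field names below are 
illustrative, not the actual ConsumerNetworkClient code):
{code:java}
import java.util.concurrent.locks.ReentrantLock;

public class LockFairnessSketch {
    // Non-fair default: a thread that has just released the lock may barge
    // back in ahead of threads that were already waiting.
    private final ReentrantLock nonFairLock = new ReentrantLock();

    // Fair construction: waiting threads are granted the lock in roughly FIFO
    // order, at some throughput cost. As noted above, this mitigates rather
    // than eliminates the problem.
    private final ReentrantLock fairLock = new ReentrantLock(true);
}{code}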

This is probably also one more strike against the shared NetworkClient approach 
used in the consumer.

> Coordinator disconnect in heartbeat thread can cause commitSync to block indefinitely
> -------------------------------------------------------------------------------------
>
>                 Key: KAFKA-6593
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6593
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 1.0.0, 0.11.0.2
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>            Priority: Major
>             Fix For: 1.1.0
>
>         Attachments: consumer.log
>
>
> If a coordinator disconnect is observed in the heartbeat thread, it can cause 
> a pending offset commit to be cancelled just before the foreground thread 
> begins waiting on its response in poll(). Since the poll timeout is 
> Long.MAX_VALUE, this will cause the consumer to effectively hang until some 
> other network event causes the poll() to return. We try to protect this case 
> with a poll condition on the future, but this isn't bulletproof since the 
> future can be completed outside of the lock.


