[ https://issues.apache.org/jira/browse/KAFKA-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872190#comment-16872190 ]

muchl commented on KAFKA-7697:
------------------------------

[~rsivaram] The original problem was fixed after upgrading to 2.1.1, but a new 
one appeared. I'm not sure whether the two issues are related, but the logs 
printed when the problem occurs are similar.
We hit a similar broker hang on 2.1.1. On 2.1.0 the problem crashed the broker; 
on 2.1.1 it recovered automatically after a few minutes, but the cluster was 
unavailable during that time.
I uploaded a log named 2.1.1-hangs.log [^2.1.1-hangs.log]. By the time we 
noticed the problem and logged in to the server, the cluster had already 
recovered, so we have not yet been able to capture the full stack information, 
but the broker and consumer logs show that something is wrong. Could you give 
me some help? Thank you!
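
If the hang happens again before the broker recovers on its own, a thread dump 
taken from the stuck broker would show which locks the request handler threads 
are parked on. A minimal sketch of how to capture one with the standard JDK 
tools (the output file name here is just an example):

  jps -l                                    # find the broker JVM's pid (the main class is kafka.Kafka)
  jstack -l <pid> > broker-threaddump.txt   # -l also lists owned j.u.c. synchronizers such as leaderIsrUpdateLock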

> Possible deadlock in kafka.cluster.Partition
> --------------------------------------------
>
>                 Key: KAFKA-7697
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7697
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>            Reporter: Gian Merlino
>            Assignee: Rajini Sivaram
>            Priority: Blocker
>             Fix For: 2.2.0, 2.1.1
>
>         Attachments: 2.1.1-hangs.log, 322.tdump, kafka.log, kafka_jstack.txt, 
> threaddump.txt
>
>
> After upgrading a fairly busy broker from 0.10.2.0 to 2.1.0, it locked up 
> within a few minutes (by "locked up" I mean that all request handler threads 
> were busy, and other brokers reported that they couldn't communicate with 
> it). I restarted it a few times and it did the same thing each time. After 
> downgrading to 0.10.2.0, the broker was stable. I attached a threaddump.txt 
> from the last attempt on 2.1.0 that shows lots of kafka-request-handler- 
> threads trying to acquire the leaderIsrUpdateLock lock in 
> kafka.cluster.Partition.
> It jumps out that there are two threads that already have some read lock 
> (can't tell which one) and are trying to acquire a second one (on two 
> different read locks: 0x0000000708184b88 and 0x000000070821f188): 
> kafka-request-handler-1 and kafka-request-handler-4. Both are handling a 
> produce request, and in the process of doing so, are calling 
> Partition.fetchOffsetSnapshot while trying to complete a DelayedFetch. At the 
> same time, both of those locks have writers from other threads waiting on 
> them (kafka-request-handler-2 and kafka-scheduler-6). Neither of those locks 
> appears to have writers that hold them (if only because no threads in the dump 
> are deep enough in inWriteLock to indicate that).
> ReentrantReadWriteLock in nonfair mode prioritizes waiting writers over 
> readers. Is it possible that kafka-request-handler-1 and 
> kafka-request-handler-4 are each trying to read-lock the partition that is 
> currently locked by the other one, and they're both parked waiting for 
> kafka-request-handler-2 and kafka-scheduler-6 to get write locks, which they 
> never will, because the former two threads own read locks and aren't giving 
> them up?
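
To make the suspected interleaving above concrete, here is a minimal standalone 
Java sketch (not Kafka code; the lock and thread names are hypothetical 
stand-ins for the two partitions' leaderIsrUpdateLock instances and the 
handler/scheduler threads). It uses java.util.concurrent.locks.ReentrantReadWriteLock 
in its default nonfair mode:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class RwLockDeadlockSketch {
    public static void main(String[] args) throws Exception {
        // Two locks standing in for the leaderIsrUpdateLock of two different partitions.
        ReentrantReadWriteLock lockA = new ReentrantReadWriteLock(); // nonfair by default
        ReentrantReadWriteLock lockB = new ReentrantReadWriteLock();
        CountDownLatch firstReadLocksHeld = new CountDownLatch(2);

        // "reader-1" plays kafka-request-handler-1: it holds a read lock on A,
        // then tries to read-lock B after a writer has already queued on B.
        Thread reader1 = new Thread(() -> {
            lockA.readLock().lock();
            firstReadLocksHeld.countDown();
            sleep(500);                       // give the writers time to queue up
            lockB.readLock().lock();          // parks behind the queued writer on B
        }, "reader-1");

        // "reader-2" plays kafka-request-handler-4, mirrored.
        Thread reader2 = new Thread(() -> {
            lockB.readLock().lock();
            firstReadLocksHeld.countDown();
            sleep(500);
            lockA.readLock().lock();          // parks behind the queued writer on A
        }, "reader-2");

        reader1.start();
        reader2.start();
        firstReadLocksHeld.await();

        // The writers play kafka-scheduler-6 / kafka-request-handler-2: they queue
        // for the write locks while the read locks are still held, and never get them.
        new Thread(() -> lockA.writeLock().lock(), "writer-A").start();
        new Thread(() -> lockB.writeLock().lock(), "writer-B").start();

        reader1.join(2000);
        reader2.join(2000);
        System.out.println("reader-1: " + reader1.getState()); // WAITING, parked on B's queue
        System.out.println("reader-2: " + reader2.getState()); // WAITING, parked on A's queue
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}

Both readers end up permanently parked, because a nonfair 
ReentrantReadWriteLock blocks an incoming reader when the head of its wait 
queue is a writer, and neither writer can proceed while the original read 
locks are held. That is the same four-thread cycle hypothesized above.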


