[ https://issues.apache.org/jira/browse/KAFKA-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872190#comment-16872190 ]
muchl commented on KAFKA-7697:
------------------------------

[~rsivaram] The problem was fixed after upgrading to 2.1.1, but a new problem appeared. I'm not sure whether the two issues are related, but the logs printed when the problem occurs are similar. A similar broker hang was encountered on 2.1.1. The problem caused a broker crash on 2.1.0, but on 2.1.1 the broker recovers automatically after a few minutes, and the cluster is unavailable during that time. I uploaded a log named 2.1.1-hangs.log [^2.1.1-hangs.log]. By the time we noticed the problem and logged in to the server, the cluster had already recovered, so we have not yet captured the full stack information, but the problem is visible in the broker and consumer logs. Could you give me some help? Thank you!

> Possible deadlock in kafka.cluster.Partition
> --------------------------------------------
>
>                 Key: KAFKA-7697
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7697
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>            Reporter: Gian Merlino
>            Assignee: Rajini Sivaram
>            Priority: Blocker
>             Fix For: 2.2.0, 2.1.1
>
>         Attachments: 2.1.1-hangs.log, 322.tdump, kafka.log, kafka_jstack.txt, threaddump.txt
>
>
> After upgrading a fairly busy broker from 0.10.2.0 to 2.1.0, it locked up
> within a few minutes (by "locked up" I mean that all request handler threads
> were busy, and other brokers reported that they couldn't communicate with
> it). I restarted it a few times and it did the same thing each time. After
> downgrading to 0.10.2.0, the broker was stable. I attached a threaddump.txt
> from the last attempt on 2.1.0 that shows lots of kafka-request-handler
> threads trying to acquire the leaderIsrUpdateLock lock in
> kafka.cluster.Partition.
> It jumps out that there are two threads that already hold some read lock
> (can't tell which one) and are trying to acquire a second one (on two
> different read locks: 0x0000000708184b88 and 0x000000070821f188):
> kafka-request-handler-1 and kafka-request-handler-4. Both are handling a
> produce request, and in the process of doing so, are calling
> Partition.fetchOffsetSnapshot while trying to complete a DelayedFetch. At the
> same time, both of those locks have writers from other threads waiting on
> them (kafka-request-handler-2 and kafka-scheduler-6). Neither of those locks
> appears to have a writer that holds it (if only because no threads in the
> dump are deep enough in inWriteLock to indicate that).
>
> ReentrantReadWriteLock in nonfair mode prioritizes waiting writers over
> readers. Is it possible that kafka-request-handler-1 and
> kafka-request-handler-4 are each trying to read-lock the partition that is
> currently locked by the other one, and they're both parked waiting for
> kafka-request-handler-2 and kafka-scheduler-6 to get write locks, which they
> never will, because the former two threads own read locks and aren't giving
> them up?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
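The deadlock hypothesis in the quoted description hinges on one property of a nonfair ReentrantReadWriteLock: once a writer is parked in the queue, a new reader thread cannot acquire the read lock, even though only readers currently hold it. A minimal standalone sketch of that behaviour (not Kafka code; the class and method names are illustrative, and the timed tryLock stands in for the blocked reader so the demo itself cannot deadlock):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class WriterPreferenceDemo {

    // Returns true if a fresh reader thread fails to take the read lock while
    // a writer is queued, even though only readers currently hold the lock.
    static boolean newReaderBlocksBehindQueuedWriter() {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(); // nonfair by default
        AtomicBoolean readerGotLock = new AtomicBoolean(true);
        try {
            lock.readLock().lock();              // main thread holds a read lock

            Thread writer = new Thread(() -> {
                lock.writeLock().lock();         // parks behind the held read lock
                lock.writeLock().unlock();
            });
            writer.start();
            while (lock.getQueueLength() == 0) { // wait until the writer is queued
                Thread.sleep(10);
            }

            // A second reader uses a timed tryLock, which honours the queue:
            // the parked writer keeps the new reader out, so it times out.
            Thread reader = new Thread(() -> {
                try {
                    boolean ok = lock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
                    readerGotLock.set(ok);
                    if (ok) lock.readLock().unlock();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            reader.start();
            reader.join();

            lock.readLock().unlock();            // release; the writer can now finish
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !readerGotLock.get();
    }

    public static void main(String[] args) {
        System.out.println("new reader blocked behind queued writer: "
                + newReaderBlocksBehindQueuedWriter());
    }
}
```

In the scenario suspected above, each request-handler thread plays the role of the second reader against the lock the other thread already holds, with kafka-request-handler-2 and kafka-scheduler-6 as the queued writers, so neither reader can proceed and neither writer can ever acquire.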