[ https://issues.apache.org/jira/browse/KAFKA-6565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369568#comment-16369568 ]
Noam Berman commented on KAFKA-6565:
------------------------------------

Just for closure: another non-production cluster that was still on 0.10.1.0 crashed just now, and this was the stack trace:

JVM version is 25.65-b01
Deadlock Detection:

Found one Java-level deadlock:
=============================
"group-metadata-manager-0":
  waiting to lock Monitor@0x00007f634c0398b8 (Object@0x00000000a66603c8, a java/util/LinkedList),
  which is held by "kafka-request-handler-7"
"kafka-request-handler-7":
  waiting to lock Monitor@0x00007f6334062de8 (Object@0x00000000ab402e40, a kafka/coordinator/GroupMetadata),
  which is held by "group-metadata-manager-0"

Found a total of 1 deadlock.

Looks like it is indeed a duplicate bug. Thanks!
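The dump above is a lock-ordering inversion: the request-handler thread holds the LinkedList monitor and waits for the GroupMetadata lock, while the group-metadata-manager thread holds the GroupMetadata lock and waits for the LinkedList. A minimal, self-contained Java sketch of the same pattern follows; the class, field, and thread names are stand-ins for illustration, not the actual Kafka coordinator code:

{code:java}
import java.util.LinkedList;

// Illustrative reproduction of the lock-ordering deadlock reported above.
// The two objects stand in for the java/util/LinkedList and GroupMetadata
// monitors from the thread dump; this is not Kafka code.
public class LockOrderDeadlockSketch {

    private static final LinkedList<Runnable> pendingCallbacks = new LinkedList<>();
    private static final Object groupMetadata = new Object();

    public static void main(String[] args) {
        // Like "kafka-request-handler-7": takes the list lock first,
        // then tries to take the group metadata lock.
        Thread requestHandler = new Thread(new Runnable() {
            public void run() {
                synchronized (pendingCallbacks) {
                    sleepQuietly(100);
                    synchronized (groupMetadata) {
                        // never reached once the other thread holds groupMetadata
                    }
                }
            }
        }, "request-handler");

        // Like "group-metadata-manager-0": takes the group metadata lock first,
        // then tries to take the list lock.
        Thread metadataManager = new Thread(new Runnable() {
            public void run() {
                synchronized (groupMetadata) {
                    sleepQuietly(100);
                    synchronized (pendingCallbacks) {
                        // never reached once the other thread holds pendingCallbacks
                    }
                }
            }
        }, "group-metadata-manager");

        requestHandler.start();
        metadataManager.start();
        // Each thread now waits for the monitor the other one holds:
        // a Java-level deadlock.
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
{code}

Taking a jstack dump while the two threads are blocked produces a "Found one Java-level deadlock" report of the same shape as the one above.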
> Sudden unresponsiveness from broker + file descriptor leak
> ----------------------------------------------------------
>
>                 Key: KAFKA-6565
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6565
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.1.0
>         Environment: 3 separate clusters running 0.10.1.0 with 5 nodes + 5 zookeepers.
> clients are mostly java 0.10.2.1, and a few nodejs clients (https://github.com/oleksiyk/kafka).
> throughput per broker: ~3.5MB per second. each broker is a leader of ~2500 partitions.
> Debian GNU/Linux 8.8 (jessie)
> java version "1.8.0_65"
> c4.4xlarge, 1500gb gp2 disks
>            Reporter: Noam Berman
>            Priority: Blocker
>
> Running 3 kafka clusters v0.10.1.0 on aws, 5 brokers each (+5 zookeepers). In the past week, about 2-3 times a day (on all clusters, no specific brokers), we encounter a situation in which a single broker stops responding to all requests from clients and other brokers. At that moment all produced messages fail, and the other brokers start writing this to the log repeatedly:
>
> [2018-02-15 17:56:08,949] WARN [ReplicaFetcherThread-0-878], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@772ad40c (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 878 was disconnected before the response was read
>     at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
>     at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
>     at scala.Option.foreach(Option.scala:257)
>     at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
>     at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
>     at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
>     at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
>     at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
>     at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
>     at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
>     at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
>     at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
>     at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
>     at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
>
> Producers (examples):
> 2018-02-13 21:30:51:010 thread=[kafka-producer-network-thread | catalog-index-0] WARN Got error produce response with correlation id 8147 on topic-partition catalog-1, retrying (49 attempts left). Error: REQUEST_TIMED_OUT
> 2018-02-13 21:31:06:221 thread=[kafka-producer-network-thread | catalog-index-0] WARN Got error produce response with correlation id 8166 on topic-partition catalog-index-18, retrying (48 attempts left). Error: NOT_LEADER_FOR_PARTITION
>
> Once this happens, file descriptors on the affected broker start increasing at a high rate until they reach the maximum.
> There are no relevant info logs on the affected broker; we'll try to gather debug logs and attach them to the bug the next time it happens.
> The big issue here is that although the broker is unresponsive, it stays in the cluster and its zookeeper node isn't cleared, so there is no rebalance whatsoever - all clients keep failing until a manual shutdown occurs.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
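On the file-descriptor symptom described in the report, a small watchdog can make the leak visible well before the limit is reached. Below is a minimal sketch, assuming a HotSpot/OpenJDK JVM on Linux where the platform MXBean implements com.sun.management.UnixOperatingSystemMXBean; the 80% threshold and 10-second interval are arbitrary illustrations, and the sketch reports on the JVM it runs inside (watching a separate broker process would mean attaching over JMX or reading /proc/<pid>/fd instead):

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

// Minimal sketch: periodically log open vs. maximum file descriptors of this
// JVM so a leak like the one described above shows up before the limit is hit.
// Threshold and interval are illustrative only.
public class FdWatchdog {
    public static void main(String[] args) throws InterruptedException {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (!(os instanceof UnixOperatingSystemMXBean)) {
            System.err.println("Open-FD counts are only exposed on Unix-like JVMs");
            return;
        }
        UnixOperatingSystemMXBean unixOs = (UnixOperatingSystemMXBean) os;
        while (true) {
            long open = unixOs.getOpenFileDescriptorCount();
            long max = unixOs.getMaxFileDescriptorCount();
            System.out.printf("open fds: %d / %d%n", open, max);
            if (open > max * 0.8) {
                System.err.println("WARNING: over 80% of available file descriptors in use");
            }
            Thread.sleep(10_000L); // poll every 10 seconds
        }
    }
}
{code}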