[ https://issues.apache.org/jira/browse/KAFKA-7802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16739567#comment-16739567 ]
Christoffer Hammarström edited comment on KAFKA-7802 at 1/10/19 4:15 PM:
-------------------------------------------------------------------------
Sounds like you might be affected by KAFKA-7697?

was (Author: kreiger):
Sounds like you might be affected by https://jira.apache.org/jira/browse/KAFKA-7697 ?

> Connection to Broker Disconnected Taking Down the Whole Cluster
> ---------------------------------------------------------------
>
>                 Key: KAFKA-7802
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7802
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.1.0
>            Reporter: Candice Wan
>            Priority: Critical
>         Attachments: thread_dump.log
>
> We recently upgraded to 2.1.0. Since then, several times per day, we have observed brokers being disconnected while other brokers were trying to fetch replicas from them. Each time, the issue took down the whole cluster, leaving all producers and consumers unable to publish or consume messages. It could be quickly fixed by restarting the problematic broker.
> Here is an example of what we are seeing in a broker that was trying to send a fetch request to the problematic one:
> 2019-01-09 08:05:10.445 [ReplicaFetcherThread-0-3] INFO o.a.k.clients.FetchSessionHandler - [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error sending fetch request (sessionId=937967566, epoch=1599941) to node 3: java.io.IOException: Connection to 3 was disconnected before the response was read.
> 2019-01-09 08:05:10.445 [ReplicaFetcherThread-1-3] INFO o.a.k.clients.FetchSessionHandler - [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=1] Error sending fetch request (sessionId=506217047, epoch=1375749) to node 3: java.io.IOException: Connection to 3 was disconnected before the response was read.
> 2019-01-09 08:05:10.445 [ReplicaFetcherThread-0-3] WARN kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={__consumer_offsets-11=(offset=421032847, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[178])}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=937967566, epoch=1599941))
> java.io.IOException: Connection to 3 was disconnected before the response was read
>     at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
>     at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:99)
>     at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:199)
>     at kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)
>     at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)
>     at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)
>     at scala.Option.foreach(Option.scala:257)
>     at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
>     at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
>     at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
>
> Below is the suspicious log from the problematic broker when the issue happened:
> 2019-01-09 08:04:50.177 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-2-7d46fda9-afef-4705-b632-17f0255d5045 in group talon-instance1 has failed, removing it from the group
> 2019-01-09 08:04:50.177 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to rebalance group talon-instance1 in state PreparingRebalance with old generation 270 (__consumer_offsets-47) (reason: removing member consumer-2-7d46fda9-afef-4705-b632-17f0255d5045 on heartbeat expiration)
> 2019-01-09 08:04:50.297 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-5-94b7eb6d-bc39-48ed-99b8-2e0f55edd60b in group Notifications.ASIA1546980352799 has failed, removing it from the group
> 2019-01-09 08:04:50.297 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to rebalance group Notifications.ASIA1546980352799 in state PreparingRebalance with old generation 1 (__consumer_offsets-44) (reason: removing member consumer-5-94b7eb6d-bc39-48ed-99b8-2e0f55edd60b on heartbeat expiration)
> 2019-01-09 08:04:50.297 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Group Notifications.ASIA1546980352799 with generation 2 is now empty (__consumer_offsets-44)
> 2019-01-09 08:04:50.388 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-3-0a4c55c2-9f31-4e7a-b0d7-1f057dceb03d in group talon-instance1 has failed, removing it from the group
> 2019-01-09 08:04:50.419 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-1-f7253f75-c626-47b1-842e-4eca3b0551c4 in group talon-kafka-vision has failed, removing it from the group
> 2019-01-09 08:04:50.419 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to rebalance group talon-kafka-vision in state PreparingRebalance with old generation 9 (__consumer_offsets-26) (reason: removing member consumer-1-f7253f75-c626-47b1-842e-4eca3b0551c4 on heartbeat expiration)
> 2019-01-09 08:04:50.419 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Group talon-kafka-vision with generation 10 is now empty (__consumer_offsets-26)
> 2019-01-09 08:04:50.419 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-2-5e7d051c-be6c-4893-bdaf-16ea180a54d9 in group talon-hades-instance1 has failed, removing it from the group
> 2019-01-09 08:04:50.419 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to rebalance group talon-hades-instance1 in state PreparingRebalance with old generation 122 (__consumer_offsets-11) (reason: removing member consumer-2-5e7d051c-be6c-4893-bdaf-16ea180a54d9 on heartbeat expiration)
> 2019-01-09 08:04:50.419 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Group talon-hades-instance1 with generation 123 is now empty (__consumer_offsets-11)
> 2019-01-09 08:04:50.422 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-4-a527e579-7a14-471b-b19d-ffec50074bb8 in group talon-instance1 has failed, removing it from the group
> 2019-01-09 08:04:50.434 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-4-0c470e05-5e9a-4cae-a493-9854a6d0c8a7 in group talon-instance1 has failed, removing it from the group
> 2019-01-09 08:04:50.514 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-2-155ea6c8-c90f-4af6-b65e-138a151d77d9 in group talon-instance1 has failed, removing it from the group
> 2019-01-09 08:04:55.297 [executor-Produce] WARN k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Failed to write empty metadata for group Notifications.ASIA1546980352799: The group is rebalancing, so a rejoin is needed.
> 2019-01-09 08:04:55.419 [executor-Produce] WARN k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Failed to write empty metadata for group talon-kafka-vision: The group is rebalancing, so a rejoin is needed.
> 2019-01-09 08:04:55.420 [executor-Produce] WARN k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Failed to write empty metadata for group talon-hades-instance1: The group is rebalancing, so a rejoin is needed.
>
> We also took a thread dump of the problematic broker (attached). We found that all the kafka-request-handler threads were hanging, waiting on locks, which points to a lock or resource leak; a diagnostic sketch for spotting such blocked handler threads is included after this report.
>
> The Java version we are running is 11.0.1.
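For reference, here is a minimal sketch of how the blocked kafka-request-handler threads could be spotted programmatically via the standard java.lang.management API, instead of reading a raw jstack dump. This is not from the original report: the class name and the thread-name filter are illustrative assumptions, and it would need to run inside the broker JVM (e.g. from a Java agent) or be adapted to a remote JMX connection.

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Hypothetical helper: lists request-handler threads stuck on a lock,
    // plus any lock cycles the JVM itself can prove are deadlocks.
    public class BlockedHandlerCheck {
        public static void main(String[] args) {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();

            // Report kafka-request-handler threads that are blocked or
            // waiting on a lock, the pattern seen in thread_dump.log.
            for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
                String name = info.getThreadName();
                Thread.State state = info.getThreadState();
                if (name.startsWith("kafka-request-handler")
                        && (state == Thread.State.BLOCKED
                            || state == Thread.State.WAITING
                            || state == Thread.State.TIMED_WAITING)) {
                    System.out.println(name + " is " + state
                            + " on " + info.getLockName());
                }
            }

            // findDeadlockedThreads() also covers ownable synchronizers
            // such as ReentrantReadWriteLock, the kind of lock implicated
            // in KAFKA-7697, not just synchronized-monitor deadlocks.
            long[] deadlocked = mx.findDeadlockedThreads();
            if (deadlocked != null) {
                for (ThreadInfo info : mx.getThreadInfo(deadlocked, Integer.MAX_VALUE)) {
                    System.out.print(info);
                }
            }
        }
    }

If the whole handler pool is parked like this, the broker stops servicing fetch and produce requests while its sockets stay open, which would be consistent with the cluster-wide stall described above.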