[ 
https://issues.apache.org/jira/browse/KAFKA-7802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732540#comment-17732540
 ] 

Lubos Hozzan commented on KAFKA-7802:
-------------------------------------

Hello.

The problem still persists. We are using version *3.4.1* in KRaft mode (3 
instances in a Kubernetes cluster). The warning looks like this (in fact, it is 
an avalanche of the same or very similar records from each Kafka instance):

{noformat}
[2023-06-13 12:21:25,771] WARN [ReplicaFetcher replicaId=0, leaderId=2, 
fetcherId=0] Error in response for fetch request (type=FetchRequest, 
replicaId=0, maxWait=500, minBytes=1, maxBytes=10485760, 
fetchData={tsttopic-1=PartitionData(topicId=4UDBTCegTPy5mdCbL5fLyg, 
fetchOffset=0, logStartOffset=0, maxBytes=1048576, 
currentLeaderEpoch=Optional[6], lastFetchedEpoch=Optional.empty)}, 
isolationLevel=READ_UNCOMMITTED, removed=, replaced=, 
metadata=(sessionId=INVALID, epoch=INITIAL), rackId=) 
(kafka.server.ReplicaFetcherThread)
{noformat}

Please focus on the {{metadata}}: *{{sessionId=INVALID, epoch=INITIAL}}*
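
If I read the incremental fetch session protocol (KIP-227) correctly, 
{{sessionId=INVALID}} together with {{epoch=INITIAL}} is what a follower sends 
while it has no established fetch session yet, i.e. a full fetch request asking 
the leader to create a new session. The fact that every instance keeps logging 
this initial metadata suggests the sessions never get established. As a rough 
illustration (a simplified model, not Kafka's actual classes; the zero values 
are how I understand the protocol encodes these two states):

{code:java}
// Simplified model (NOT Kafka's actual classes) of the fetch-session metadata
// printed in the warning, as I understand KIP-227:
//   sessionId=INVALID (0) + epoch=INITIAL (0)  =>  full fetch request,
//   asking the leader to create a brand-new incremental fetch session.
public final class FetchSessionMetadata {
    static final int INVALID_SESSION_ID = 0; // logged as "sessionId=INVALID"
    static final int INITIAL_EPOCH      = 0; // logged as "epoch=INITIAL"

    final int sessionId;
    final int epoch;

    FetchSessionMetadata(int sessionId, int epoch) {
        this.sessionId = sessionId;
        this.epoch = epoch;
    }

    /** What a follower sends while the leader has not assigned it a session. */
    static FetchSessionMetadata initial() {
        return new FetchSessionMetadata(INVALID_SESSION_ID, INITIAL_EPOCH);
    }

    /** True for the metadata seen in the warning above. */
    boolean isFullFetchRequest() {
        return sessionId == INVALID_SESSION_ID && epoch == INITIAL_EPOCH;
    }
}
{code}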

This problem begins in an empty cluster (i.e. the PVCs for the instance pods 
are empty) at first start. All instances are affected.

I attempted:
- restarting the instances one by one = no change
- stopping all instances at once and starting them again at once = no change
- stopping all instances and deleting the PVCs (the Kafka cluster starts empty 
again) = the problem sometimes disappeared

Is the problem in the stored data? Once the problem has disappeared, restarting 
the instances has no effect and the cluster keeps running fine. In other words, 
if the instances create their folder structure correctly, they work without any 
problems.
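
One way to cross-check the stored metadata from the outside is to ask the 
brokers what they believe about the topic and compare the topic ID with the one 
in the warning ({{4UDBTCegTPy5mdCbL5fLyg}}) and with the {{partition.metadata}} 
files on the brokers' volumes. A minimal sketch with the Java {{Admin}} client 
(the bootstrap address is a placeholder for our in-cluster service, and the 
exact result-accessor names are from memory, so they may need adjusting):

{code:java}
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class CheckTopicMetadata {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address for our in-cluster service.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");

        try (Admin admin = Admin.create(props)) {
            Map<String, TopicDescription> topics =
                    admin.describeTopics(List.of("tsttopic")).allTopicNames().get();
            TopicDescription d = topics.get("tsttopic");

            // This ID should match the one in the warning (4UDBTCegTPy5mdCbL5fLyg)
            // and the topic ID recorded in partition.metadata on each broker's volume.
            System.out.println("topicId = " + d.topicId());

            d.partitions().forEach(p -> System.out.println(
                    "partition " + p.partition()
                    + " leader=" + p.leader()
                    + " isr=" + p.isr()));
        }
    }
}
{code}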

Another strange thing, based on my observation, is the metrics, in particular 
{{BrokerTopicMetrics BytesInPerSec}}:

!BytesInput.png!

As you can see, while the problem is present (before 14:20), it looks as if the 
cluster has two leaders (the correct leader and some fake one) and both are 
generating the metrics. After the problem disappears (after 14:20), the metrics 
are emitted only from the one instance that was the elected leader at that time.
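
To confirm which instance is really reporting inbound bytes, the metric can 
also be read per broker directly over JMX instead of relying on the dashboard. 
A small sketch (the pod addresses and JMX port 9999 are placeholders for our 
deployment, and JMX has to be enabled on the brokers):

{code:java}
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BytesInPerBroker {
    public static void main(String[] args) throws Exception {
        // Placeholder pod addresses and JMX port; adjust for the real deployment.
        String[] brokers = {"kafka-0:9999", "kafka-1:9999", "kafka-2:9999"};
        // All-topics aggregate; add ",topic=tsttopic" to look at a single topic.
        ObjectName bytesIn =
                new ObjectName("kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec");

        for (String broker : brokers) {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + broker + "/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection conn = connector.getMBeanServerConnection();
                Object rate = conn.getAttribute(bytesIn, "OneMinuteRate");
                System.out.println(broker + " BytesInPerSec OneMinuteRate = " + rate);
            }
        }
    }
}
{code}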

I hope this helps get closer to solving the problem.

Best regards.

> Connection to Broker Disconnected Taking Down the Whole Cluster
> ---------------------------------------------------------------
>
>                 Key: KAFKA-7802
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7802
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.1.0
>            Reporter: Candice Wan
>            Priority: Critical
>         Attachments: BytesInput.png, thread_dump.log
>
>
> We recently upgraded to 2.1.0. Since then, several times per day, we have 
> observed some brokers being disconnected while other brokers were trying to 
> fetch replicas. This issue took down the whole cluster, leaving all the 
> producers and consumers unable to publish or consume messages. It could be 
> quickly fixed by restarting the problematic broker.
> Here is an example of what we're seeing in the broker which was trying to 
> send a fetch request to the problematic one:
> 2019-01-09 08:05:10.445 [ReplicaFetcherThread-0-3] INFO 
> o.a.k.clients.FetchSessionHandler - [ReplicaFetcher replicaId=1, leaderId=3, 
> fetcherId=0] Error sending fetch request (sessionId=937967566, epoch=1599941) 
> to node 3: java.io.IOException: Connection to 3 was disconnected before the 
> response was read.
>  2019-01-09 08:05:10.445 [ReplicaFetcherThread-1-3] INFO 
> o.a.k.clients.FetchSessionHandler - [ReplicaFetcher replicaId=1, leaderId=3, 
> fetcherId=1] Error sending fetch request (sessionId=506217047, epoch=1375749) 
> to node 3: java.io.IOException: Connection to 3 was disconnected before the 
> response was read.
>  2019-01-09 08:05:10.445 [ReplicaFetcherThread-0-3] WARN 
> kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=1, leaderId=3, 
> fetcherId=0] Error in response for fetch request (type=FetchRequest, 
> replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, 
> fetchData={__consumer_offsets-11=(offset=421032847, logStartOffset=0, 
> maxBytes=1048576, currentLeaderEpoch=Optional[178])}, 
> isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=937967566, 
> epoch=1599941))
>  java.io.IOException: Connection to 3 was disconnected before the response 
> was read
>  at 
> org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
>  at 
> kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:99)
>  at 
> kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:199)
>  at 
> kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)
>  at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)
>  at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)
>  at scala.Option.foreach(Option.scala:257)
>  at 
> kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
>  at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
>  at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
>  
>  
>  Below is the suspicious log of the problematic broker when the issue 
> happened:
> 2019-01-09 08:04:50.177 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member 
> consumer-2-7d46fda9-afef-4705-b632-17f0255d5045 in group talon-instance1 has 
> failed, removing it from the group
>  2019-01-09 08:04:50.177 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to 
> rebalance group talon-instance1 in state PreparingRebalance with old 
> generation 270 (__consumer_offsets-47) (reason: removing member 
> consumer-2-7d46fda9-afef-4705-b632-17f0255d5045 on heartbeat expiration)
>  2019-01-09 08:04:50.297 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member 
> consumer-5-94b7eb6d-bc39-48ed-99b8-2e0f55edd60b in group 
> Notifications.ASIA1546980352799 has failed, removing it from the group
>  2019-01-09 08:04:50.297 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to 
> rebalance group Notifications.ASIA1546980352799 in state PreparingRebalance 
> with old generation 1 (__consumer_offsets-44) (reason: removing member 
> consumer-5-94b7eb6d-bc39-48ed-99b8-2e0f55edd60b on heartbeat expiration)
>  2019-01-09 08:04:50.297 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Group 
> Notifications.ASIA1546980352799 with generation 2 is now empty 
> (__consumer_offsets-44)
>  2019-01-09 08:04:50.388 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member 
> consumer-3-0a4c55c2-9f31-4e7a-b0d7-1f057dceb03d in group talon-instance1 has 
> failed, removing it from the group
>  2019-01-09 08:04:50.419 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member 
> consumer-1-f7253f75-c626-47b1-842e-4eca3b0551c4 in group talon-kafka-vision 
> has failed, removing it from the group
>  2019-01-09 08:04:50.419 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to 
> rebalance group talon-kafka-vision in state PreparingRebalance with old 
> generation 9 (__consumer_offsets-26) (reason: removing member 
> consumer-1-f7253f75-c626-47b1-842e-4eca3b0551c4 on heartbeat expiration)
>  2019-01-09 08:04:50.419 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Group 
> talon-kafka-vision with generation 10 is now empty (__consumer_offsets-26)
>  2019-01-09 08:04:50.419 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member 
> consumer-2-5e7d051c-be6c-4893-bdaf-16ea180a54d9 in group 
> talon-hades-instance1 has failed, removing it from the group
>  2019-01-09 08:04:50.419 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to 
> rebalance group talon-hades-instance1 in state PreparingRebalance with old 
> generation 122 (__consumer_offsets-11) (reason: removing member 
> consumer-2-5e7d051c-be6c-4893-bdaf-16ea180a54d9 on heartbeat expiration)
>  2019-01-09 08:04:50.419 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Group 
> talon-hades-instance1 with generation 123 is now empty (__consumer_offsets-11)
>  2019-01-09 08:04:50.422 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member 
> consumer-4-a527e579-7a14-471b-b19d-ffec50074bb8 in group talon-instance1 has 
> failed, removing it from the group
>  2019-01-09 08:04:50.434 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member 
> consumer-4-0c470e05-5e9a-4cae-a493-9854a6d0c8a7 in group talon-instance1 has 
> failed, removing it from the group
>  2019-01-09 08:04:50.514 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member 
> consumer-2-155ea6c8-c90f-4af6-b65e-138a151d77d9 in group talon-instance1 has 
> failed, removing it from the group
>  2019-01-09 08:04:55.297 [executor-Produce] WARN 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Failed to write 
> empty metadata for group Notifications.ASIA1546980352799: The group is 
> rebalancing, so a rejoin is needed.
>  2019-01-09 08:04:55.419 [executor-Produce] WARN 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Failed to write 
> empty metadata for group talon-kafka-vision: The group is rebalancing, so a 
> rejoin is needed.
>  2019-01-09 08:04:55.420 [executor-Produce] WARN 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Failed to write 
> empty metadata for group talon-hades-instance1: The group is rebalancing, so 
> a rejoin is needed.
>  
> We also took a thread dump of the problematic broker (attached). We found that 
> all the kafka-request-handler threads were hanging, waiting for some locks, 
> which suggested a resource leak there.
>  
> The Java version we are running is 11.0.1.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
