[ 
https://issues.apache.org/jira/browse/KAFKA-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226608#comment-17226608
 ] 

Eran Levy edited comment on KAFKA-10643 at 11/5/20, 9:42 AM:
-------------------------------------------------------------

[~ableegoldman] thanks, and sorry for the delayed response - I was waiting for it 
to happen again so I could make sure I provide the correct logs. 

re the DisconnectException - thanks.

I'm concerned about 2 specific issues, which might not be related to each 
other:

First issue - rebalancing suddenly triggers and it takes a long time to finish 
(I suspect it might be related to some network issue):

1. Attached are the logs for the broker and client (client name: d-8)

2. It happened again on 4/11 at 14:53

3. You can see in the client logs that the last line of normal processing was 
logged at 14:52:33.998. Right after that, from 14:53:32.030, the only logs that 
show up (with some time between them) are the kafka-producer-network-thread 
logs, which looks very suspicious. Only around 15:00:32.147 do I see logs 
related to the state store.

4. In the broker logs around 15:00:18 I see the following re this client d-8 - 
reason: removing member d-8-2-00941765-25cb-4161-b3ca-f941785d9bc4 on heartbeat 
expiration 

5. Right after that I see all the other clients (d-2 and others) with the 
following log: "...in group d has failed, removing it from the group..."

6. For the same broker (you can see it in the attached broker logs) I see some 
logs about its eth1 around this time - "Nov 4 15:08:02 ip-3-3-3-3 
ec2net: [rewrite_rules] Rewriting rules for eth1"

It looks like d-8 hasn't responded to heartbeats, right? Do you observe anything 
else happening here?

What I see from the client network metrics is that 
container_network_receive_bytes_total and 
container_network_transmit_bytes_total have been very low since 14:53.
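
To cross-check the heartbeat theory from the client side, I plan to dump the 
coordinator-related metrics of the embedded consumers around the incident time. 
A rough sketch of what I mean (just illustrative - the class name and the 
filtering by metric name are my own, not something the app does today):

{code:java}
import java.util.Map;

import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;

public class CoordinatorMetricsDump {

    // Print heartbeat/rebalance related metrics exposed by the KafkaStreams client,
    // e.g. the consumer-coordinator-metrics of the embedded consumers.
    public static void dump(final KafkaStreams streams) {
        for (final Map.Entry<MetricName, ? extends Metric> entry : streams.metrics().entrySet()) {
            final MetricName name = entry.getKey();
            if (name.name().contains("heartbeat") || name.name().contains("rebalance")) {
                System.out.printf("%s / %s = %s%n",
                    name.group(), name.name(), entry.getValue().metricValue());
            }
        }
    }
}
{code}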

In addition to what I wrote above re static membership, etc. - the session 
timeout ms is {{180000}}.
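
For context, this is roughly how such a setup looks (a minimal sketch - the 
class name, application id and bootstrap servers below are placeholders, not 
our exact values, and the configs could also be set without the consumer 
prefix):

{code:java}
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsProps {

    // Sketch of the relevant configs: static membership via group.instance.id
    // plus the 180000 ms session timeout mentioned above. Values are placeholders.
    public static Properties build(final String stablePodName) {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "d-stream");        // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder bootstrap
        // static membership: a stable instance id per pod (e.g. the StatefulSet pod name)
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG), stablePodName);
        // session timeout ms = 180000, as noted above
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 180000);
        return props;
    }
}
{code}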

Second issue -

I'm a bit worried about something that I see happening all the time since we 
started using static membership - it looks like there is some specific client 
(not always the same one, it can be d-0, d-1, etc.) whose StreamThreads keep 
getting into a repetitive preparing-for-rebalance stage, with the following 
logs (see also the sketch right after them):

{"timestamp":"2020-11-05T09:32:03.361Z","level":"INFO","thread":"***-27d55923-f5a4-4af9-b277-09436b6aecb7-StreamThread-1","logger":"org.apache.kafka.streams.KafkaStreams","message":"stream-client
 [***-27d55923-f5a4-4af9-b277-09436b6aecb7] State transition from RUNNING to 
REBALANCING","context":"default"} 
\{"timestamp":"2020-11-05T09:35:53.081Z","level":"INFO","thread":"***-27d55923-f5a4-4af9-b277-09436b6aecb7-StreamThread-1","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread
 [***-27d55923-f5a4-4af9-b277-09436b6aecb7-StreamThread-1] State transition 
from PARTITIONS_ASSIGNED to RUNNING","context":"default"} 
\{"timestamp":"2020-11-05T09:32:21.737Z","level":"INFO","thread":"***-12cc6d3d-e8f2-43ad-9319-729c3a7427b8-StreamThread-1","logger":"org.apache.kafka.streams.KafkaStreams","message":"stream-client
 [***-12cc6d3d-e8f2-43ad-9319-729c3a7427b8] State transition from REBALANCING 
to RUNNING","context":"default"} 
\{"timestamp":"2020-11-05T09:32:19.803Z","level":"INFO","thread":"main","logger":"org.apache.kafka.streams.processor.internals.assignment.AssignorConfiguration","message":"stream-thread
 [***-12cc6d3d-e8f2-43ad-9319-729c3a7427b8-StreamThread-6-consumer] Cooperative 
rebalancing enabled now","context":"default"}
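
To quantify how often these clients bounce through REBALANCING and for how 
long, I'm thinking of registering a state listener on each KafkaStreams 
instance - a rough sketch (the class name and the printf logging are just 
illustrative):

{code:java}
import java.util.concurrent.atomic.AtomicLong;

import org.apache.kafka.streams.KafkaStreams;

public class RebalanceTracker {

    private final AtomicLong rebalanceStartMs = new AtomicLong(-1L);

    // Must be called before streams.start().
    public void register(final KafkaStreams streams) {
        streams.setStateListener((newState, oldState) -> {
            if (newState == KafkaStreams.State.REBALANCING) {
                rebalanceStartMs.set(System.currentTimeMillis());
                System.out.printf("entered REBALANCING (from %s)%n", oldState);
            } else if (oldState == KafkaStreams.State.REBALANCING) {
                final long start = rebalanceStartMs.getAndSet(-1L);
                final long tookMs = start > 0 ? System.currentTimeMillis() - start : -1L;
                System.out.printf("left REBALANCING -> %s after %d ms%n", newState, tookMs);
            }
        });
    }
}
{code}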

 

 



> Static membership - repetitive PreparingRebalance with updating metadata for 
> member reason
> ------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-10643
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10643
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.6.0
>            Reporter: Eran Levy
>            Priority: Major
>         Attachments: broker-4-11.csv, client-4-11.csv
>
>
> Kafka streams 2.6.0, brokers version 2.6.0. Kafka nodes are healthy, kafka 
> streams app is healthy. 
> Configured with static membership. 
> Every 10 minutes (I assume because of topic.metadata.refresh.interval.ms), I 
> see the following group coordinator log for different stream consumers: 
> INFO [GroupCoordinator 2]: Preparing to rebalance group **--**-stream in 
> state PreparingRebalance with old generation 12244 (__consumer_offsets-45) 
> (reason: Updating metadata for member 
> ****-stream-11-1-013edd56-ed93-4370-b07c-1c29fbe72c9a) 
> (kafka.coordinator.group.GroupCoordinator)
> and right after that the following log: 
> INFO [GroupCoordinator 2]: Assignment received from leader for group 
> **-**-stream for generation 12246 (kafka.coordinator.group.GroupCoordinator)
>  
> I looked a bit at the kafka code and I'm not sure I get why such a thing is 
> happening - does this line describe the situation that happens here re the 
> "reason:"? [https://github.com/apache/kafka/blob/7ca299b8c0f2f3256c40b694078e422350c20d19/core/src/main/scala/kafka/coordinator/group/GroupCoordinator.scala#L311]
> I also don't see it happening very often in other kafka streams applications 
> that we have. 
> The only suspicious thing that I see is that around every hour different pods 
> of that kafka streams application throw this exception: 
> {"timestamp":"2020-10-25T06:44:20.414Z","level":"INFO","thread":"**-**-stream-94561945-4191-4a07-ac1b-07b27e044402-StreamThread-1","logger":"org.apache.kafka.clients.FetchSessionHandler","message":"[Consumer
>  
> clientId=**-**-stream-94561945-4191-4a07-ac1b-07b27e044402-StreamThread-1-restore-consumer,
>  groupId=null] Error sending fetch request (sessionId=34683236, epoch=2872) 
> to node 
> 3:","context":"default","exception":"org.apache.kafka.common.errors.DisconnectException:
>  null\n"}
> I came across this strange behaviour after I started to investigate a strange 
> stuck rebalancing state, after one of the members left the group and caused 
> the rebalance to get stuck - the only thing I found is that maybe, because of 
> these too-frequent preparing-to-rebalance states, the app might be affected 
> by this bug - KAFKA-9752?
> I don't understand why it happens; it wasn't happening before I applied 
> static membership to that kafka streams application (around 2 weeks ago). 
> Will be happy if you can help me.
>  
>  


