Raman Gupta created KAFKA-10229:
-----------------------------------
Summary: Kafka stream dies when earlier-shut-down node leaves group; no errors
logged on client
Key: KAFKA-10229
URL: https://issues.apache.org/jira/browse/KAFKA-10229
Project: Kafka
Issue Type: Bug
Components: streams
Affects Versions: 2.4.1
Reporter: Raman Gupta
My broker and clients are on 2.4.1, and I'm currently running a single broker. I have
a Kafka stream with exactly-once processing enabled, and an uncaught exception
handler defined on the client. I noticed that one stream was lagging, and upon
investigation I found that its consumer group was empty.
On restarting the consumers, the consumer group re-established itself, but
after about 8 minutes the group became empty again. Nothing is logged on the
client side about any stream errors, despite the uncaught exception handler
being in place.
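For reference, here is a minimal sketch of the kind of setup described above, assuming a typical Kafka Streams 2.4 application. The bootstrap server, topics, and topology are placeholders I'm inventing for illustration; only the application id is inferred from the consumer group name in the broker logs below.
```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class CisFileIndexerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Application id inferred from the group name in the broker log below.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "produs-cisFileIndexer-stream");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        // Exactly-once processing, as described above.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic"); // placeholder topology

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        // Uncaught exception handler, as described above: in 2.4.x this is the
        // java.lang.Thread.UncaughtExceptionHandler variant.
        streams.setUncaughtExceptionHandler((thread, throwable) ->
                System.err.println("Stream thread " + thread.getName() + " died: " + throwable));
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```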
In the broker logs, about 8 minutes after the clients restart and the stream goes
to the RUNNING state, I see:
```
[2020-07-02 17:34:47,033] INFO [GroupCoordinator 0]: Member
cis-d7fb64c95-kl9wl-1-630af77f-138e-49d1-b76a-6034801ee359 in group
produs-cisFileIndexer-stream has failed, removing it from the group
(kafka.coordinator.group.GroupCoordinator)
[2020-07-02 17:34:47,033] INFO [GroupCoordinator 0]: Preparing to rebalance
group produs-cisFileIndexer-stream in state PreparingRebalance with old
generation 228 (__consumer_offsets-3) (reason: removing member
cis-d7fb64c95-kl9wl-1-630af77f-138e-49d1-b76a-6034801ee359 on heartbeat
expiration) (kafka.coordinator.group.GroupCoordinator)
```
So according to this, the consumer heartbeat has expired. I don't know why that
would be: logging shows the stream was running and processing messages normally,
then simply stopped processing anything about 4 minutes before it died, with no
apparent errors or issues and nothing reported via the uncaught exception
handler.
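For context, heartbeat expiration is governed by the consumer liveness settings, which Kafka Streams forwards to its embedded consumers. Below is a sketch of how those settings could be overridden via `StreamsConfig.consumerPrefix` while diagnosing this; the values are illustrative assumptions on my part, not settings taken from this application.
```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public final class LivenessTuning {
    /** Illustrative overrides for the consumer liveness settings behind heartbeat expiration. */
    public static Properties withLivenessOverrides(Properties streamsProps) {
        // Heartbeats are sent from a background thread; missing them for session.timeout.ms
        // produces exactly the "removing member ... on heartbeat expiration" seen above.
        streamsProps.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 30_000);
        streamsProps.put(StreamsConfig.consumerPrefix(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG), 10_000);
        // max.poll.interval.ms bounds the time between poll() calls on a stream thread;
        // a thread that stops polling past this limit also leaves the group.
        streamsProps.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 600_000);
        return streamsProps;
    }
}
```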
It doesn't appear to be related to any specific poison-pill messages:
restarting the stream causes it to reprocess a bunch more messages from the
backlog and then die again approximately 8 minutes later. At the time of the
last message consumed by the stream, there are no logs at `INFO` level or above
on either the client or the broker, nor any errors whatsoever. Stream
consumption simply stops.
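Since nothing reaches the uncaught exception handler when this happens, one diagnostic option (an assumption on my part, not something already in place) is to register a state listener on the same KafkaStreams instance as in the sketch above, so any silent state transition around the 8-minute mark at least gets logged:
```java
// streams is the KafkaStreams instance from the earlier sketch;
// the listener must be registered before streams.start().
streams.setStateListener((newState, oldState) ->
        System.out.println("Streams state changed from " + oldState + " to " + newState));
```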
There are two consumers; even if I limit consumption to a single consumer, the
same thing happens.
The runtime environment is Kubernetes.