Well, it sounds like your app is getting stuck somewhere in the poll loop, so
it's unable to call poll() again within the session timeout, as the error
message indicates. The message is a bit misleading: it says "Sending
LeaveGroup request to coordinator", which implies it's *currently* sending
the LeaveGroup, but IIRC this error actually comes from the heartbeat
thread. That's just a long way of clarifying that the reason you don't see
the state go into REBALANCING is that the StreamThread is stuck and can't
rejoin the group by calling #poll.
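To make the mechanics concrete, here's a rough sketch in plain Kotlin (no
Kafka dependencies; the names and structure are mine, not the actual client
internals) of the check the heartbeat thread effectively performs:

```kotlin
// Sketch of the proactive-leave check done by the consumer's background
// heartbeat thread. Names are illustrative, not the real internals.
fun shouldSendLeaveGroup(
    lastPollMs: Long,        // wall-clock time of the last poll() call
    nowMs: Long,             // current wall-clock time
    maxPollIntervalMs: Long  // max.poll.interval.ms (default 300000)
): Boolean {
    // If the processing thread hasn't called poll() within
    // max.poll.interval.ms, the heartbeat thread assumes it's stuck and
    // proactively leaves the group -- but the stuck StreamThread never
    // reacts to this, so KafkaStreams keeps reporting RUNNING.
    return nowMs - lastPollMs > maxPollIntervalMs
}

fun main() {
    // Last poll 6 minutes ago with the default 5-minute interval:
    println(shouldSendLeaveGroup(0L, 360_000L, 300_000L)) // true
    // Last poll 1 minute ago:
    println(shouldSendLeaveGroup(0L, 60_000L, 300_000L))  // false
}
```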

So...what now? I know your question was how to detect this, but I would
recommend first taking a look at your application topology to figure out
where, and *why*, it's getting stuck (sorry for the "Q: how do I do X?
A: don't do X, do Y" StackOverflow-type response -- happy to help with
detection if we really can't resolve the underlying issue; I'll give it
some thought, since I can't think of any easy way to detect this off the
top of my head).
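That said, one detection idea you could experiment with (hedged -- I haven't
verified this end to end) is to look past KafkaStreams#state() and watch the
consumer's "last-poll-seconds-ago" metric, which should be reachable through
KafkaStreams#metrics(). Here's a minimal sketch of the health-check logic,
with the metric values modeled as a plain map so the example is
self-contained; double-check the metric name against your client version
before relying on it:

```kotlin
// Sketch of a liveness probe that treats the app as unhealthy when no
// StreamThread's consumer has polled recently. In a real app the map
// below would be populated from KafkaStreams#metrics(); here it's plain
// data. The "last-poll-seconds-ago" metric name is an assumption to
// verify against your Kafka client version.
fun isLikelyZombie(
    lastPollSecondsAgoByThread: Map<String, Double>,
    maxPollIntervalMs: Long
): Boolean {
    val thresholdSeconds = maxPollIntervalMs / 1000.0
    // If any thread's consumer hasn't polled within max.poll.interval.ms,
    // it has (or is about to have) left the group -- report unhealthy so
    // Kubernetes restarts the pod.
    return lastPollSecondsAgoByThread.values.any { it > thresholdSeconds }
}

fun main() {
    val healthy = mapOf("StreamThread-1" to 2.0)
    val stuck = mapOf("StreamThread-1" to 400.0)
    println(isLikelyZombie(healthy, 300_000L)) // false
    println(isLikelyZombie(stuck, 300_000L))   // true
}
```

Wired into your existing Kubernetes health endpoint, this would catch the
zombie case that KafkaStreams#state() misses.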

What does your topology look like? Can you figure out at what point it's
hanging? You may need to turn on DEBUG logging, or even TRACE, but given
how infrequent/random this is, I'm guessing that's off the table -- still,
DEBUG logging at least would help.
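If you do go the DEBUG route, you can scope it so the log volume stays
manageable -- e.g. a log4j.properties fragment along these lines (the
logger names are the standard Kafka packages; adapt to whatever logging
framework your app uses):

```
# Keep the root logger as-is, but turn up the Streams and consumer internals
log4j.logger.org.apache.kafka.streams=DEBUG
log4j.logger.org.apache.kafka.clients.consumer=DEBUG
```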

Do you have any custom processors, or anything in your topology that could
possibly fall into an infinite loop? If not, I would suspect it's related
to RocksDB -- but let's start with the other stuff before we go digging
into that.

Hope this helps,
Sophie

On Tue, Aug 16, 2022 at 1:06 PM Samuel Azcona <sazc...@itbit.com.invalid>
wrote:

> Hi guys, I'm having an issue with a Kafka Streams app: at some point I get
> a consumer leave-group message. It's exactly the same issue described by
> another person here:
>
>
> https://stackoverflow.com/questions/61245480/how-to-detect-a-kafka-streams-app-in-zombie-state
>
> But the problem is that the stream state keeps reporting that the stream
> is running even though it's not consuming anything, and the stream never
> rejoins the consumer group, so my application, which has only one replica,
> stops consuming.
>
> I have a health check on Kubernetes where I expose the stream state and
> then restart the pod.
> But since the Kafka Streams state is always RUNNING when the consumer
> leaves the group, the app still looks healthy in this zombie state, and I
> need to go and restart the pod manually.
>
> Is this a bug? Or is there a way to check the stream consumer's state so I
> can expose it as a health check for my application?
>
> This issue happens randomly, usually on Mondays. I'm using Kafka 2.8.1 and
> my app is written in Kotlin.
>
> This is the message I get before the zombie state; after that there are no
> exceptions, errors, or anything else until I restart the pod manually.
>
> Sending LeaveGroup request to coordinator
> b-3.c4.kafka.us-east-1.amazonaws.com:9098 (id: 2147483644 rack: null)
> due to consumer poll timeout has expired. This means the time between
> subsequent calls to poll() was longer than the configured
> max.poll.interval.ms, which typically implies that the poll loop is
> spending too much time processing messages. You can address this
> either by increasing max.poll.interval.ms or by reducing the maximum
> size of batches returned in poll() with max.poll.records.
>
> Thanks for the help.
>
