RE: Re: kafka stream zombie state

2022-08-19 Thread Samuel Azcona
Hi Sophie, thanks for your reply, I'm a big fan of your videos.

About the issue: the log message comes from

> org.apache.kafka.clients.consumer.internals.AbstractCoordinator

After that message I don't see any other log or exception; the stream just
stops consuming messages.

As you mention, it does look like the heartbeat thread is stuck. I already
debugged my topology and was not able to reproduce the issue locally, but it
keeps happening quite frequently in production.

Our application has multiple streams running in the same app, each with a different

> StreamsConfig.APPLICATION_ID_CONFIG

like the following (rough sketch of the wiring below):
myapp-stream1
myapp-stream2
myapp-stream3
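
Just to illustrate the setup, the instances are created roughly like this; the
bootstrap address, topic names and serdes here are placeholders rather than our
real code:

import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.KafkaStreams
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.StreamsConfig
import java.util.Properties

// One KafkaStreams instance per logical stream, all in the same JVM/pod,
// each with its own application.id (and therefore its own consumer group).
fun startAllStreams(bootstrapServers: String): List<KafkaStreams> =
    listOf("myapp-stream1", "myapp-stream2", "myapp-stream3").map { appId ->
        val props = Properties().apply {
            put(StreamsConfig.APPLICATION_ID_CONFIG, appId)
            put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers)
            put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().javaClass)
            put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().javaClass)
        }
        val builder = StreamsBuilder()
        // Placeholder topology; each real stream builds its own (see further below).
        builder.stream<String, String>("input-$appId").to("output-$appId")
        KafkaStreams(builder.build(), props).also { it.start() }
    }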

The issue happens randomly: sometimes to one stream, other times to another.
For example, when it happens to myapp-stream2, the other streams keep working
fine. In some cases it affects only one stream, in other cases two or more.

Our topology contains multiple branches where, depending on some conditions, we
delegate to different handlers. We also join the stream with a GlobalKTable,
which is backed by a global store using RocksDB, as you mention.

We don't have anything like an infinite loop or synchronized functions.
Basically our topologies consume an event, create a new message from it by
joining with a GlobalKTable, and then send the new message out to another
topic, roughly like the sketch below.
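
To make that concrete, here is a rough sketch of the shape of one of these
topologies; the topic names, types, predicates and the value joiner are
invented for illustration and are not our actual code:

import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.Topology

// Rough shape of one of our topologies: consume events, branch on a condition,
// enrich each branch via a join against a GlobalKTable, forward the result.
fun buildTopology(): Topology {
    val builder = StreamsBuilder()

    // GlobalKTable backed by a RocksDB global store.
    val referenceData = builder.globalTable<String, String>("reference-topic")

    val events = builder.stream<String, String>("events-topic")

    // branch() stands in for our condition-based delegation to different handlers.
    val branches = events.branch(
        { _, value -> value.startsWith("A") },
        { _, _ -> true }  // everything else
    )

    branches.forEach { handlerStream ->
        handlerStream
            .join(referenceData,
                { key, _ -> key },                // map the event to the table's key
                { event, ref -> "$event|$ref" })  // build the enriched message
            .to("output-topic")
    }

    return builder.build()
}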


What I also see frequently are core dumps from RocksDB:
# JRE version: OpenJDK Runtime Environment Zulu11.58+15-CA (11.0.16+8) (build 
11.0.16+8-LTS)
# Java VM: OpenJDK 64-Bit Server VM Zulu11.58+15-CA (11.0.16+8-LTS, mixed mode, 
tiered, compressed oops, serial gc, linux-amd64)
# Problematic frame:
# C [librocksdbjni13283553706008007881.so+0x44f8b1] 
rocksdb::ThreadPoolImpl::Impl::Submit(std::function&&, 
std::function&&, void*)+0x1d1
 
I was able to reproduce the "Sending LeaveGroup request ..." message locally by
playing with MAX_POLL_INTERVAL_MS_CONFIG, setting it to 1 ms, but in that case
I see the thread rebalance and keep working fine, so I still have not been able
to reproduce the actual issue locally.

We already tried increasing MAX_POLL_INTERVAL_MS_CONFIG to 10 minutes in
production, but the issue still happens.
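
For reference, the override we experimented with looks roughly like this in the
streams properties (only the relevant lines; I'm using the plain ConsumerConfig
key, which Streams forwards to its consumers):

import org.apache.kafka.clients.consumer.ConsumerConfig
import java.util.Properties

// Only the override we experimented with; the rest of the streams config is omitted.
val pollIntervalOverride = Properties().apply {
    // Local reproduction of the LeaveGroup message: an absurdly low poll interval.
    // put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 1)

    // Production attempt: 10 minutes.
    put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 10 * 60 * 1000)
}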

We are using Kafka 2.8.1, and I see there is a bug report
(https://issues.apache.org/jira/browse/KAFKA-13310) affecting 2.8.1 and fixed
in 3.2.0, but I'm not sure whether my problem could be related to it?

We were able to create a hack that marks our application unhealthy when this
happens, so that Kubernetes restarts the pod.

We do it by capturing the log events from
org.apache.kafka.clients.consumer.internals.AbstractCoordinator, and when a
message contains "Sending LeaveGroup request to coordinator" we mark our app as
unhealthy (rough sketch below).

Without this, the app just stays in zombie mode without consuming and we don't
notice it.
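
In case it helps anyone else, the hack looks roughly like this. It assumes
logback as the slf4j backend, and the streamsHealthy flag and function names
are placeholders for whatever your health endpoint actually reads:

import ch.qos.logback.classic.LoggerContext
import ch.qos.logback.classic.spi.ILoggingEvent
import ch.qos.logback.core.AppenderBase
import org.slf4j.LoggerFactory
import java.util.concurrent.atomic.AtomicBoolean

// Flag read by the Kubernetes health endpoint (placeholder).
val streamsHealthy = AtomicBoolean(true)

// Appender that flips the flag when the consumer announces it is leaving the group.
class LeaveGroupDetector : AppenderBase<ILoggingEvent>() {
    override fun append(event: ILoggingEvent) {
        if (event.formattedMessage.contains("Sending LeaveGroup request to coordinator")) {
            streamsHealthy.set(false)
        }
    }
}

// Attach the detector to the coordinator's logger at application startup.
fun installLeaveGroupDetector() {
    val context = LoggerFactory.getILoggerFactory() as LoggerContext
    val detector = LeaveGroupDetector()
    detector.context = context
    detector.start()
    context.getLogger("org.apache.kafka.clients.consumer.internals.AbstractCoordinator")
        .addAppender(detector)
}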

We are also going to try enabling the Kafka logs at DEBUG/TRACE level in
production to see if we can get more useful logs.
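
For anyone wanting to flip the level without redeploying a logging config, and
again assuming logback as the slf4j backend, it can also be done
programmatically, something like:

import ch.qos.logback.classic.Level
import ch.qos.logback.classic.Logger
import org.slf4j.LoggerFactory

// Raise all org.apache.kafka loggers to DEBUG at runtime (logback assumed).
fun enableKafkaDebugLogging() {
    (LoggerFactory.getLogger("org.apache.kafka") as Logger).level = Level.DEBUG
}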

Thanks a lot Sophie.


On 2022/08/19 09:18:21 Sophie Blee-Goldman wrote:
> Well it sounds like your app is getting stuck somewhere in the poll loop so
> it's unable to call poll
> again within the session timeout, as the error message indicates -- it's a
> bit misleading as it says
> "Sending LeaveGroup request to coordinator" which implies it's
> *currently* sending
> the LeaveGroup,
> but IIRC this error actually comes from the heartbeat thread -- just a long
> way of clarifying that the
> reason you don't see the state go into REBALANCING is because the
> StreamThread is stuck and
> can't rejoin the group by calling #poll
> 
> So...what now? I know your question was how to detect this, but I would
> recommend first trying to
> take a look into your application topology to figure out where, and *why*,
> it's getting stuck (sorry for
> the "q: how do I do X? answ. don't do X, do Y" StackOverflow-type response
> -- happy to help with
> that if we really can't resolve the underlying issue, I'll give it some
> thought since I can't think of any
> easy way to detect this off the top of my head)
> 
> What does your topology look like? Can you figure out what point it's
> hanging? You may need to turn
> on  DEBUG logging, or even TRACE, but given how infrequent/random this is
> I'm guessing that's off
> the table -- still, DEBUG logging at least would help.
> 
> Do you have any custom processors? Or anything in your topology that could
> possibly fall into an
> infinite loop? If not, I would suspect it's related to rocksdb -- but let's
> start with the other stuff before
> we go digging into that
> 
> Hope this helps,
> Sophie
> 
> On Tue, Aug 16, 2022 at 1:06 PM Samuel Azcona 
> wrote:
> 
> > Hi Guys, I'm having an issue with a kafka stream app, at some point I get a
> > consumer leave group message. Exactly same issue described to another
> > person here:
> >
> >
> > https://stackoverflow.com/questions/61245480/how-to-detect-a-kafka-streams-app-in-zombie-state
> >
> > But the issue is that stream state is continuing reporting that the stream
> > is running, but

Re: kafka stream zombie state

2022-08-19 Thread Sophie Blee-Goldman
Well it sounds like your app is getting stuck somewhere in the poll loop so
it's unable to call poll
again within the session timeout, as the error message indicates -- it's a
bit misleading as it says
"Sending LeaveGroup request to coordinator" which implies it's
*currently* sending
the LeaveGroup,
but IIRC this error actually comes from the heartbeat thread -- just a long
way of clarifying that the
reason you don't see the state go into REBALANCING is because the
StreamThread is stuck and
can't rejoin the group by calling #poll

So...what now? I know your question was how to detect this, but I would
recommend first trying to
take a look into your application topology to figure out where, and *why*,
it's getting stuck (sorry for
the "q: how do I do X? answ. don't do X, do Y" StackOverflow-type response
-- happy to help with
that if we really can't resolve the underlying issue, I'll give it some
thought since I can't think of any
easy way to detect this off the top of my head)

What does your topology look like? Can you figure out what point it's
hanging? You may need to turn
on  DEBUG logging, or even TRACE, but given how infrequent/random this is
I'm guessing that's off
the table -- still, DEBUG logging at least would help.

Do you have any custom processors? Or anything in your topology that could
possibly fall into an
infinite loop? If not, I would suspect it's related to rocksdb -- but let's
start with the other stuff before
we go digging into that

Hope this helps,
Sophie

On Tue, Aug 16, 2022 at 1:06 PM Samuel Azcona 
wrote:

> Hi Guys, I'm having an issue with a kafka stream app, at some point I get a
> consumer leave group message. Exactly same issue described to another
> person here:
>
>
> https://stackoverflow.com/questions/61245480/how-to-detect-a-kafka-streams-app-in-zombie-state
>
> But the issue is that stream state is continuing reporting that the stream
> is running, but it's not consuming anything, but the stream never rejoin
> the consumer group, so my application with only one replica stop consuming.
>
> I have a health check on Kubernetes where I expose the stream state to then
> restart the pod.
> But as the kafka stream state it's always running when the consumer leaves
> the group, the app is still healthy in zombie state, so I need to manually
> go and restart the pod.
>
> Is this a bug? Or is there a way to check what is the stream consumer state
> to then expose as healt check for my application?
>
> This issue really happen randomly, usually all the Mondays. I'm using Kafka
> 2.8.1 and my app is made in kotlin.
>
> This is the message I get before zombie state, then there are no
> exceptions, errors or nothing more until I restart the pod manually.
>
> Sending LeaveGroup request to coordinator
> b-3.c4.kafka.us-east-1.amazonaws.com:9098 (id: 2147483644 rack: null)
> due to consumer poll timeout has expired. This means the time between
> subsequent calls to poll() was longer than the configured
> max.poll.interval.ms, which typically implies that the poll loop is
> spending too much time processing messages. You can address this
> either by increasing max.poll.interval.ms or by reducing the maximum
> size of batches returned in poll() with max.poll.records.
>
> Thanks for the help.
>


kafka stream zombie state

2022-08-16 Thread Samuel Azcona
Hi Guys, I'm having an issue with a kafka stream app, at some point I get a
consumer leave group message. Exactly same issue described to another
person here:

https://stackoverflow.com/questions/61245480/how-to-detect-a-kafka-streams-app-in-zombie-state

But the issue is that the stream state keeps reporting that the stream is
running, even though it's not consuming anything, and the stream never rejoins
the consumer group, so my application with only one replica stops consuming.

I have a health check on Kubernetes where I expose the stream state to then
restart the pod.
But since the Kafka Streams state is still RUNNING when the consumer leaves
the group, the app still looks healthy in this zombie state, so I need to go
and restart the pod manually.

Is this a bug? Or is there a way to check the stream's consumer state so I can
expose it as a health check for my application?
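
For context, the check I currently expose is basically just the KafkaStreams
state, along these lines (endpoint wiring omitted; names are illustrative):

import org.apache.kafka.streams.KafkaStreams

// The liveness check only looks at KafkaStreams.state(); after the LeaveGroup
// the state stays RUNNING, so this never reports the app as unhealthy.
fun isHealthy(streams: KafkaStreams): Boolean {
    val state = streams.state()
    return state == KafkaStreams.State.RUNNING || state == KafkaStreams.State.REBALANCING
}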

This issue happens quite randomly, usually on Mondays. I'm using Kafka 2.8.1
and my app is written in Kotlin.

This is the message I get right before the zombie state; after that there are
no exceptions, errors or anything else until I restart the pod manually.

Sending LeaveGroup request to coordinator
b-3.c4.kafka.us-east-1.amazonaws.com:9098 (id: 2147483644 rack: null)
due to consumer poll timeout has expired. This means the time between
subsequent calls to poll() was longer than the configured
max.poll.interval.ms, which typically implies that the poll loop is
spending too much time processing messages. You can address this
either by increasing max.poll.interval.ms or by reducing the maximum
size of batches returned in poll() with max.poll.records.

Thanks for the help.

