Re: [EXTERNAL] Re: Kafka Streams 3.5.1 based app seems to get stalled

2024-03-18 Thread Venkatesh Nagarajan
Thanks very much for sharing the links and for your important inputs, Bruno!

>  We recommend using as many stream threads as cores on the compute node 
> where the Kafka Streams client is run. How many Kafka Streams tasks do you 
> have to distribute over the clients?

We use 1 vCPU (probably 1 core) per Kafka Streams client (ECS task). Each 
client/ECS task runs 10 stream threads, and CPU utilisation is just 4% on 
average. It increases when transient errors occur, as they require retries 
and replacement of threads.

We run a maximum of 6 clients/ECS tasks when the offset lags are high. The 
input topics have 60 partitions each, which matches (total number of 
clients/ECS tasks, i.e. 6) * (stream threads per client/ECS task, i.e. 10).
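
For reference, here is a minimal sketch of how the 10 threads per client are 
configured (the application id and bootstrap servers below are placeholders):

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class StreamsThreadConfig {
        public static Properties props() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            // 10 stream threads per client (one ECS task);
            // 6 clients x 10 threads = 60 threads for the 60 input partitions
            props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 10);
            return props;
        }
    }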

With the one-stream-thread-per-core approach, we would need 60 vCPUs/cores. As 
I mentioned above, we have observed 10 threads using just 4% of 1 vCPU/core on 
average. It may be difficult to justify provisioning more cores, both because 
it will be expensive and because Kafka Streams recovers from failures in 
acquiring locks.

Please feel free to correct me and/or share your thoughts.

Thank you.

Kind regards,
Venkatesh

From: Bruno Cadonna 
Date: Friday, 15 March 2024 at 8:47 PM
To: users@kafka.apache.org 
Subject: Re: [EXTERNAL] Re: Kafka Streams 3.5.1 based app seems to get stalled
Hi Venkatesh,

As you discovered, in Kafka Streams 3.5.1 there is no stop-the-world
rebalancing.

Static group membership is helpful when Kafka Streams clients are restarted,
as you pointed out.
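
A minimal sketch of enabling static membership in a Kafka Streams app (the
instance id and timeout values are placeholders, not recommendations; the
usual application.id / bootstrap.servers settings would be set as well):

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.streams.StreamsConfig;

    public class StaticMembershipConfig {
        public static Properties props(String instanceId) {
            Properties props = new Properties();
            // stable across restarts and unique per client
            // (e.g. derived from the ECS task identity)
            props.put(StreamsConfig.consumerPrefix(
                ConsumerConfig.GROUP_INSTANCE_ID_CONFIG), instanceId);
            // give a restarted client time to rejoin before a rebalance
            props.put(StreamsConfig.consumerPrefix(
                ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), "60000");
            return props;
        }
    }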

> ERROR org.apache.kafka.streams.processor.internals.StandbyTask -
stream-thread [-StreamThread-1] standby-task [1_32] Failed to
acquire lock while closing the state store for STANDBY task

This error (and some others about lock acquisition) happens when a
stream thread wants to lock the state directory for a task but another
stream thread on the same Kafka Streams client has not released the lock
yet. And yes, Kafka Streams handles them.

30 and 60 stream threads is a lot for one Kafka Streams client. We
recommend using as many stream threads as cores on the compute node
where the Kafka Streams client is run. How many Kafka Streams tasks do
you have to distribute over the clients?
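
If it helps to see how the tasks are actually distributed, here is a minimal
sketch (assuming an already-started KafkaStreams instance) that prints the
active and standby tasks per stream thread:

    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.ThreadMetadata;

    public class TaskAssignmentLogger {
        // 'streams' is assumed to be an already-started KafkaStreams instance
        public static void print(KafkaStreams streams) {
            for (ThreadMetadata thread : streams.metadataForLocalThreads()) {
                System.out.println(thread.threadName()
                    + " active=" + thread.activeTasks()
                    + " standby=" + thread.standbyTasks());
            }
        }
    }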

> Would you consider this level of rebalancing to be normal?

The rate of rebalance events seems high indeed. However, the log
messages you posted in one of your last e-mails are normal during a
rebalance and they have nothing to do with METADATA_MAX_AGE_CONFIG.

I do not know the metric SumOffsetLag. Judging from a quick search on
the internet, I think it is an MSK-specific metric:
https://repost.aws/questions/QUthnU3gycT-qj3Mtb-ekmRA/msk-metric-sumoffsetlag-how-it-works
At that link you can also find some other metrics that you can use.
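
Independently of the MSK metric, a minimal sketch of computing per-partition
consumer lag with the AdminClient (the group id and bootstrap servers are
placeholders; for Kafka Streams the group id is the application.id):

    import java.util.Map;
    import java.util.Properties;
    import java.util.stream.Collectors;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.ListOffsetsResult;
    import org.apache.kafka.clients.admin.OffsetSpec;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class LagCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // placeholder bootstrap servers
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (Admin admin = Admin.create(props)) {
                // committed offsets of the group (application.id for Kafka Streams)
                Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("my-streams-app") // placeholder
                         .partitionsToOffsetAndMetadata().get();

                // latest (log-end) offsets for the same partitions
                Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> end =
                    admin.listOffsets(committed.keySet().stream()
                             .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())))
                         .all().get();

                committed.forEach((tp, offset) -> System.out.println(
                    tp + " lag=" + (end.get(tp).offset() - offset.offset())));
            }
        }
    }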

The following talk might help you debug your rebalance issues:

https://www.confluent.io/events/kafka-summit-london-2023/kafka-streams-rebalances-and-assignments-the-whole-story/


Best,
Bruno

On 3/14/24 11:11 PM, Venkatesh Nagarajan wrote:
> Just want to make a correction, Bruno - My understanding is that Kafka 
> Streams 3.5.1 uses Incremental Cooperative Rebalancing which seems to help 
> reduce the impact of rebalancing caused by autoscaling etc.:
>
> https://www.confluent.io/blog/incremental-cooperative-rebalancing-in-kafka/
>
> Static group membership may also have a role to play especially if the ECS 
> tasks get restarted for some reason.
>
>
> I also want to mention this error, which occurred 759 times during the 
> 13 hour load test:
>
> ERROR org.apache.kafka.streams.processor.internals.StandbyTask - 
> stream-thread [-StreamThread-1] standby-task [1_32] Failed to acquire 
> lock while closing the state store for STANDBY task
>
> I think Kafka Streams automatically recovers from this. Also, I have seen 
> this error increase when the number of stream threads is high (30 or 60 
> threads). So I use just 10 threads per ECS task.
>
> Kind regards,
> Venkatesh
>
> From: Venkatesh Nagarajan 
> Date: Friday, 15 March 2024 at 8:30 AM
> To: users@kafka.apache.org 
> Subject: Re: [EXTERNAL] Re: Kafka Streams 3.5.1 based app seems to get stalled
> Apologies for the delay in responding to you, Bruno. Thank you very much for 
> your important inputs.
>
> Just searched for log messages in the MSK broker logs pertaining to 
> rebalancing and updating of metadata for the consumer group and found 412 
> occurrences in a 13 hour period. During this time, a load test was run and 
> around 270k events were processed. Would you consider this level of 
> rebalancing to be normal?
>
> Also, I need to mention that when offset lags increase, autoscaling creates 
> additional ECS tasks to help with 

KRaft Migration and Kafka Controller behaviour

2024-03-18 Thread Sanaa Syed
Hello,

I've begun migrating some of my ZooKeeper-based Kafka clusters to KRaft. A
behaviour I've noticed twice, across two different Kafka cluster
environments, is that after provisioning a KRaft controller quorum in
migration mode, it is possible for a Kafka broker to become an active
controller alongside a KRaft controller.

For example, here are the steps I follow and the behaviour I notice (I'm
currently using Kafka v3.6):
1. Enable KRaft migration on the existing Kafka brokers (set the
`controller.quorum.voters`, `controller.listener.names` and
`zookeeper.metadata.migration.enable` configs in the server.properties
file); a rough sketch of these settings follows the steps below.
2. Deploy a KRaft controller statefulset and service with the migration
enabled so that data is copied over from ZooKeeper and we enter
dual-write mode.
3. After a few minutes, I see that the migration has completed (it's a
pretty small cluster). At this point, the KRaft controller pod has been
elected to be the controller (and I see this in ZooKeeper when I run `get
/controller`). If the Kafka brokers or KRaft controller pods are restarted
at any point after the migration is completed, a Kafka broker is elected to
be the controller, and this is reflected in ZooKeeper as well. Now, I have
two active controllers - one is a Kafka broker and one is a KRaft controller.
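
For reference, the broker-side settings from step 1 look roughly like this
(a sketch only; the node id, hostnames, ports and listener names below are
placeholders for the actual environment):

    # existing ZooKeeper-mode broker with migration enabled
    zookeeper.connect=zookeeper:2181
    zookeeper.metadata.migration.enable=true
    controller.quorum.voters=3000@kraft-controller-0:9093
    controller.listener.names=CONTROLLER
    listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
    inter.broker.protocol.version=3.6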

A couple of questions I have:
1. Is this the expected behaviour? If so, how long after a migration has
been completed should we hold off on restarting Kafka brokers to avoid this
situation?
2. Why is it possible for a Kafka broker to become the controller again
post-migration?
3. How do we get back to a state where a KRaft controller is the only
controller, in the least disruptive way possible?

Thank you,
Sanaa