Re: [EXTERNAL] Re: Kafka Streams 3.5.1 based app seems to get stalled

Bruno Cadonna Wed, 13 Mar 2024 02:22:07 -0700

Hi Venkatesh,

Extending on what Matthias replied, a metadata refresh might trigger a rebalance if the metadata changed. However, a metadata refresh that does not show a change in the metadata will not trigger a rebalance. In this context, i.e., config METADATA_MAX_AGE_CONFIG, the metadata is the metadata about the cluster received by the client.

The metadata mentioned in the log messages you posted is metadata of the group to which the member (a.k.a. consumer, a.k.a. client) belongs. The log message originates from the broker (in contrast, METADATA_MAX_AGE_CONFIG is a client config). If the rebalance were triggered by a cluster metadata change, the log message should contain something like "cached metadata has changed" as client reason [1].

Your log messages seem to be genuine log messages that are completely normal during rebalance events.
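
To tell these cases apart in a large broker log, the client reason can be pulled out of each GroupCoordinator line. The following is a minimal sketch in plain Java. The class and method names are made up for illustration, the regex only assumes the "client reason: ...)" layout seen in the messages quoted in this thread, and the "cached metadata has changed" substring is the approximate wording mentioned above, not a guaranteed exact match.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClientReason {
    // Captures the text between "client reason: " and the next ")".
    // Assumption: the reason itself contains no ")" character.
    private static final Pattern CLIENT_REASON =
            Pattern.compile("client reason: ([^)]*)\\)");

    /** Returns the client reason of a broker log line, if one is present. */
    public static Optional<String> extract(String logLine) {
        Matcher m = CLIENT_REASON.matcher(logLine);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    /** True if the client reason hints at a cluster metadata change (assumed wording). */
    public static boolean looksLikeClusterMetadataChange(String logLine) {
        return extract(logLine)
                .map(reason -> reason.contains("cached metadata has changed"))
                .orElse(false);
    }

    public static void main(String[] args) {
        String line = "INFO [GroupCoordinator 2]: Preparing to rebalance group <group> "
                + "in state PreparingRebalance with old generation nnnn (__consumer_offsets-nn) "
                + "(reason: Updating metadata for member <member> during Stable; "
                + "client reason: need to revoke partitions and re-join) "
                + "(kafka.coordinator.group.GroupCoordinator)";
        System.out.println(extract(line).orElse("<no client reason>"));
        System.out.println(looksLikeClusterMetadataChange(line));
    }
}
```

Counting lines per extracted reason would also give a rough answer to how often each kind of rebalance happens.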


How often do they happen?
What do you mean by stop-the-world rebalances?

Best,
Bruno

[1] https://github.com/apache/kafka/blob/f0087ac6a8a7b1005e9588e42b3679146bd3eb13/clients/src/main/java/org/apache/kafka/clients/consumer/internals/ConsumerCoordinator.java#L882C39-L882C66



On 3/13/24 2:34 AM, Venkatesh Nagarajan wrote:

Just want to share another variant of the log message which is also related to 
metadata and rebalancing but has a different client reason:

INFO [GroupCoordinator 3]: Preparing to rebalance group <group> in state 
PreparingRebalance with old generation nnn (__consumer_offsets-nn) (reason: Updating 
metadata for member <member> during Stable; client reason: triggered followup 
rebalance scheduled for 0) (kafka.coordinator.group.GroupCoordinator)

Thank you.

Kind regards,
Venkatesh

From: Venkatesh Nagarajan <venkatesh.nagara...@uts.edu.au>
Date: Wednesday, 13 March 2024 at 12:06 pm
To: users@kafka.apache.org <users@kafka.apache.org>
Subject: Re: [EXTERNAL] Re: Kafka Streams 3.5.1 based app seems to get stalled

Thanks very much for your important inputs, Matthias.

I will use the default METADATA_MAX_AGE_CONFIG. I set it to 5 hours when I saw 
a lot of such rebalancing related messages in the MSK broker logs:

INFO [GroupCoordinator 2]: Preparing to rebalance group <group> in state 
PreparingRebalance with old generation nnnn (__consumer_offsets-nn) (reason: Updating 
metadata for member <member> during Stable; client reason: need to revoke partitions 
and re-join) (kafka.coordinator.group.GroupCoordinator)

I am guessing that the two are unrelated. If you have any suggestions on how to 
reduce such rebalancing, that will be very helpful.

Thank you very much.

Kind regards,
Venkatesh

From: Matthias J. Sax <mj...@apache.org>
Date: Tuesday, 12 March 2024 at 1:31 pm
To: users@kafka.apache.org <users@kafka.apache.org>
Subject: [EXTERNAL] Re: Kafka Streams 3.5.1 based app seems to get stalled

Without detailed logs (maybe even DEBUG), it is hard to say.

But from what you describe, it could be a metadata issue? Why are you
setting

METADATA_MAX_AGE_CONFIG (consumer and producer): 5 hours in millis (to make 
rebalances rare)


Refreshing metadata has nothing to do with rebalances, and a metadata
refresh does not trigger a rebalance.



-Matthias


On 3/10/24 5:56 PM, Venkatesh Nagarajan wrote:

Hi all,

A Kafka Streams application sometimes stops consuming events during load 
testing. Please find below the details:

Details of the app:


    *   Kafka Streams Version: 3.5.1
    *   Kafka: AWS MSK v3.6.0
    *   Consumes events from 6 topics
    *   Calls APIs to enrich events
    *   Sometimes joins two streams
    *   Produces enriched events in output topics

Runs on AWS ECS:

    *   Each task has 10 streaming threads
    *   Autoscaling based on offset lags and a maximum of 6 ECS tasks
    *   Input topics have 60 partitions each to match 6 tasks * 10 threads
    *   Fairly good spread of events across all topic partitions using 
partitioning keys
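
The sizing above can be checked with a little arithmetic. This is only a rough sketch: it assumes an even spread of input partitions over stream threads and ignores internal topics and standby tasks. The class name is hypothetical, and how Kafka Streams groups partitions into tasks depends on the topology, which is not shown here.

```java
public class Capacity {
    static final int TOPICS = 6;                // input topics
    static final int PARTITIONS_PER_TOPIC = 60; // partitions per input topic
    static final int THREADS_PER_ECS_TASK = 10; // stream threads per ECS task

    /** Total number of input partitions across all input topics. */
    public static int totalPartitions() {
        return TOPICS * PARTITIONS_PER_TOPIC;
    }

    /** Total number of stream threads for a given number of running ECS tasks. */
    public static int totalThreads(int ecsTasks) {
        return ecsTasks * THREADS_PER_ECS_TASK;
    }

    /** Average input partitions per stream thread, assuming an even spread. */
    public static double partitionsPerThread(int ecsTasks) {
        return (double) totalPartitions() / totalThreads(ecsTasks);
    }

    public static void main(String[] args) {
        // Autoscaling runs between 1 and 6 ECS tasks.
        for (int ecsTasks = 1; ecsTasks <= 6; ecsTasks++) {
            System.out.printf("%d ECS task(s): %d threads, %.1f input partitions per thread%n",
                    ecsTasks, totalThreads(ecsTasks), partitionsPerThread(ecsTasks));
        }
    }
}
```

At the maximum of 6 ECS tasks this gives 60 threads for 360 input partitions, i.e. about 6 input partitions per thread on average.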

Settings and configuration:


    *   At least once semantics
    *   MAX_POLL_RECORDS_CONFIG: 10
    *   APPLICATION_ID_CONFIG

// Make rebalances rare and prevent stop-the-world rebalances

    *   Static membership (using GROUP_INSTANCE_ID_CONFIG)
    *   METADATA_MAX_AGE_CONFIG (consumer and producer): 5 hours in millis (to 
make rebalances rare)
    *   MAX_POLL_INTERVAL_MS_CONFIG: 20 minutes in millis
    *   SESSION_TIMEOUT_MS_CONFIG: 2 minutes in millis

State store related settings:

    *   TOPOLOGY_OPTIMIZATION_CONFIG: OPTIMIZE
    *   STATESTORE_CACHE_MAX_BYTES_CONFIG: 300 * 1024 * 1024L
    *   NUM_STANDBY_REPLICAS_CONFIG: 1
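
For reference, the settings listed above could be expressed roughly as follows. This is a sketch, not the actual application code: it uses plain string keys (the values behind the StreamsConfig and ConsumerConfig constants named above) so that it compiles without the Kafka jars, and the class name, application id and instance id are placeholders.

```java
import java.time.Duration;
import java.util.Properties;

public class StreamsSettingsSketch {
    /** Builds the settings from the lists above; both arguments are placeholders. */
    public static Properties build(String applicationId, String instanceId) {
        Properties p = new Properties();
        p.setProperty("application.id", applicationId);        // APPLICATION_ID_CONFIG
        p.setProperty("processing.guarantee", "at_least_once"); // at-least-once semantics
        p.setProperty("max.poll.records", "10");                // MAX_POLL_RECORDS_CONFIG

        // Static membership
        p.setProperty("group.instance.id", instanceId);         // GROUP_INSTANCE_ID_CONFIG

        // Unprefixed, so it applies to the consumer and the producer: 5 hours
        p.setProperty("metadata.max.age.ms",
                String.valueOf(Duration.ofHours(5).toMillis()));
        // 20 minutes
        p.setProperty("max.poll.interval.ms",
                String.valueOf(Duration.ofMinutes(20).toMillis()));
        // 2 minutes
        p.setProperty("session.timeout.ms",
                String.valueOf(Duration.ofMinutes(2).toMillis()));

        // State store related settings
        p.setProperty("topology.optimization", "all");          // StreamsConfig.OPTIMIZE
        p.setProperty("statestore.cache.max.bytes",
                String.valueOf(300L * 1024 * 1024));             // 300 MiB
        p.setProperty("num.standby.replicas", "1");
        return p;
    }

    public static void main(String[] args) {
        build("my-streams-app", "ecs-task-1").forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

In the real application these properties would be passed to the KafkaStreams constructor together with the topology.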


Symptoms:
The symptoms mentioned below occur during load tests:

Scenario# 1:
Steady input event stream

Observations:

    *   Gradually increasing offset lags which shouldn't happen normally as the 
streaming app is quite fast
    *   Events get processed

Scenario# 2:
No input events after the load test stops producing events

Observations:

    *   Offset lag stuck at ~5k
    *   Stable consumer group
    *   No events processed
    *   No errors or messages in the logs


Scenario# 3:
Restart the app when it stops processing events although offset lags are not 
zero

Observations:

    *   Offset lags start reducing and events start getting processed

Scenario# 4:
Transient errors occur while processing events


    *   A custom exception handler that implements 
StreamsUncaughtExceptionHandler returns 
StreamThreadExceptionResponse.REPLACE_THREAD in the handle method
    *   If transient errors keep occurring occasionally and threads get 
replaced, the problem of the app stalling disappears.
    *   But if transient errors don't occur, the app tends to stall and I need 
to manually restart it


Summary:

    *   It appears that some streaming threads stall after processing for a 
while.
    *   It is difficult to change log level for Kafka Streams from ERROR to 
INFO as it starts producing a lot of log messages especially during load tests.
    *   I haven't yet managed to push Kafka streams metrics into AWS OTEL 
collector to get more insights.

Can you please let me know if any Kafka Streams config settings need changing? 
Should I reduce the values of any of these settings to help trigger rebalancing 
early and hence assign partitions to members that are active:


    *   METADATA_MAX_AGE_CONFIG: 5 hours in millis (to make rebalances rare)
    *   MAX_POLL_INTERVAL_MS_CONFIG: 20 minutes in millis
    *   SESSION_TIMEOUT_MS_CONFIG: 2 minutes in millis

Should I get rid of static membership? This may increase rebalancing, but that 
may be okay if it prevents stalled threads from appearing as active members.

Should I try upgrading Kafka Streams to v3.6.0 or v3.7.0? Hoping that v3.7.0 
will be compatible with AWS MSK v3.6.0.


Thank you very much.

Kind regards,
Venkatesh

