Re: [EXTERNAL] Re: Kafka Streams 3.5.1 based app seems to get stalled

Venkatesh Nagarajan Tue, 12 Mar 2024 18:07:42 -0700

Thank you very much for your valuable input, Matthias.

I will use the default METADATA_MAX_AGE_CONFIG. I had set it to 5 hours after 
seeing many rebalance-related messages like this one in the MSK broker logs:


INFO [GroupCoordinator 2]: Preparing to rebalance group <group> in state 
PreparingRebalance with old generation nnnn (__consumer_offsets-nn) (reason: 
Updating metadata for member <member> during Stable; client reason: need to 
revoke partitions and re-join) (kafka.coordinator.group.GroupCoordinator)

I am guessing that the two are unrelated. If you have any suggestions on how to 
reduce such rebalances, that would be very helpful.

Thank you very much.

Kind regards,
Venkatesh

From: Matthias J. Sax <mj...@apache.org>
Date: Tuesday, 12 March 2024 at 1:31 pm
To: users@kafka.apache.org <users@kafka.apache.org>
Subject: [EXTERNAL] Re: Kafka Streams 3.5.1 based app seems to get stalled

Without detailed logs (maybe even DEBUG), it is hard to say.

But from what you describe, it could be a metadata issue? Why are you
setting

> METADATA_MAX_AGE_CONFIG (consumer and producer): 5 hours in millis (to make 
> rebalances rare)

Refreshing metadata has nothing to do with rebalances, and a metadata
refresh does not trigger a rebalance.
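
For illustration, here is a minimal sketch of the timing settings from the
original message with METADATA_MAX_AGE_CONFIG left unset, so the client
default (5 minutes) applies. It uses the plain string config keys and
java.util.Properties so that it runs without the Kafka client on the
classpath. The class name, the method name, and the instance id value are
hypothetical.

```java
import java.time.Duration;
import java.util.Properties;

class RebalanceSettingsSketch {

    // Builds the settings listed in the original message using plain string
    // config keys. metadata.max.age.ms is deliberately NOT set, so the
    // client default applies.
    static Properties build(String instanceId) {
        Properties props = new Properties();
        // Static membership: a stable id per instance (hypothetical value).
        props.put("group.instance.id", instanceId);
        props.put("max.poll.records", "10");
        // 20 minutes and 2 minutes, converted to milliseconds.
        props.put("max.poll.interval.ms",
                String.valueOf(Duration.ofMinutes(20).toMillis()));
        props.put("session.timeout.ms",
                String.valueOf(Duration.ofMinutes(2).toMillis()));
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("ecs-task-1");
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

In a Kafka Streams application, these entries would typically be merged into
the Properties passed to the KafkaStreams constructor. That wiring is not
shown here.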



-Matthias


On 3/10/24 5:56 PM, Venkatesh Nagarajan wrote:
> Hi all,
>
> A Kafka Streams application sometimes stops consuming events during load 
> testing. Please find below the details:
>
> Details of the app:
>
>
>    *   Kafka Streams Version: 3.5.1
>    *   Kafka: AWS MSK v3.6.0
>    *   Consumes events from 6 topics
>    *   Calls APIs to enrich events
>    *   Sometimes joins two streams
>    *   Produces enriched events in output topics
>
> Runs on AWS ECS:
>
>    *   Each task has 10 streaming threads
>    *   Autoscaling based on offset lags and a maximum of 6 ECS tasks
>    *   Input topics have 60 partitions each to match 6 tasks * 10 threads
>    *   Fairly good spread of events across all topic partitions using 
> partitioning keys
>
> Settings and configuration:
>
>
>    *   At least once semantics
>    *   MAX_POLL_RECORDS_CONFIG: 10
>    *   APPLICATION_ID_CONFIG
>
> // Make rebalances rare and prevent stop-the-world rebalances
>
>    *   Static membership (using GROUP_INSTANCE_ID_CONFIG)
>    *   METADATA_MAX_AGE_CONFIG (consumer and producer): 5 hours in millis (to 
> make rebalances rare)
>    *   MAX_POLL_INTERVAL_MS_CONFIG: 20 minutes in millis
>    *   SESSION_TIMEOUT_MS_CONFIG: 2 minutes in millis
>
> State store related settings:
>
>    *   TOPOLOGY_OPTIMIZATION_CONFIG: OPTIMIZE
>    *   STATESTORE_CACHE_MAX_BYTES_CONFIG: 300 * 1024 * 1024L
>    *   NUM_STANDBY_REPLICAS_CONFIG: 1
>
>
> Symptoms:
> The symptoms mentioned below occur during load tests:
>
> Scenario #1:
> Steady input event stream
>
> Observations:
>
>    *   Gradually increasing offset lags which shouldn't happen normally as 
> the streaming app is quite fast
>    *   Events get processed
>
> Scenario #2:
> No input events after the load test stops producing events
>
> Observations:
>
>    *   Offset lag stuck at ~5k
>    *   Stable consumer group
>    *   No events processed
>    *   No errors or messages in the logs
>
>
> Scenario #3:
> Restart the app when it stops processing events although offset lags are not 
> zero
>
> Observations:
>
>    *   Offset lags start reducing and events start getting processed
>
> Scenario #4:
> Transient errors occur while processing events
>
>
>    *   A custom exception handler that implements 
> StreamsUncaughtExceptionHandler returns 
> StreamThreadExceptionResponse.REPLACE_THREAD in the handle method
>    *   If transient errors keep occurring occasionally and threads get 
> replaced, the problem of the app stalling disappears.
>    *   But if transient errors don't occur, the app tends to stall and I need 
> to manually restart it
>
>
> Summary:
>
>    *   It appears that some streaming threads stall after processing for a 
> while.
>    *   It is difficult to change log level for Kafka Streams from ERROR to 
> INFO as it starts producing a lot of log messages especially during load 
> tests.
>    *   I haven't yet managed to push Kafka Streams metrics into the AWS OTEL 
> collector to get more insights.
>
> Can you please let me know if any Kafka Streams config settings need 
> changing? Should I reduce the values of any of these settings to help trigger 
> rebalancing early and hence assign partitions to members that are active:
>
>
>    *   METADATA_MAX_AGE_CONFIG: 5 hours in millis (to make rebalances rare)
>    *   MAX_POLL_INTERVAL_MS_CONFIG: 20 minutes in millis
>    *   SESSION_TIMEOUT_MS_CONFIG: 2 minutes in millis
>
> Should I get rid of static membership? This may increase rebalancing, but 
> that may be okay if it prevents stalled threads from appearing as active 
> members.
>
> Should I try upgrading Kafka Streams to v3.6.0 or v3.7.0? Hoping that v3.7.0 
> will be compatible with AWS MSK v3.6.0.
>
>
> Thank you very much.
>
> Kind regards,
> Venkatesh
