Without detailed logs (maybe even at DEBUG level), it is hard to say.

But from what you describe, it could be a metadata issue. Why are you setting the following?

METADATA_MAX_AGE_CONFIG (consumer and producer): 5 hours in millis (to make rebalances rare)

Refreshing metadata has nothing to do with rebalances, and a metadata refresh does not trigger a rebalance.
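
To make the point concrete, here is a minimal sketch in plain Java of the timeout settings from the original message with the metadata override simply left out, so that the clients fall back to the default `metadata.max.age.ms` of 5 minutes. The class name `StreamsTimeoutSketch` and the `build()` helper are hypothetical. The keys are the plain config strings rather than the `*_CONFIG` constants, so the snippet runs without the Kafka jars. Keeping the other values as quoted is an assumption made for illustration, not a tuning recommendation.

```java
import java.time.Duration;
import java.util.Properties;

public class StreamsTimeoutSketch {

    // Client default for metadata.max.age.ms (5 minutes), shown for comparison only.
    static final long DEFAULT_METADATA_MAX_AGE_MS = Duration.ofMinutes(5).toMillis();

    // Builds the settings quoted in the original message, minus the metadata override.
    static Properties build() {
        Properties props = new Properties();
        props.put("max.poll.records", "10");
        props.put("max.poll.interval.ms",
                String.valueOf(Duration.ofMinutes(20).toMillis()));
        props.put("session.timeout.ms",
                String.valueOf(Duration.ofMinutes(2).toMillis()));
        // Deliberately no "metadata.max.age.ms" entry: a metadata refresh does not
        // trigger a rebalance, so raising it does not make rebalances rarer.
        return props;
    }

    public static void main(String[] args) {
        Properties props = build();
        props.forEach((key, value) -> System.out.println(key + " = " + value));
        System.out.println("metadata.max.age.ms left at default: "
                + DEFAULT_METADATA_MAX_AGE_MS);
    }
}
```

In a real application these entries would go into the same `Properties` object that is passed to `KafkaStreams`, alongside `application.id` and the other settings.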

-Matthias

On 3/10/24 5:56 PM, Venkatesh Nagarajan wrote:

Hi all,

A Kafka Streams application sometimes stops consuming events during load testing. Please find below the details:

Details of the app:
* Kafka Streams Version: 3.5.1
* Kafka: AWS MSK v3.6.0
* Consumes events from 6 topics
* Calls APIs to enrich events
* Sometimes joins two streams
* Produces enriched events in output topics

Runs on AWS ECS:
* Each task has 10 streaming threads
* Autoscaling based on offset lags and a maximum of 6 ECS tasks
* Input topics have 60 partitions each to match 6 tasks * 10 threads
* Fairly good spread of events across all topic partitions using partitioning keys

Settings and configuration:
* At least once semantics
* MAX_POLL_RECORDS_CONFIG: 10
* APPLICATION_ID_CONFIG

// Make rebalances rare and prevent stop-the-world rebalances
* Static membership (using GROUP_INSTANCE_ID_CONFIG)
* METADATA_MAX_AGE_CONFIG (consumer and producer): 5 hours in millis (to make rebalances rare)
* MAX_POLL_INTERVAL_MS_CONFIG: 20 minutes in millis
* SESSION_TIMEOUT_MS_CONFIG: 2 minutes in millis

State store related settings:
* TOPOLOGY_OPTIMIZATION_CONFIG: OPTIMIZE
* STATESTORE_CACHE_MAX_BYTES_CONFIG: 300 * 1024 * 1024L
* NUM_STANDBY_REPLICAS_CONFIG: 1

Symptoms:
The symptoms mentioned below occur during load tests:

Scenario# 1: Steady input event stream
Observations:
* Gradually increasing offset lags which shouldn't happen normally as the streaming app is quite fast
* Events get processed

Scenario# 2: No input events after the load test stops producing events
Observations:
* Offset lag stuck at ~5k
* Stable consumer group
* No events processed
* No errors or messages in the logs

Scenario# 3: Restart the app when it stops processing events although offset lags are not zero
Observations:
* Offset lags start reducing and events start getting processed

Scenario# 4: Transient errors occur while processing events
* A custom exception handler that implements StreamsUncaughtExceptionHandler returns StreamThreadExceptionResponse.REPLACE_THREAD in the handle method
* If transient errors keep occurring occasionally and threads get replaced, the problem of the app stalling disappears.
* But if transient errors don't occur, the app tends to stall and I need to manually restart it

Summary:
* It appears that some streaming threads stall after processing for a while.
* It is difficult to change log level for Kafka Streams from ERROR to INFO as it starts producing a lot of log messages especially during load tests.
* I haven't yet managed to push Kafka streams metrics into AWS OTEL collector to get more insights.

Can you please let me know if any Kafka Streams config settings need changing? Should I reduce the values of any of these settings to help trigger rebalancing early and hence assign partitions to members that are active:
* METADATA_MAX_AGE_CONFIG: 5 hours in millis (to make rebalances rare)
* MAX_POLL_INTERVAL_MS_CONFIG: 20 minutes in millis
* SESSION_TIMEOUT_MS_CONFIG: 2 minutes in millis

Should I get rid of static membership – this may increase rebalancing but may be okay if it can prevent stalled threads from appearing as active members

Should I try upgrading Kafka Streams to v3.6.0 or v3.7.0? Hoping that v3.7.0 will be compatible with AWS MSK v3.6.0.

Thank you very much.

Kind regards,
Venkatesh