Just wanted to share another variant of the log message, also related to metadata and rebalancing but with a different client reason:
INFO [GroupCoordinator 3]: Preparing to rebalance group <group> in state PreparingRebalance with old generation nnn (__consumer_offsets-nn) (reason: Updating metadata for member <member> during Stable; client reason: triggered followup rebalance scheduled for 0) (kafka.coordinator.group.GroupCoordinator)

Thank you.

Kind regards,
Venkatesh

From: Venkatesh Nagarajan <venkatesh.nagara...@uts.edu.au>
Date: Wednesday, 13 March 2024 at 12:06 pm
To: users@kafka.apache.org <users@kafka.apache.org>
Subject: Re: [EXTERNAL] Re: Kafka Streams 3.5.1 based app seems to get stalled

Thanks very much for your important inputs, Matthias. I will use the default METADATA_MAX_AGE_CONFIG. I set it to 5 hours when I saw a lot of such rebalancing related messages in the MSK broker logs:

INFO [GroupCoordinator 2]: Preparing to rebalance group <group> in state PreparingRebalance with old generation nnnn (__consumer_offsets-nn) (reason: Updating metadata for member <member> during Stable; client reason: need to revoke partitions and re-join) (kafka.coordinator.group.GroupCoordinator)

I am guessing that the two are unrelated. If you have any suggestions on how to reduce such rebalancing, that will be very helpful.

Thank you very much.

Kind regards,
Venkatesh

From: Matthias J. Sax <mj...@apache.org>
Date: Tuesday, 12 March 2024 at 1:31 pm
To: users@kafka.apache.org <users@kafka.apache.org>
Subject: [EXTERNAL] Re: Kafka Streams 3.5.1 based app seems to get stalled

Without detailed logs (maybe even DEBUG) it is hard to say. But from what you describe, it could be a metadata issue? Why are you setting

> METADATA_MAX_AGE_CONFIG (consumer and producer): 5 hours in millis (to make
> rebalances rare)

Refreshing metadata has nothing to do with rebalances, and a metadata refresh does not trigger a rebalance.

-Matthias

On 3/10/24 5:56 PM, Venkatesh Nagarajan wrote:
> Hi all,
>
> A Kafka Streams application sometimes stops consuming events during load
> testing.
> Please find below the details:
>
> Details of the app:
>
> * Kafka Streams Version: 3.5.1
> * Kafka: AWS MSK v3.6.0
> * Consumes events from 6 topics
> * Calls APIs to enrich events
> * Sometimes joins two streams
> * Produces enriched events in output topics
>
> Runs on AWS ECS:
>
> * Each task has 10 streaming threads
> * Autoscaling based on offset lags and a maximum of 6 ECS tasks
> * Input topics have 60 partitions each to match 6 tasks * 10 threads
> * Fairly good spread of events across all topic partitions using
>   partitioning keys
>
> Settings and configuration:
>
> * At least once semantics
> * MAX_POLL_RECORDS_CONFIG: 10
> * APPLICATION_ID_CONFIG
>
> // Make rebalances rare and prevent stop-the-world rebalances
>
> * Static membership (using GROUP_INSTANCE_ID_CONFIG)
> * METADATA_MAX_AGE_CONFIG (consumer and producer): 5 hours in millis
>   (to make rebalances rare)
> * MAX_POLL_INTERVAL_MS_CONFIG: 20 minutes in millis
> * SESSION_TIMEOUT_MS_CONFIG: 2 minutes in millis
>
> State store related settings:
>
> * TOPOLOGY_OPTIMIZATION_CONFIG: OPTIMIZE
> * STATESTORE_CACHE_MAX_BYTES_CONFIG: 300 * 1024 * 1024L
> * NUM_STANDBY_REPLICAS_CONFIG: 1
>
> Symptoms:
> The symptoms mentioned below occur during load tests.
>
> Scenario# 1:
> Steady input event stream
>
> Observations:
>
> * Gradually increasing offset lags, which shouldn't happen normally as
>   the streaming app is quite fast
> * Events get processed
>
> Scenario# 2:
> No input events after the load test stops producing events
>
> Observations:
>
> * Offset lag stuck at ~5k
> * Stable consumer group
> * No events processed
> * No errors or messages in the logs
>
> Scenario# 3:
> Restart the app when it stops processing events although offset lags are
> not zero
>
> Observations:
>
> * Offset lags start reducing and events start getting processed
>
> Scenario# 4:
> Transient errors occur while processing events
>
> * A custom exception handler that implements StreamsUncaughtExceptionHandler
>   returns StreamThreadExceptionResponse.REPLACE_THREAD in the handle method
> * If transient errors keep occurring occasionally and threads get replaced,
>   the problem of the app stalling disappears.
> * But if transient errors don't occur, the app tends to stall and I need
>   to manually restart it
>
> Summary:
>
> * It appears that some streaming threads stall after processing for a
>   while.
> * It is difficult to change the log level for Kafka Streams from ERROR to
>   INFO as it starts producing a lot of log messages, especially during
>   load tests.
> * I haven't yet managed to push Kafka Streams metrics into the AWS OTEL
>   collector to get more insights.
>
> Can you please let me know if any Kafka Streams config settings need
> changing? Should I reduce the values of any of these settings to help
> trigger rebalancing early and hence assign partitions to members that
> are active:
>
> * METADATA_MAX_AGE_CONFIG: 5 hours in millis (to make rebalances rare)
> * MAX_POLL_INTERVAL_MS_CONFIG: 20 minutes in millis
> * SESSION_TIMEOUT_MS_CONFIG: 2 minutes in millis
>
> Should I get rid of static membership? This may increase rebalancing but
> may be okay if it can prevent stalled threads from appearing as active
> members.
>
> Should I try upgrading Kafka Streams to v3.6.0 or v3.7.0? Hoping that
> v3.7.0 will be compatible with AWS MSK v3.6.0.
>
> Thank you very much.
>
> Kind regards,
> Venkatesh
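For reference, the configuration discussed in the thread can be sketched as plain Java properties. This is a hypothetical sketch, not the poster's actual code: the application id and instance id are placeholders, and the literal key strings stand in for the StreamsConfig/consumer-config constants named in the post (e.g. "max.poll.records" for MAX_POLL_RECORDS_CONFIG). It also reflects Matthias's advice to leave metadata.max.age.ms at its default rather than 5 hours.

```java
import java.util.Properties;

// Hypothetical sketch of the Kafka Streams configuration described in the
// thread. Key names are the literal config strings; values mirror the post.
public class StreamsConfigSketch {

    static Properties buildStreamsConfig(String instanceId) {
        Properties props = new Properties();
        props.put("application.id", "enrichment-app");   // placeholder name
        props.put("max.poll.records", "10");

        // Static membership: a unique id per instance, passed to the
        // consumer via the "consumer." prefix that Streams supports.
        props.put("consumer.group.instance.id", instanceId);

        // Settings the thread suggests reconsidering:
        props.put("max.poll.interval.ms", String.valueOf(20 * 60 * 1000)); // 20 min
        props.put("session.timeout.ms", String.valueOf(2 * 60 * 1000));    // 2 min

        // Per Matthias's reply, metadata.max.age.ms is deliberately NOT set
        // here: refreshing metadata does not trigger rebalances, so the
        // default (5 minutes) is fine.

        // State store related settings from the post:
        props.put("topology.optimization", "all");       // OPTIMIZE
        props.put("statestore.cache.max.bytes", String.valueOf(300 * 1024 * 1024L));
        props.put("num.standby.replicas", "1");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(buildStreamsConfig("instance-1"));
    }
}
```

In a real deployment these properties would be passed to the KafkaStreams constructor together with the topology; whether shorter max.poll.interval.ms / session.timeout.ms values help detect stalled static members sooner is exactly the open question in the post.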