Hi all, we have been testing Kafka cluster outages by randomly crashing broker nodes (for instance, 1 of 3) while still keeping a majority of replicas available.
From time to time our Kafka Streams app crashes with the following exception:

[ERROR] [StreamThread-1] [org.apache.kafka.streams.processor.internals.StreamThread] [stream-thread [StreamThread-1] Failed while executing StreamTask 0_1 due to flush state: ]
org.apache.kafka.streams.errors.StreamsException: task [0_1] exception caught when producing
    at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121)
    at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129)
    at org.apache.kafka.streams.processor.internals.StreamTask.flushState(StreamTask.java:422)
    at org.apache.kafka.streams.processor.internals.StreamThread$4.apply(StreamThread.java:555)
    at org.apache.kafka.streams.processor.internals.StreamThread.performOnTasks(StreamThread.java:501)
    at org.apache.kafka.streams.processor.internals.StreamThread.flushAllState(StreamThread.java:551)
    at org.apache.kafka.streams.processor.internals.StreamThread.shutdownTasksAndState(StreamThread.java:449)
    at org.apache.kafka.streams.processor.internals.StreamThread.shutdown(StreamThread.java:391)
    at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:372)
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for audit-metrics-collector-store_context.operation-message.bucketStartMs-message.dc-repartition-0: 30026 ms has passed since batch creation plus linger time

After this happens, only a restart of the app gets it processing again. We believe this is correlated with our outage testing. Our question is: what are the recommended options for making a Kafka Streams application more resilient to broker crashes? Thank you.
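For context, here is roughly what we are considering so far (a sketch, not a verified fix): the internal producer gives up after request.timeout.ms (default 30000 ms, which matches the "30026 ms" in the exception), so raising the producer retries and timeouts via the "producer." config prefix, and replicating the internal repartition/changelog topics, might let the app ride out a single broker crash. The application id and broker addresses below are placeholders for our setup:

```java
import java.util.Properties;

public class ResilientStreamsConfig {
    public static Properties build() {
        Properties props = new Properties();
        // Placeholder app id and bootstrap servers for illustration.
        props.put("application.id", "audit-metrics-collector");
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");

        // Producer-level overrides use the "producer." prefix in Kafka Streams.
        // Keep retrying instead of expiring the batch while a broker is down:
        props.put("producer.retries", "2147483647");
        // Give a batch more than the 30 s default before it is expired:
        props.put("producer.request.timeout.ms", "120000");
        // Block longer on metadata/buffer instead of throwing during failover:
        props.put("producer.max.block.ms", "120000");

        // Replicate internal repartition/changelog topics so losing one of
        // three brokers does not make a partition unavailable:
        props.put("replication.factor", "3");
        return props;
    }
}
```

We would still need to confirm these overrides against the Kafka version we run, since the relevant producer defaults have changed across releases.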