Hi all,

we were testing Kafka cluster outages by randomly crashing broker nodes
(for instance, 1 of 3) while still keeping a majority of replicas
available.

From time to time our Kafka Streams app crashes with the following exception:

[ERROR] [StreamThread-1] [org.apache.kafka.streams.processor.internals.StreamThread] [stream-thread [StreamThread-1] Failed while executing StreamTask 0_1 due to flush state: ]
org.apache.kafka.streams.errors.StreamsException: task [0_1] exception caught when producing
    at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121)
    at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129)
    at org.apache.kafka.streams.processor.internals.StreamTask.flushState(StreamTask.java:422)
    at org.apache.kafka.streams.processor.internals.StreamThread$4.apply(StreamThread.java:555)
    at org.apache.kafka.streams.processor.internals.StreamThread.performOnTasks(StreamThread.java:501)
    at org.apache.kafka.streams.processor.internals.StreamThread.flushAllState(StreamThread.java:551)
    at org.apache.kafka.streams.processor.internals.StreamThread.shutdownTasksAndState(StreamThread.java:449)
    at org.apache.kafka.streams.processor.internals.StreamThread.shutdown(StreamThread.java:391)
    at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:372)
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for audit-metrics-collector-store_context.operation-message.bucketStartMs-message.dc-repartition-0: 30026 ms has passed since batch creation plus linger time

After that happens, only an application restart gets it processing again.
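In case it matters, here is a minimal sketch of how we could automate that
restart instead of doing it by hand, via KafkaStreams#setUncaughtExceptionHandler
(written against the 1.0+ API; the topology, topic names, application id, and
broker addresses below are placeholders, not our real setup):

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class RestartingStreamsApp {

    public static void main(String[] args) {
        startStreams();
    }

    static void startStreams() {
        // Placeholder topology; the real topology goes here.
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "audit-metrics-collector");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG,
                "broker1:9092,broker2:9092,broker3:9092");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);

        // Fires when a StreamThread dies with an unhandled exception,
        // e.g. the StreamsException caused by the TimeoutException above.
        streams.setUncaughtExceptionHandler((thread, throwable) -> {
            System.err.println("Stream thread " + thread.getName()
                    + " died: " + throwable);
            // close() can block, so tear down and restart from a fresh
            // thread rather than from the dying stream thread itself.
            new Thread(() -> {
                streams.close();
                startStreams();
            }).start();
        });

        streams.start();
    }
}

This still loses the instance for the duration of the restart, though, which is
why we would prefer to avoid the crash in the first place.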

We believe this is correlated with our outage testing, and our question is:
what are the recommended options for making a Kafka Streams application more
resilient to broker crashes?
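For example, is tuning the embedded producer through the Streams config the
right direction? A minimal sketch of what we have in mind, assuming
StreamsConfig.producerPrefix with purely illustrative values (the producer's
default request.timeout.ms of 30000 ms seems to match the ~30026 ms expiry in
the log above):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class ResilienceConfig {

    // Builds a Streams config with producer-level resilience settings;
    // application id and broker addresses are placeholders.
    static Properties buildConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "audit-metrics-collector");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG,
                "broker1:9092,broker2:9092,broker3:9092");

        // Internal repartition/changelog topics survive a single broker
        // loss only if they are replicated.
        props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);

        // Let the embedded producer keep retrying through a broker outage
        // instead of expiring batches after the default 30 s timeout.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRIES_CONFIG),
                Integer.MAX_VALUE);
        props.put(StreamsConfig.producerPrefix(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG),
                60_000);
        props.put(StreamsConfig.producerPrefix(ProducerConfig.ACKS_CONFIG), "all");

        return props;
    }
}

Or is there a better approach we are missing?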

Thank you.
