[ https://issues.apache.org/jira/browse/KAFKA-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthias J. Sax resolved KAFKA-14440.
-------------------------------------
    Resolution: Duplicate

> Local state wipeout with EOS
> ----------------------------
>
>                 Key: KAFKA-14440
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14440
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 3.2.3
>            Reporter: Abdullah alkhawatrah
>            Priority: Major
>         Attachments: Screenshot 2022-12-02 at 09.26.27.png
>
> Hey,
> I have a Kafka Streams service that aggregates events from multiple input
> topics (running in a k8s cluster). The topology has multiple FKJs. The input
> topics had around 7 billion events when the service was started from
> `earliest`.
> The service has EOS enabled and
> {code:java}
> transaction.timeout.ms: 600000{code}
> The problem I am having is frequent local state wipe-outs, which lead to
> very long rebalances. As can be seen from the attached images, local disk
> sizes drop to ~0 very often. These wipe-outs are part of the EOS guarantee,
> based on this log message:
> {code:java}
> State store transfer-store did not find checkpoint offsets while stores are
> not empty, since under EOS it has the risk of getting uncommitted data in
> stores we have to treat it as a task corruption error and wipe out the local
> state of task 1_8 before re-bootstrapping{code}
> I noticed that this happens as a result of one of the following:
> * The process gets a SIGKILL when running out of memory, or fails to shut
> down gracefully, on pod rotation for example. This explains the missing local
> checkpoint file, but I thought local checkpoint updates were frequent, so I
> expected part of the state to be reset, not the whole local state.
> * Although we have a long transaction timeout config, this appears many
> times in the logs, after which Kafka Streams gets into an error state:
> {code:java}
> Transiting to abortable error state due to
> org.apache.kafka.common.errors.InvalidProducerEpochException: Producer
> attempted to produce with an old epoch.{code}
> On the startup that follows, the local checkpoint file is not found.
> The service has 10 instances, all showing the same behaviour. The issue
> disappears when EOS is disabled.
> The Kafka cluster runs Kafka 2.6, with a minimum ISR of 3.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
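For context, a minimal sketch of the Streams configuration the ticket describes (EOS plus the quoted 10-minute transaction timeout). The application id and bootstrap servers are placeholders, not values from the ticket, and `exactly_once_v2` is assumed as the EOS setting for a 3.2.x client; the ticket only states that EOS is enabled.

```java
import java.util.Properties;

public class EosConfigSketch {
    public static Properties streamsProps() {
        Properties props = new Properties();
        // Placeholder identifiers; the real service's values are not in the ticket.
        props.put("application.id", "transfer-aggregator");
        props.put("bootstrap.servers", "kafka:9092");
        // EOS as described in the ticket; "exactly_once_v2" is the setting for
        // 3.x clients (older clients used "exactly_once").
        props.put("processing.guarantee", "exactly_once_v2");
        // The transaction timeout quoted in the ticket (10 minutes). In Streams,
        // producer configs like this one can also be passed with the
        // "producer." prefix to target the embedded producers explicitly.
        props.put("transaction.timeout.ms", "600000");
        return props;
    }

    public static void main(String[] args) {
        Properties props = streamsProps();
        System.out.println(props.getProperty("processing.guarantee"));
        System.out.println(props.getProperty("transaction.timeout.ms"));
    }
}
```

With EOS enabled, a task whose state directory is non-empty but has no checkpoint file is treated as corrupted and wiped, which is the behaviour the quoted log message reports.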