[ https://issues.apache.org/jira/browse/KAFKA-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias J. Sax resolved KAFKA-14440.
-------------------------------------
    Resolution: Duplicate

> Local state wipeout with EOS
> ----------------------------
>
>                 Key: KAFKA-14440
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14440
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 3.2.3
>            Reporter: Abdullah alkhawatrah
>            Priority: Major
>         Attachments: Screenshot 2022-12-02 at 09.26.27.png
>
>
> Hey,
> I have a Kafka Streams service that aggregates events from multiple input 
> topics (running in a k8s cluster). The topology has multiple foreign-key 
> joins (FKJs). The input topics had around 7 billion events when the service 
> was started from `earliest`.
> The service has EOS enabled and 
> {code:java}
> transaction.timeout.ms: 600000{code}
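> For reference, a minimal sketch of how these settings are typically wired up 
> in a Streams app (the application id, bootstrap servers, and the 
> exactly_once_v2 choice are illustrative assumptions, not taken from this 
> report):
> {code:java}
> import java.util.Properties;
> import org.apache.kafka.clients.producer.ProducerConfig;
> import org.apache.kafka.streams.StreamsConfig;
> 
> // Sketch only: placeholder application id and bootstrap servers.
> final class EosConfigSketch {
>     static Properties streamsProperties() {
>         final Properties props = new Properties();
>         props.put(StreamsConfig.APPLICATION_ID_CONFIG, "aggregation-service");
>         props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
>         // Exactly-once processing; on 3.2.x this is exactly_once_v2.
>         props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
>         // transaction.timeout.ms is a producer config, forwarded via the producer prefix.
>         props.put(StreamsConfig.producerPrefix(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG), 600000);
>         return props;
>     }
> }
> {code}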
> The problem I am having is frequent local state wipe-outs, which lead to 
> very long rebalances. As can be seen from the attached images, local disk 
> sizes drop to ~0 very often. These wipe-outs are part of the EOS guarantee, 
> based on this log message: 
> {code:java}
> State store transfer-store did not find checkpoint offsets while stores are 
> not empty, since under EOS it has the risk of getting uncommitted data in 
> stores we have to treat it as a task corruption error and wipe out the local 
> state of task 1_8 before re-bootstrapping{code}
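> To spell out what that message implies, here is an illustrative sketch of the 
> decision (not the actual Kafka Streams code; the names are hypothetical): 
> under EOS, data already flushed to the local stores may belong to an 
> uncommitted transaction, so a non-empty store without checkpoint offsets 
> cannot be trusted and the whole task state is wiped before restoring from the 
> changelog.
> {code:java}
> import java.io.File;
> import java.util.Map;
> 
> // Illustrative sketch only -- the real logic lives inside Kafka Streams'
> // state manager.
> final class EosWipeDecisionSketch {
>     static boolean mustWipeLocalState(final boolean eosEnabled,
>                                       final Map<String, Long> checkpointOffsets,
>                                       final File taskStateDir) {
>         final String[] contents = taskStateDir.list();
>         final boolean storesNotEmpty = contents != null && contents.length > 0;
>         // No checkpoint offsets + non-empty stores under EOS => possibly
>         // uncommitted data on disk => treat as task corruption and wipe.
>         return eosEnabled && checkpointOffsets.isEmpty() && storesNotEmpty;
>     }
> }
> {code}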
>  
> I noticed that this happens as a result of one of the following:
>  * The process gets a SIGKILL when it runs out of memory or fails to shut 
> down gracefully on pod rotation, for example. This explains the missing local 
> checkpoint file, but I thought local checkpoint updates were frequent, so I 
> expected part of the state to be reset, not the whole local state.
>  * Although we have a long transaction timeout configured, the following 
> appears many times in the logs, after which Kafka Streams gets into an error 
> state (a mitigation sketch follows the log snippet below). On startup, the 
> local checkpoint file is not found:
> {code:java}
> Transiting to abortable error state due to 
> org.apache.kafka.common.errors.InvalidProducerEpochException: Producer 
> attempted to produce with an old epoch.{code}
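> As an aside (not something used in this report): since Kafka Streams 2.8, a 
> custom uncaught-exception handler can replace a failed stream thread instead 
> of letting the whole instance sit in ERROR state. A minimal sketch, assuming 
> an already-constructed KafkaStreams instance:
> {code:java}
> import org.apache.kafka.streams.KafkaStreams;
> import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler;
> 
> // Sketch: replace the dead thread so the instance keeps running; whether this
> // is appropriate depends on why the thread died in the first place.
> final class ReplaceThreadHandlerSketch {
>     static void install(final KafkaStreams streams) {
>         streams.setUncaughtExceptionHandler(exception ->
>             StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse.REPLACE_THREAD);
>     }
> }
> {code}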
> The service has 10 instances, all showing the same behaviour. The issue 
> disappears when EOS is disabled.
> The Kafka cluster runs Kafka 2.6, with a minimum ISR of 3.
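> One more note, again not from the report: when long rebalances are dominated 
> by restoring wiped state from the changelogs, num.standby.replicas (a 
> standard StreamsConfig) is the usual knob for keeping warm copies on other 
> instances. A small sketch:
> {code:java}
> import java.util.Properties;
> import org.apache.kafka.streams.StreamsConfig;
> 
> // Sketch: keep one warm standby per task so a wiped instance can fail over
> // instead of blocking the rebalance on a full restore.
> final class StandbySketch {
>     static void addStandbys(final Properties props) {
>         props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
>     }
> }
> {code}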
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
