Abdullah alkhawatrah created KAFKA-14440:
--------------------------------------------

             Summary: Local state wipeout with EOS
                 Key: KAFKA-14440
                 URL: https://issues.apache.org/jira/browse/KAFKA-14440
             Project: Kafka
          Issue Type: Bug
          Components: streams
    Affects Versions: 3.2.3
            Reporter: Abdullah alkhawatrah
         Attachments: Screenshot 2022-12-02 at 09.26.27.png

Hey,

I have a Kafka Streams service (running in a k8s cluster) that aggregates events from multiple input topics. The topology has multiple foreign-key joins (FKJs). The input topics had around 7 billion events when the service was started from `earliest`.
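
For context, a foreign-key join between two KTables might look roughly like the sketch below; the topic names, serdes, and the value format assumed by the key extractor are all hypothetical, since the report does not describe the actual topology. Each FKJ also creates internal state stores, which is relevant to the wipe-outs described further down.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class FkjSketch {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Hypothetical input topics; the report does not name its topics.
        KTable<String, String> left = builder.table(
                "input-a", Consumed.with(Serdes.String(), Serdes.String()));
        KTable<String, String> right = builder.table(
                "input-b", Consumed.with(Serdes.String(), Serdes.String()));

        // Foreign-key join: the extractor derives the other table's key
        // from the left-side value (the "fk:payload" format is invented).
        KTable<String, String> joined = left.join(
                right,
                leftValue -> leftValue.split(":")[0],
                (l, r) -> l + "|" + r);

        joined.toStream().to("joined-output",
                Produced.with(Serdes.String(), Serdes.String()));
        return builder.build();
    }
}
```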

The service has EOS enabled, and `transaction.timeout.ms` is set to `600000`.
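
For reference, a minimal sketch of the configuration described above (EOS plus the producer-level transaction timeout). The application id and bootstrap servers are placeholders, and `exactly_once_v2` is an assumption, since the report only says EOS is enabled:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class EosConfigSketch {
    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "aggregation-service"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder

        // EOS; exactly_once_v2 is the non-deprecated guarantee on 3.2.x
        // (assumed here, the report does not say which variant is used).
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG,
                  StreamsConfig.EXACTLY_ONCE_V2);

        // transaction.timeout.ms = 600000, forwarded to the embedded producer.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG),
                  600000);
        return props;
    }
}
```

Disabling EOS, as mentioned at the end of the report, would correspond to setting the guarantee back to `StreamsConfig.AT_LEAST_ONCE`.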

The problem I am having is frequent local state wipe-outs, which lead to very long rebalances. As can be seen from the attached screenshot, local disk usage drops to ~0 very often. These wipe-outs are part of the EOS guarantee, based on this log message:

State store transfer-store did not find checkpoint offsets while stores are not empty, since under EOS it has the risk of getting uncommitted data in stores we have to treat it as a task corruption error and wipe out the local state of task 1_8 before re-bootstrapping

I noticed that this happens as a result of one of the following:
 * The process gets SIGKILLed, for example when running out of memory or on failing to shut down gracefully during pod rotation. This explains the missing local checkpoint file, but I thought local checkpoint updates were frequent, so I expected only part of the state to be reset, not the whole local state.
 * Although we have a long transaction timeout configured, the following appears many times in the logs, after which Kafka Streams gets into an error state. On startup, the local checkpoint file is not found:

Transiting to abortable error state due to org.apache.kafka.common.errors.InvalidProducerEpochException: Producer attempted to produce with an old epoch.

The service has 10 instances, all showing the same behaviour. The issue disappears when EOS is disabled.

The Kafka cluster runs Kafka 2.6, with a minimum ISR of 3.