[ 
https://issues.apache.org/jira/browse/KAFKA-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abdullah alkhawatrah updated KAFKA-14440:
-----------------------------------------
    Description: 
Hey,

I have a Kafka Streams service (running in a k8s cluster) that aggregates events 
from multiple input topics. The topology has multiple foreign-key joins (FKJs). 
The input topics held around 7 billion events when the service was started from 
`earliest`.

The service has EOS enabled and `transaction.timeout.ms` is set to `600000` 
(10 minutes).
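
For reference, this is roughly how the relevant part of the Streams configuration 
looks (a simplified sketch, not the exact deployment config: the application id, 
bootstrap servers and the exact EOS flavour shown are placeholders; only the two 
settings mentioned above are the real values):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsProps {

    static Properties buildConfig() {
        Properties props = new Properties();
        // placeholders, not the real deployment values
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "aggregation-service");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        // EOS enabled (shown here with exactly_once_v2; the exact guarantee flavour is an assumption)
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
        // transaction.timeout.ms = 600000 (10 minutes) on the Streams-managed producer
        props.put(StreamsConfig.producerPrefix(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG), 600000);
        return props;
    }
}
```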

The problem I am having is frequent local state wipe-outs, which lead to very 
long rebalances. As can be seen from the attached images, local disk usage drops 
to ~0 very often. These wipe-outs are part of the EOS guarantee, based on this 
log message: 
`State store transfer-store did not find checkpoint offsets while stores are 
not empty, since under EOS it has the risk of getting uncommitted data in 
stores we have to treat it as a task corruption error and wipe out the local 
state of task 1_8 before re-bootstrapping`
 

I noticed that this happens as a result of one of the following:
 * The process gets SIGKILLed, for example when it runs out of memory or fails to 
shut down gracefully on pod rotation (see the shutdown sketch after this list). 
This explains the missing local checkpoint file, but I thought local checkpoint 
updates were frequent, so I expected only part of the state to be reset, not the 
whole local state.
 * Although we have a long transaction timeout configured, the following error 
appears many times in the logs, after which the Kafka Streams application ends up 
in the ERROR state. On the next startup, the local checkpoint file is not found:

`Transiting to abortable error state due to 
org.apache.kafka.common.errors.InvalidProducerEpochException: Producer 
attempted to produce with an old epoch.`
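
For context on the pod-rotation case, this is roughly how the shutdown path is 
wired (a simplified sketch, not our exact code; it reuses the `StreamsProps` 
helper from the sketch above, and the class name and close timeout are 
illustrative). If `close()` does not finish within the pod's termination grace 
period and Kubernetes follows up with SIGKILL, the tasks are not closed cleanly 
and the local `.checkpoint` file is not written:

```java
import java.time.Duration;
import java.util.concurrent.CountDownLatch;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.Topology;

public class ServiceMain {

    public static void main(String[] args) {
        Topology topology = buildTopology(); // the real topology (multiple FKJs) is omitted
        KafkaStreams streams = new KafkaStreams(topology, StreamsProps.buildConfig());
        CountDownLatch latch = new CountDownLatch(1);

        // On SIGTERM (e.g. pod rotation) close Streams so that tasks are closed cleanly
        // and their local .checkpoint files are written before the process exits.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            streams.close(Duration.ofSeconds(30)); // must complete within the pod's grace period
            latch.countDown();
        }));

        streams.start();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static Topology buildTopology() {
        return new Topology(); // placeholder for the real aggregation topology
    }
}
```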
 

The service has 10 instances, all showing the same behaviour. The issue 
disappears when EOS is disabled.

The Kafka cluster runs Kafka 2.6, with a minimum ISR of 3.

 

 

> Local state wipeout with EOS
> ----------------------------
>
>                 Key: KAFKA-14440
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14440
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 3.2.3
>            Reporter: Abdullah alkhawatrah
>            Priority: Major
>         Attachments: Screenshot 2022-12-02 at 09.26.27.png
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
