[ https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17645330#comment-17645330 ]

John Gray edited comment on KAFKA-14172 at 12/9/22 2:53 PM:
------------------------------------------------------------

I do hope this bug gets some eyes eventually; it has been pretty devastating 
for us. It is still a problem even on Streams 3.3.1 with broker 3.1.0. I amend 
my previous statement that this is related to our brokers rolling: it actually 
seems it can happen any time the restore consumers in our EOS apps with big 
state try to restore said big state. Sadly, setting the acceptable.recovery.lag 
config does not help us, because we do not have the extra space to run standby 
threads/replicas.
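
For reference, the KIP-441 settings in play here are roughly the following (a 
minimal sketch with placeholder names and values, not our actual config); it is 
the standby capacity that we cannot afford:

{code:java}
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class Kip441ConfigSketch {

    // Minimal sketch of the settings referred to above; all values are placeholders.
    static Properties streamsConfig() {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-eos-app");      // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
        // Standby copies of the state stores -- the extra disk/instances we do not have:
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
        // How far behind an instance may lag and still be handed the task without a
        // full restore (the default is 10000 records):
        props.put(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, 10_000L);
        return props;
    }
}
{code}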

We actually had to resort to dumping our keys to SQL Server and then querying 
for "should this key exist?" whenever we pull a key from our state store. If 
the state store returns nothing and SQL Server says it has seen that key 
before, we kill the app, which forces the state to be pulled again, and that 
usually fixes the issue. Something is _very_ wrong with these restore consumers 
(or I am doing something horribly wrong in this app, although we never had this 
problem before Kafka 3.0.0 or 3.1.0).
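
Roughly what that workaround looks like inside the transform step (a simplified 
sketch only: GuardedLookup, the store name, and keySeenBefore() are 
placeholders, with keySeenBefore() standing in for the real SQL Server lookup):

{code:java}
import org.apache.kafka.streams.errors.StreamsException;
import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

// Simplified sketch of the "does this key really not exist?" guard described above.
public class GuardedLookup implements ValueTransformerWithKey<String, String, String> {

    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
        store = (KeyValueStore<String, String>) context.getStateStore("my-store"); // placeholder name
    }

    @Override
    public String transform(final String key, final String value) {
        final String existing = store.get(key);
        if (existing == null && keySeenBefore(key)) {
            // The store claims it never saw this key, but the external record says it
            // did: state was lost during restore. Crash the instance so the store is
            // rebuilt from the changelog, which usually fixes it.
            throw new StreamsException("State store lost key " + key);
        }
        store.put(key, value);
        return value;
    }

    @Override
    public void close() { }

    // Placeholder for the external "have we ever seen this key?" query
    // (SQL Server in our case).
    private boolean keySeenBefore(final String key) {
        return false;
    }
}
{code}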


> bug: State stores lose state when tasks are reassigned under EOS wit…
> ---------------------------------------------------------------------
>
>                 Key: KAFKA-14172
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14172
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 3.1.1
>            Reporter: Martin Hørslev
>            Priority: Critical
>
> h1. State stores lose state when tasks are reassigned under EOS with standby 
> replicas and default acceptable lag.
> I have observed that state stores used in a transform step under Exactly 
> Once Semantics end up losing state after a rebalancing event that includes 
> reassignment of tasks to a previous standby task within the acceptable 
> standby lag.
>  
> The problem is reproducible, and an integration test has been created to 
> showcase the [issue|https://github.com/apache/kafka/pull/12540].
> A detailed description of the observed issue is provided 
> [here|https://github.com/apache/kafka/pull/12540/files?short_path=3ca480e#diff-3ca480ef093a1faa18912e1ebc679be492b341147b96d7a85bda59911228ef45].
> Similar issues have been observed and reported on Stack Overflow, for example 
> [here|https://stackoverflow.com/questions/69038181/kafka-streams-aggregation-data-loss-between-instance-restarts-and-rebalances].
>  
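
For readers skimming the quoted report, the setup it describes is roughly a 
topology of the following shape (a sketch only: topic and store names are 
placeholders, and it reuses the hypothetical GuardedLookup transformer from the 
comment above), run with EOS and standby replicas enabled:

{code:java}
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.ValueTransformerWithKeySupplier;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

public class TopologySketch {

    // Rough sketch of the kind of topology the quoted report describes: a stateful
    // transform step backed by a changelogged store. All names are placeholders.
    static Topology build() {
        final StreamsBuilder builder = new StreamsBuilder();

        final StoreBuilder<KeyValueStore<String, String>> store =
                Stores.keyValueStoreBuilder(
                        Stores.persistentKeyValueStore("my-store"),  // changelog enabled by default
                        Serdes.String(), Serdes.String());
        builder.addStateStore(store);

        final ValueTransformerWithKeySupplier<String, String, String> guarded = GuardedLookup::new;

        builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
               .transformValues(guarded, "my-store")   // the stateful transform step
               .to("output-topic", Produced.with(Serdes.String(), Serdes.String()));

        return builder.build();  // run with the EOS/standby properties sketched earlier
    }
}
{code}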



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
