[jira] [Commented] (KAFKA-14172) bug: State stores lose state when tasks are reassigned under EOS wit…

2023-03-08 Thread Guozhang Wang (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698180#comment-17698180 ]

Guozhang Wang commented on KAFKA-14172:
---

[~gray.john][~Horslev] I took a deep look into this issue and I think I found 
the culprit. Here's a short summary:

1. When standbys are enabled, Kafka Streams can recycle a standby task (and 
its state stores) into an active task, and vice versa.
2. When caching is enabled, the caching layer is bypassed when updating a 
standby task (i.e. updates go directly through putInternal).

These two behaviors combined cause the issue. Take a concrete example 
following the demo in https://github.com/apache/kafka/pull/12540: let's say we 
have a task A with a cached state store S. 

* For a given host, the task was originally hosted as an active.
* A rebalance happens, and the task is recycled into a standby. At that point 
the cache is flushed, so the underlying store and the caching layer are 
consistent; let's assume both are at S1 (version 1).
* The standby task is updated for a period of time, during which updates are 
written directly into the underlying store. Now the underlying store is at S2 
while the caching layer is still at S1.
* A second rebalance happens, and the task is recycled back into an active. 
When that task resumes normal processing, a read from the store hits the 
caching layer first and very likely returns the older S1 instead of S2. As a 
result we get a duplicate: in the above PR's example specifically, the 
{{count}} store returns an old counter, and the ID inferred from that counter 
ends up being used twice (a toy sketch of this sequence follows below).
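
To make the sequence above concrete, here is a tiny self-contained Java sketch 
(not Kafka Streams code; all class and method names are made up for 
illustration) that mimics a caching layer sitting on top of an underlying 
store, with standby updates bypassing the cache:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Toy model of the scenario above; this is NOT the Kafka Streams implementation.
public class StaleCacheDemo {

    static final Map<String, String> cache = new HashMap<>();
    static final Map<String, String> underlyingStore = new HashMap<>();

    // Active processing writes through the caching layer.
    static void putAsActive(String key, String value) {
        cache.put(key, value);
    }

    // Standby updates bypass the cache and go straight to the underlying store
    // (analogous to the putInternal path mentioned above).
    static void putAsStandby(String key, String value) {
        underlyingStore.put(key, value);
    }

    // Reads hit the cache first and only fall back to the underlying store on a miss.
    static String get(String key) {
        return cache.containsKey(key) ? cache.get(key) : underlyingStore.get(key);
    }

    // Flush pushes cached entries down to the store but does not clear the cache.
    static void flush() {
        underlyingStore.putAll(cache);
    }

    public static void main(String[] args) {
        putAsActive("counter", "S1");   // task is active
        flush();                        // recycled into a standby: cache and store both at S1
        putAsStandby("counter", "S2");  // standby updates: store moves to S2, cache stays at S1
        // recycled back into an active: the next read hits the stale cache entry
        System.out.println(get("counter")); // prints S1, not S2
    }
}
{code}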

That also explains why the test does not fail if caching is disabled or 
standby replicas are disabled (verified locally); I think the test could still 
fail even when the acceptable lag is set to 0, but a standby -> active -> 
standby transition is then much less likely, so people may not easily 
observe it.
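
For context, both ingredients correspond to standard Streams configuration 
knobs; a minimal sketch of how one might toggle them when reproducing locally 
(the application id and bootstrap servers below are placeholders, and the 
values are examples only) could look like this:

{code:java}
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

// Illustrative config only; values are placeholders, not a recommendation.
public class Kafka14172ReproConfig {
    public static Properties reproProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kafka-14172-repro");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
        // Ingredient 1: at least one standby replica (set to 0 to disable standbys).
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
        // Ingredient 2: record caching enabled (set to 0, or mark the store with
        // withCachingDisabled(), to take the caching layer out of the picture).
        props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024);
        return props;
    }
}
{code}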

I have a hack fix (note: this is not for merging, as it is just a hack), built 
on top of [~Horslev]'s integration test, which clears the cache upon flushing 
it (flushing is triggered when the task manager is flushed). With this fix the 
test no longer fails.
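
In terms of the toy sketch above, the hack boils down to clearing the cache as 
part of the flush (again, this is just the idea, not the actual Streams 
internals):

{code:java}
// Modified flush() for the toy model above: after pushing entries down to the
// underlying store, drop them from the cache so a recycled task can never read
// a stale cached value.
static void flush() {
    underlyingStore.putAll(cache);
    cache.clear();
}
{code}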

> bug: State stores lose state when tasks are reassigned under EOS wit…
> -
>
> Key: KAFKA-14172
> URL: https://issues.apache.org/jira/browse/KAFKA-14172
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.1.1
>Reporter: Martin Hørslev
>Priority: Critical
>
> h1. State stores lose state when tasks are reassigned under EOS with standby 
> replicas and default acceptable lag.
> I have observed that state stores used in a transform step under Exactly 
> Once semantics end up losing state after a rebalancing event that includes 
> reassignment of tasks to a previous standby task within the acceptable standby 
> lag.
>  
> The problem is reproducible and an integration test has been created to 
> showcase the [issue|https://github.com/apache/kafka/pull/12540]. 
> A detailed description of the observed issue is provided 
> [here|https://github.com/apache/kafka/pull/12540/files?short_path=3ca480e#diff-3ca480ef093a1faa18912e1ebc679be492b341147b96d7a85bda59911228ef45].
> Similar issues have been observed and reported on StackOverflow, for example 
> [here|https://stackoverflow.com/questions/69038181/kafka-streams-aggregation-data-loss-between-instance-restarts-and-rebalances].
>  





[jira] [Commented] (KAFKA-14172) bug: State stores lose state when tasks are reassigned under EOS wit…

2022-12-09 Thread John Gray (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17645330#comment-17645330 ]

John Gray commented on KAFKA-14172:
---

I do hope this bug gets some eyes eventually; it has been pretty devastating 
for us. I amend my previous statement that this is related to our brokers 
rolling: it actually seems it can happen any time the restore consumers in our 
EOS apps with big state try to restore that big state. Sadly, setting the 
acceptable.recovery.lag config does not help us because we do not have the 
extra space to run standby threads/replicas. 

We actually had to resort to dumping our keys to a SQL Server instance and 
then querying it for "should this key exist?" whenever we pull keys from our 
state store. If the state store returns nothing and SQL Server says it has 
seen that key before, we kill the app, which forces the state to be restored 
again, and that usually fixes the issue. Something is _very_ wrong with these 
restore consumers (or I am doing something horribly wrong in this app, 
although we never had this problem before Kafka 3.0.0 or 3.1.0).
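
A rough sketch of that guard (the table name, schema, and method names here 
are hypothetical, purely to illustrate the idea, not the actual production 
code):

{code:java}
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.apache.kafka.streams.state.KeyValueStore;

// Hypothetical sketch of the "should this key exist?" guard: if the state store has
// lost a key that the external index has already seen, crash so state is restored again.
public class LostStateGuard {

    public static void checkKey(KeyValueStore<String, Long> store,
                                Connection sqlServer,
                                String key) throws Exception {
        if (store.get(key) != null) {
            return; // the state store still has the key, nothing to do
        }
        try (PreparedStatement stmt =
                     sqlServer.prepareStatement("SELECT 1 FROM seen_keys WHERE key_id = ?")) {
            stmt.setString(1, key);
            try (ResultSet rs = stmt.executeQuery()) {
                if (rs.next()) {
                    // Key was seen before but is missing from the restored store:
                    // fail fast so the application restarts and restores state again.
                    throw new IllegalStateException("State store lost key " + key);
                }
            }
        }
    }
}
{code}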



[jira] [Commented] (KAFKA-14172) bug: State stores lose state when tasks are reassigned under EOS wit…

2022-09-02 Thread John Gray (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599478#comment-17599478 ]

John Gray commented on KAFKA-14172:
---

[~Horslev] we utilize static membership for our stateful apps, so for us the 
cluster upgrades seem to be about the only time they really rebalance. So I do 
think we lose data because of the rebalancing EOS consumers. The biggest 
difference, I think, is that we don't use standby replicas, yet we still see 
data loss during restores.



[jira] [Commented] (KAFKA-14172) bug: State stores lose state when tasks are reassigned under EOS wit…

2022-09-02 Thread Jira


[ https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599294#comment-17599294 ]

Martin Hørslev commented on KAFKA-14172:


[~gray.john] thanks for adding your experience. I agree the scenarios seem 
similar, although the setup you describe is different. 
If I understand your issue correctly, it seems related to rolling upgrades of 
your Kafka cluster? 
This issue is triggered solely by adding new stream applications. 



[jira] [Commented] (KAFKA-14172) bug: State stores lose state when tasks are reassigned under EOS wit…

2022-09-01 Thread John Gray (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599002#comment-17599002 ]

John Gray commented on KAFKA-14172:
---

I know next to nothing about the internal workings of Kafka, sadly, but I am 
noticing that KAFKA-12486 was introduced in 3.1.0, which is the version in 
which I started noticing problems. I notice you helped out with that Jira, 
[~ableegoldman]; is there any possible way, in your mind, that it might cause 
weirdness with state restoration? 



[jira] [Commented] (KAFKA-14172) bug: State stores lose state when tasks are reassigned under EOS wit…

2022-09-01 Thread John Gray (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598956#comment-17598956 ]

John Gray commented on KAFKA-14172:
---

My stateful/EOS Kafka apps also seem to be struggling on 3.0.0+, with a similar 
theme: it appears the restore consumers are not consuming all of their messages 
for a full restore before processing begins. This sad situation seems to happen 
consistently after Strimzi rolls out an upgrade to our cluster. Once the 
brokers are all rolled, if our stateful apps rebalance, we lose data. We do not 
have the extra disk space for standby replicas, so acceptable.recovery.lag and 
the related standby-replica settings are not at play for us. But the restore 
consumers fumbling data with EOS seems to be a big problem for us. 
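
For reference, the standby-related settings mentioned above are roughly the 
following (an illustrative sketch only; the values shown are examples, with 
acceptable.recovery.lag defaulting to 10000 records):

{code:java}
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

// Illustrative only: the standby-related knobs that are NOT in play in this setup.
public class StandbySettings {
    public static Properties withStandbys() {
        Properties props = new Properties();
        // Warm task copies that keep a replica of the state store on another instance
        // (requires extra disk space, which is the constraint described above).
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
        // How many records behind a replica may be and still be considered "caught up"
        // when tasks are assigned after a rebalance.
        props.put(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, 10_000L);
        return props;
    }
}
{code}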
