[jira] [Comment Edited] (KAFKA-14172) bug: State stores lose state when tasks are reassigned under EOS wit…

2022-12-09 Thread John Gray (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645330#comment-17645330
 ] 

John Gray edited comment on KAFKA-14172 at 12/9/22 2:53 PM:


I do hope this bug gets some eyes eventually; it has been pretty devastating 
for us. It is still a problem even on Streams 3.3.1 w/ broker 3.1.0. I amend my 
previous statement that this is related to our brokers rolling: it actually 
seems it can happen any time our restore consumers in EOS apps w/ big state try 
to restore said big state. Sadly, setting the acceptable.lag config does not 
help us because we do not have the extra space to run standby threads/replicas. 

We actually had to resort to dumping our keys to SQL Server and then querying 
for "should this key exist?" when we are pulling keys from our state store. If 
the state store returns nothing and SQL Server says it has seen that key before, 
we kill the app, which causes us to pull the state again, which usually fixes 
the issue. Something is _very_ wrong with these restore consumers (or I am 
doing something horribly wrong in this app, although we never had this problem 
before Kafka 3.0.0 or 3.1.0).
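
As an illustration of the workaround described above, here is a rough Java 
sketch (not the poster's actual code): the store name, the KeyRegistry wrapper 
around the SQL Server "have we ever seen this key?" lookup, and the choice to 
throw instead of killing the process are all invented for the example.

{code:java}
// Hypothetical sketch of the workaround; names and error handling are invented.
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

public class GuardedProcessor implements Processor<String, String, String, String> {

    /** Thin wrapper over the external SQL Server "have we ever seen this key?" table. */
    public interface KeyRegistry {
        boolean hasSeen(String key);
    }

    private final KeyRegistry registry;
    private ProcessorContext<String, String> context;
    private KeyValueStore<String, String> store;

    public GuardedProcessor(KeyRegistry registry) {
        this.registry = registry;
    }

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
        this.store = context.getStateStore("aggregate-store"); // assumed store name
    }

    @Override
    public void process(Record<String, String> record) {
        String existing = store.get(record.key());
        if (existing == null && registry.hasSeen(record.key())) {
            // The store is missing state it should have: fail fast so the app is
            // restarted and the store is rebuilt from the changelog, as described above.
            throw new IllegalStateException(
                    "State store miss for previously seen key " + record.key());
        }
        store.put(record.key(), record.value());
        context.forward(record);
    }
}
{code}

In the app described above the process is killed outright, so that the state is 
pulled again on the next start; throwing here just surfaces the same condition 
at the same point.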


was (Author: gray.john):
I do hope this bug gets some eyes eventually; it has been pretty devastating 
for us. It is still a problem even on Streams 3.3.1. I amend my previous 
statement that this is related to our brokers rolling: it actually seems it can 
happen any time our restore consumers in EOS apps w/ big state try to restore 
said big state. Sadly, setting the acceptable.lag config does not help us 
because we do not have the extra space to run standby threads/replicas. 

We actually had to resort to dumping our keys to SQL Server and then querying 
for "should this key exist?" when we are pulling keys from our state store. If 
the state store returns nothing and SQL Server says it has seen that key before, 
we kill the app, which causes us to pull the state again, which usually fixes 
the issue. Something is _very_ wrong with these restore consumers (or I am 
doing something horribly wrong in this app, although we never had this problem 
before Kafka 3.0.0 or 3.1.0).

> bug: State stores lose state when tasks are reassigned under EOS wit…
> -
>
> Key: KAFKA-14172
> URL: https://issues.apache.org/jira/browse/KAFKA-14172
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.1.1
>Reporter: Martin Hørslev
>Priority: Critical
>
> h1. State stores lose state when tasks are reassigned under EOS with standby 
> replicas and default acceptable lag.
> I have observed that state stores used in a transform step under Exactly Once 
> semantics end up losing state after a rebalancing event that includes 
> reassignment of tasks to a previous standby task within the acceptable standby 
> lag.
>  
> The problem is reproducible, and an integration test has been created to 
> showcase the [issue|https://github.com/apache/kafka/pull/12540]. 
> A detailed description of the observed issue is provided 
> [here|https://github.com/apache/kafka/pull/12540/files?short_path=3ca480e#diff-3ca480ef093a1faa18912e1ebc679be492b341147b96d7a85bda59911228ef45]
> Similar issues have been observed and reported to StackOverflow for example 
> [here|https://stackoverflow.com/questions/69038181/kafka-streams-aggregation-data-loss-between-instance-restarts-and-rebalances].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-14172) bug: State stores lose state when tasks are reassigned under EOS wit…

2022-12-09 Thread John Gray (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645330#comment-17645330
 ] 

John Gray edited comment on KAFKA-14172 at 12/9/22 2:53 PM:


I do hope this bug gets some eyes eventually; it has been pretty devastating 
for us. It is still a problem even on Streams 3.3.1 w/ broker 3.1.0. I amend my 
previous statement that this is related to our brokers rolling: it actually 
seems it can happen any time our restore consumers in EOS apps w/ big state try 
to restore said big state. Sadly, setting the acceptable.lag config does not 
help us because we do not have the extra space to run standby threads/replicas. 

We actually had to resort to dumping our keys to SQL Server and then querying 
for "should this key exist?" when we are pulling keys from our state store. If 
the state store returns nothing and SQL Server says it has seen that key before, 
we kill the app, which causes us to pull the state again, which then fixes 
the issue. Something is _very_ wrong with these restore consumers (or I am 
doing something horribly wrong in this app, although we never had this problem 
before Kafka 3.0.0 or 3.1.0).


was (Author: gray.john):
I do hope this bug gets some eyes eventually; it has been pretty devastating 
for us. It is still a problem even on Streams 3.3.1 w/ broker 3.1.0. I amend my 
previous statement that this is related to our brokers rolling: it actually 
seems it can happen any time our restore consumers in EOS apps w/ big state try 
to restore said big state. Sadly, setting the acceptable.lag config does not 
help us because we do not have the extra space to run standby threads/replicas. 

We actually had to resort to dumping our keys to SQL Server and then querying 
for "should this key exist?" when we are pulling keys from our state store. If 
the state store returns nothing and SQL Server says it has seen that key before, 
we kill the app, which causes us to pull the state again, which usually fixes 
the issue. Something is _very_ wrong with these restore consumers (or I am 
doing something horribly wrong in this app, although we never had this problem 
before Kafka 3.0.0 or 3.1.0).

> bug: State stores lose state when tasks are reassigned under EOS wit…
> -
>
> Key: KAFKA-14172
> URL: https://issues.apache.org/jira/browse/KAFKA-14172
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.1.1
>Reporter: Martin Hørslev
>Priority: Critical
>
> h1. State stores lose state when tasks are reassigned under EOS with standby 
> replicas and default acceptable lag.
> I have observed that state stores used in a transform step under Exactly Once 
> semantics end up losing state after a rebalancing event that includes 
> reassignment of tasks to a previous standby task within the acceptable standby 
> lag.
>  
> The problem is reproducible, and an integration test has been created to 
> showcase the [issue|https://github.com/apache/kafka/pull/12540]. 
> A detailed description of the observed issue is provided 
> [here|https://github.com/apache/kafka/pull/12540/files?short_path=3ca480e#diff-3ca480ef093a1faa18912e1ebc679be492b341147b96d7a85bda59911228ef45]
> Similar issues have been observed and reported to StackOverflow for example 
> [here|https://stackoverflow.com/questions/69038181/kafka-streams-aggregation-data-loss-between-instance-restarts-and-rebalances].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-14172) bug: State stores lose state when tasks are reassigned under EOS wit…

2022-12-09 Thread John Gray (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645330#comment-17645330
 ] 

John Gray edited comment on KAFKA-14172 at 12/9/22 2:52 PM:


I do hope this bug gets some eyes eventually; it has been pretty devastating 
for us. It is still a problem even on Streams 3.3.1. I amend my previous 
statement that this is related to our brokers rolling: it actually seems it can 
happen any time our restore consumers in EOS apps w/ big state try to restore 
said big state. Sadly, setting the acceptable.lag config does not help us 
because we do not have the extra space to run standby threads/replicas. 

We actually had to resort to dumping our keys to SQL Server and then querying 
for "should this key exist?" when we are pulling keys from our state store. If 
the state store returns nothing and SQL Server says it has seen that key before, 
we kill the app, which causes us to pull the state again, which usually fixes 
the issue. Something is _very_ wrong with these restore consumers (or I am 
doing something horribly wrong in this app, although we never had this problem 
before Kafka 3.0.0 or 3.1.0).


was (Author: gray.john):
I do hope this bug gets some eyes eventually; it has been pretty devastating 
for us. I amend my previous statement that this is related to our brokers 
rolling: it actually seems it can happen any time our restore consumers in EOS 
apps w/ big state try to restore said big state. Sadly, setting the 
acceptable.lag config does not help us because we do not have the extra space 
to run standby threads/replicas. 

We actually had to resort to dumping our keys to SQL Server and then querying 
for "should this key exist?" when we are pulling keys from our state store. If 
the state store returns nothing and SQL Server says it has seen that key before, 
we kill the app, which causes us to pull the state again, which usually fixes 
the issue. Something is _very_ wrong with these restore consumers (or I am 
doing something horribly wrong in this app, although we never had this problem 
before Kafka 3.0.0 or 3.1.0).

> bug: State stores lose state when tasks are reassigned under EOS wit…
> -
>
> Key: KAFKA-14172
> URL: https://issues.apache.org/jira/browse/KAFKA-14172
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.1.1
>Reporter: Martin Hørslev
>Priority: Critical
>
> h1. State stores lose state when tasks are reassigned under EOS with standby 
> replicas and default acceptable lag.
> I have observed that state stores used in a transform step under Exactly Once 
> semantics end up losing state after a rebalancing event that includes 
> reassignment of tasks to a previous standby task within the acceptable standby 
> lag.
>  
> The problem is reproducible, and an integration test has been created to 
> showcase the [issue|https://github.com/apache/kafka/pull/12540]. 
> A detailed description of the observed issue is provided 
> [here|https://github.com/apache/kafka/pull/12540/files?short_path=3ca480e#diff-3ca480ef093a1faa18912e1ebc679be492b341147b96d7a85bda59911228ef45]
> Similar issues have been observed and reported to StackOverflow for example 
> [here|https://stackoverflow.com/questions/69038181/kafka-streams-aggregation-data-loss-between-instance-restarts-and-rebalances].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14172) bug: State stores lose state when tasks are reassigned under EOS wit…

2022-12-09 Thread John Gray (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Gray updated KAFKA-14172:
--
Priority: Critical  (was: Major)

> bug: State stores lose state when tasks are reassigned under EOS wit…
> -
>
> Key: KAFKA-14172
> URL: https://issues.apache.org/jira/browse/KAFKA-14172
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.1.1
>Reporter: Martin Hørslev
>Priority: Critical
>
> h1. State stores lose state when tasks are reassigned under EOS with standby 
> replicas and default acceptable lag.
> I have observed that state stores used in a transform step under Exactly Once 
> semantics end up losing state after a rebalancing event that includes 
> reassignment of tasks to a previous standby task within the acceptable standby 
> lag.
>  
> The problem is reproducible, and an integration test has been created to 
> showcase the [issue|https://github.com/apache/kafka/pull/12540]. 
> A detailed description of the observed issue is provided 
> [here|https://github.com/apache/kafka/pull/12540/files?short_path=3ca480e#diff-3ca480ef093a1faa18912e1ebc679be492b341147b96d7a85bda59911228ef45]
> Similar issues have been observed and reported to StackOverflow for example 
> [here|https://stackoverflow.com/questions/69038181/kafka-streams-aggregation-data-loss-between-instance-restarts-and-rebalances].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14172) bug: State stores lose state when tasks are reassigned under EOS wit…

2022-12-09 Thread John Gray (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645330#comment-17645330
 ] 

John Gray commented on KAFKA-14172:
---

I do hope this bug gets some eyes eventually; it has been pretty devastating 
for us. I amend my previous statement that this is related to our brokers 
rolling: it actually seems it can happen any time our restore consumers in EOS 
apps w/ big state try to restore said big state. Sadly, setting the 
acceptable.lag config does not help us because we do not have the extra space 
to run standby threads/replicas. 

We actually had to resort to dumping our keys to SQL Server and then querying 
for "should this key exist?" when we are pulling keys from our state store. If 
the state store returns nothing and SQL Server says it has seen that key before, 
we kill the app, which causes us to pull the state again, which usually fixes 
the issue. Something is _very_ wrong with these restore consumers (or I am 
doing something horribly wrong in this app, although we never had this problem 
before Kafka 3.0.0 or 3.1.0).

> bug: State stores lose state when tasks are reassigned under EOS wit…
> -
>
> Key: KAFKA-14172
> URL: https://issues.apache.org/jira/browse/KAFKA-14172
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.1.1
>Reporter: Martin Hørslev
>Priority: Major
>
> h1. State stores lose state when tasks are reassigned under EOS with standby 
> replicas and default acceptable lag.
> I have observed that state stores used in a transform step under Exactly Once 
> semantics end up losing state after a rebalancing event that includes 
> reassignment of tasks to a previous standby task within the acceptable standby 
> lag.
>  
> The problem is reproducible, and an integration test has been created to 
> showcase the [issue|https://github.com/apache/kafka/pull/12540]. 
> A detailed description of the observed issue is provided 
> [here|https://github.com/apache/kafka/pull/12540/files?short_path=3ca480e#diff-3ca480ef093a1faa18912e1ebc679be492b341147b96d7a85bda59911228ef45]
> Similar issues have been observed and reported to StackOverflow for example 
> [here|https://stackoverflow.com/questions/69038181/kafka-streams-aggregation-data-loss-between-instance-restarts-and-rebalances].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-14172) bug: State stores lose state when tasks are reassigned under EOS wit…

2022-09-02 Thread John Gray (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599478#comment-17599478
 ] 

John Gray edited comment on KAFKA-14172 at 9/2/22 5:58 PM:
---

[~Horslev] we utilize static membership for our stateful apps, so for us the 
cluster upgrades seem to be about the only time they really rebalance. So I do 
think we lose data because of the rebalancing EOS consumers. The biggest 
difference, I think, is that we don't use standby replicas, yet we are still 
getting data loss during restores. I think having a standby replica with 
acceptable.recovery.lag set to 0 would be the only way around this bug, since I 
believe it is the restore consumers dropping the ball here, but, alas, we don't 
have the extra disk space for standby replicas.
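
For reference, a minimal Java sketch of the configuration being discussed (the 
Streams config is spelled acceptable.recovery.lag); the application id and 
instance id are invented and all values are illustrative only.

{code:java}
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class EosRecoveryConfig {
    public static Properties build(String instanceId) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-eos-app");        // assumed
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
        // Static membership, as mentioned above: a stable id per instance.
        props.put(StreamsConfig.mainConsumerPrefix("group.instance.id"), instanceId);
        // The idea discussed here: a standby plus a zero acceptable recovery lag,
        // so tasks only move to fully caught-up replicas. Costs extra disk space.
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
        props.put("acceptable.recovery.lag", 0L);
        return props;
    }
}
{code}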


was (Author: gray.john):
[~Horslev] we utilize static membership for our stateful apps, so for us the 
cluster upgrades seem to be about the only time they really rebalance. So I do 
think we lose data because of the rebalancing EOS consumers. The biggest 
difference, I think, is that we don't use standby replicas, yet we are still 
getting data loss during restores.

> bug: State stores lose state when tasks are reassigned under EOS wit…
> -
>
> Key: KAFKA-14172
> URL: https://issues.apache.org/jira/browse/KAFKA-14172
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.1.1
>Reporter: Martin Hørslev
>Priority: Major
>
> h1. State stores lose state when tasks are reassigned under EOS with standby 
> replicas and default acceptable lag.
> I have observed that state stores used in a transform step under Exactly Once 
> semantics end up losing state after a rebalancing event that includes 
> reassignment of tasks to a previous standby task within the acceptable standby 
> lag.
>  
> The problem is reproducible, and an integration test has been created to 
> showcase the [issue|https://github.com/apache/kafka/pull/12540]. 
> A detailed description of the observed issue is provided 
> [here|https://github.com/apache/kafka/pull/12540/files?short_path=3ca480e#diff-3ca480ef093a1faa18912e1ebc679be492b341147b96d7a85bda59911228ef45]
> Similar issues have been observed and reported to StackOverflow for example 
> [here|https://stackoverflow.com/questions/69038181/kafka-streams-aggregation-data-loss-between-instance-restarts-and-rebalances].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14172) bug: State stores lose state when tasks are reassigned under EOS wit…

2022-09-02 Thread John Gray (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599478#comment-17599478
 ] 

John Gray commented on KAFKA-14172:
---

[~Horslev] we utilize static membership for our stateful apps, so for us the 
cluster upgrades seem to be about the only time they really rebalance. So I do 
think we lose data because of the rebalancing EOS consumers. The biggest 
difference, I think, is that we don't use standby replicas, yet we are still 
getting data loss during restores.

> bug: State stores lose state when tasks are reassigned under EOS wit…
> -
>
> Key: KAFKA-14172
> URL: https://issues.apache.org/jira/browse/KAFKA-14172
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.1.1
>Reporter: Martin Hørslev
>Priority: Major
>
> h1. State stores lose state when tasks are reassigned under EOS with standby 
> replicas and default acceptable lag.
> I have observed that state stores used in a transform step under Exactly Once 
> semantics end up losing state after a rebalancing event that includes 
> reassignment of tasks to a previous standby task within the acceptable standby 
> lag.
>  
> The problem is reproducible, and an integration test has been created to 
> showcase the [issue|https://github.com/apache/kafka/pull/12540]. 
> A detailed description of the observed issue is provided 
> [here|https://github.com/apache/kafka/pull/12540/files?short_path=3ca480e#diff-3ca480ef093a1faa18912e1ebc679be492b341147b96d7a85bda59911228ef45]
> Similar issues have been observed and reported to StackOverflow for example 
> [here|https://stackoverflow.com/questions/69038181/kafka-streams-aggregation-data-loss-between-instance-restarts-and-rebalances].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-14172) bug: State stores lose state when tasks are reassigned under EOS wit…

2022-09-01 Thread John Gray (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599002#comment-17599002
 ] 

John Gray edited comment on KAFKA-14172 at 9/1/22 2:56 PM:
---

I know next to nothing about the internal workings of Kafka, sadly, but I am 
noticing that KAFKA-12486 was introduced in 3.1.0, which is the version where I 
started noticing problems. I see you helped out with that Jira, [~ableegoldman]; 
is there any possible way, in your mind, that it might cause weirdness with 
state restoration? 


was (Author: gray.john):
I know next to nothing about the internal workings of Kafka, sadly, but I am 
noticing that KAFKA-12486 was introduced in 3.1.0, which is the version where I 
started noticing problems. I notice you helped out with that Jira, 
[~ableegoldman]; is there any possible way, in your mind, that it might cause 
weirdness with state restoration? 

> bug: State stores lose state when tasks are reassigned under EOS wit…
> -
>
> Key: KAFKA-14172
> URL: https://issues.apache.org/jira/browse/KAFKA-14172
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.1.1
>Reporter: Martin Hørslev
>Priority: Major
>
> h1. State stores lose state when tasks are reassigned under EOS with standby 
> replicas and default acceptable lag.
> I have observed that state stores used in a transform step under Exactly Once 
> semantics end up losing state after a rebalancing event that includes 
> reassignment of tasks to a previous standby task within the acceptable standby 
> lag.
>  
> The problem is reproducible, and an integration test has been created to 
> showcase the [issue|https://github.com/apache/kafka/pull/12540]. 
> A detailed description of the observed issue is provided 
> [here|https://github.com/apache/kafka/pull/12540/files?short_path=3ca480e#diff-3ca480ef093a1faa18912e1ebc679be492b341147b96d7a85bda59911228ef45]
> Similar issues have been observed and reported to StackOverflow for example 
> [here|https://stackoverflow.com/questions/69038181/kafka-streams-aggregation-data-loss-between-instance-restarts-and-rebalances].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-14172) bug: State stores lose state when tasks are reassigned under EOS wit…

2022-09-01 Thread John Gray (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599002#comment-17599002
 ] 

John Gray edited comment on KAFKA-14172 at 9/1/22 2:44 PM:
---

I know next to nothing about the internal workings of Kafka, sadly, but I am 
noticing that KAFKA-12486 was introduced in 3.1.0, which is the version where I 
started noticing problems. I notice you helped out with that Jira, 
[~ableegoldman]; is there any possible way, in your mind, that it might cause 
weirdness with state restoration? 


was (Author: gray.john):
I know next to nothing about the internal workings of Kafka, sadly, but I am 
noticing that KAFKA-12486 was introduced in 3.1.0, which is the version where I 
started noticing problems. I notice you helped out with that Jira, 
[~ableegoldman]; is there any possible way, in your mind, it might cause 
weirdness with state restoration? 

> bug: State stores lose state when tasks are reassigned under EOS wit…
> -
>
> Key: KAFKA-14172
> URL: https://issues.apache.org/jira/browse/KAFKA-14172
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.1.1
>Reporter: Martin Hørslev
>Priority: Major
>
> h1. State stores lose state when tasks are reassigned under EOS with standby 
> replicas and default acceptable lag.
> I have observed that state stores used in a transform step under Exactly Once 
> semantics end up losing state after a rebalancing event that includes 
> reassignment of tasks to a previous standby task within the acceptable standby 
> lag.
>  
> The problem is reproducible, and an integration test has been created to 
> showcase the [issue|https://github.com/apache/kafka/pull/12540]. 
> A detailed description of the observed issue is provided 
> [here|https://github.com/apache/kafka/pull/12540/files?short_path=3ca480e#diff-3ca480ef093a1faa18912e1ebc679be492b341147b96d7a85bda59911228ef45]
> Similar issues have been observed and reported to StackOverflow for example 
> [here|https://stackoverflow.com/questions/69038181/kafka-streams-aggregation-data-loss-between-instance-restarts-and-rebalances].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-14172) bug: State stores lose state when tasks are reassigned under EOS wit…

2022-09-01 Thread John Gray (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598956#comment-17598956
 ] 

John Gray edited comment on KAFKA-14172 at 9/1/22 2:41 PM:
---

My stateful/EOS Kafka apps also seem to be struggling on 3.1.0+, with a similar 
theme: it appears the restore consumers are not consuming all of their messages 
for a full restore before processing begins. This sad situation seems to happen 
consistently after Strimzi rolls out an upgrade to our cluster. Once the 
brokers are all rolled, it seems to trigger a rebalance in our stateful apps, 
and then we lose data. We do not have the extra disk space for standby 
replicas, so the acceptable.recovery.lag config and the related standby-replica 
settings are not at play for us. But the restore consumers fumbling data w/ EOS 
seems to be a big problem for us as well. 
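
As a purely diagnostic aid (not a fix), a state restore listener can make an 
incomplete restore visible by comparing the changelog offset range reported at 
restore start with what was actually restored. The sketch below uses only the 
public Streams API; nothing in it is specific to this ticket.

{code:java}
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.streams.processor.StateRestoreListener;

public class LoggingRestoreListener implements StateRestoreListener {

    @Override
    public void onRestoreStart(TopicPartition partition, String storeName,
                               long startingOffset, long endingOffset) {
        System.out.printf("Restore start %s %s: offsets %d..%d%n",
                storeName, partition, startingOffset, endingOffset);
    }

    @Override
    public void onBatchRestored(TopicPartition partition, String storeName,
                                long batchEndOffset, long numRestored) {
        // no-op; cumulative counters could be tracked here
    }

    @Override
    public void onRestoreEnd(TopicPartition partition, String storeName, long totalRestored) {
        // If totalRestored falls well short of the range logged in onRestoreStart,
        // the store was handed back to processing before the changelog was fully replayed.
        System.out.printf("Restore end %s %s: %d records restored%n",
                storeName, partition, totalRestored);
    }
}
// Register before starting the topology:
// streams.setGlobalStateRestoreListener(new LoggingRestoreListener());
{code}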


was (Author: gray.john):
My stateful/EOS Kafka apps also seem to be struggling on 3.1.0+, with a similar 
theme: it appears the restore consumers are not consuming all of their messages 
for a full restore before processing begins. This sad situation seems to happen 
consistently after Strimzi rolls out an upgrade to our cluster. Once the 
brokers are all rolled, if our stateful apps rebalance, we lose data. We do not 
have the extra disk space for standby replicas, so the acceptable.recovery.lag 
config and the related standby-replica settings are not at play for us. But the 
restore consumers fumbling data w/ EOS seems to be a big problem for us as 
well. 

> bug: State stores lose state when tasks are reassigned under EOS wit…
> -
>
> Key: KAFKA-14172
> URL: https://issues.apache.org/jira/browse/KAFKA-14172
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.1.1
>Reporter: Martin Hørslev
>Priority: Major
>
> h1. State stores lose state when tasks are reassigned under EOS with standby 
> replicas and default acceptable lag.
> I have observed that state stores used in a transform step under Exactly Once 
> semantics end up losing state after a rebalancing event that includes 
> reassignment of tasks to a previous standby task within the acceptable standby 
> lag.
>  
> The problem is reproducible, and an integration test has been created to 
> showcase the [issue|https://github.com/apache/kafka/pull/12540]. 
> A detailed description of the observed issue is provided 
> [here|https://github.com/apache/kafka/pull/12540/files?short_path=3ca480e#diff-3ca480ef093a1faa18912e1ebc679be492b341147b96d7a85bda59911228ef45]
> Similar issues have been observed and reported to StackOverflow for example 
> [here|https://stackoverflow.com/questions/69038181/kafka-streams-aggregation-data-loss-between-instance-restarts-and-rebalances].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14172) bug: State stores lose state when tasks are reassigned under EOS wit…

2022-09-01 Thread John Gray (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599002#comment-17599002
 ] 

John Gray commented on KAFKA-14172:
---

I know next to nothing about the internal workings of Kafka, sadly, but I am 
noticing that KAFKA-12486 was introduced in 3.1.0, which is the version where I 
started noticing problems. I notice you helped out with that Jira, 
[~ableegoldman]; is there any possible way, in your mind, it might cause 
weirdness with state restoration? 

> bug: State stores lose state when tasks are reassigned under EOS wit…
> -
>
> Key: KAFKA-14172
> URL: https://issues.apache.org/jira/browse/KAFKA-14172
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.1.1
>Reporter: Martin Hørslev
>Priority: Major
>
> h1. State stores lose state when tasks are reassigned under EOS with standby 
> replicas and default acceptable lag.
> I have observed that state stores used in a transform step under Exactly Once 
> semantics end up losing state after a rebalancing event that includes 
> reassignment of tasks to a previous standby task within the acceptable standby 
> lag.
>  
> The problem is reproducible, and an integration test has been created to 
> showcase the [issue|https://github.com/apache/kafka/pull/12540]. 
> A detailed description of the observed issue is provided 
> [here|https://github.com/apache/kafka/pull/12540/files?short_path=3ca480e#diff-3ca480ef093a1faa18912e1ebc679be492b341147b96d7a85bda59911228ef45]
> Similar issues have been observed and reported to StackOverflow for example 
> [here|https://stackoverflow.com/questions/69038181/kafka-streams-aggregation-data-loss-between-instance-restarts-and-rebalances].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-14172) bug: State stores lose state when tasks are reassigned under EOS wit…

2022-09-01 Thread John Gray (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598956#comment-17598956
 ] 

John Gray edited comment on KAFKA-14172 at 9/1/22 1:29 PM:
---

My stateful/EOS Kafka apps also seem to be struggling on 3.1.0+, with a similar 
theme: it appears the restore consumers are not consuming all of their messages 
for a full restore before processing begins. This sad situation seems to happen 
consistently after Strimzi rolls out an upgrade to our cluster. Once the 
brokers are all rolled, if our stateful apps rebalance, we lose data. We do not 
have the extra disk space for standby replicas, so the acceptable.recovery.lag 
config and the related standby-replica settings are not at play for us. But the 
restore consumers fumbling data w/ EOS seems to be a big problem for us as 
well. 


was (Author: gray.john):
My stateful/EOS Kafka apps also seem to be struggling on 3.0.0+, with a similar 
theme: it appears the restore consumers are not consuming all of their messages 
for a full restore before processing begins. This sad situation seems to happen 
consistently after Strimzi rolls out an upgrade to our cluster. Once the 
brokers are all rolled, if our stateful apps rebalance, we lose data. We do not 
have the extra disk space for standby replicas, so the acceptable.recovery.lag 
config and the related standby-replica settings are not at play for us. But the 
restore consumers fumbling data w/ EOS seems to be a big problem for us as 
well. 

> bug: State stores lose state when tasks are reassigned under EOS wit…
> -
>
> Key: KAFKA-14172
> URL: https://issues.apache.org/jira/browse/KAFKA-14172
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.1.1
>Reporter: Martin Hørslev
>Priority: Major
>
> h1. State stores lose state when tasks are reassigned under EOS with standby 
> replicas and default acceptable lag.
> I have observed that state stores used in a transform step under Exactly Once 
> semantics end up losing state after a rebalancing event that includes 
> reassignment of tasks to a previous standby task within the acceptable standby 
> lag.
>  
> The problem is reproducible, and an integration test has been created to 
> showcase the [issue|https://github.com/apache/kafka/pull/12540]. 
> A detailed description of the observed issue is provided 
> [here|https://github.com/apache/kafka/pull/12540/files?short_path=3ca480e#diff-3ca480ef093a1faa18912e1ebc679be492b341147b96d7a85bda59911228ef45]
> Similar issues have been observed and reported to StackOverflow for example 
> [here|https://stackoverflow.com/questions/69038181/kafka-streams-aggregation-data-loss-between-instance-restarts-and-rebalances].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-14172) bug: State stores lose state when tasks are reassigned under EOS wit…

2022-09-01 Thread John Gray (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598956#comment-17598956
 ] 

John Gray edited comment on KAFKA-14172 at 9/1/22 1:12 PM:
---

My stateful/EOS Kafka apps also seem to be struggling on 3.0.0+, with a similar 
theme: it appears the restore consumers are not consuming all of their messages 
for a full restore before processing begins. This sad situation seems to happen 
consistently after Strimzi rolls out an upgrade to our cluster. Once the 
brokers are all rolled, if our stateful apps rebalance, we lose data. We do not 
have the extra disk space for standby replicas, so the acceptable.recovery.lag 
config and the related standby-replica settings are not at play for us. But the 
restore consumers fumbling data w/ EOS seems to be a big problem for us as 
well. 


was (Author: gray.john):
My stateful/EOS Kafka apps also seem to be struggling on 3.0.0+, with a similar 
theme: it appears the restore consumers are not consuming all of their messages 
for a full restore before processing begins. This sad situation seems to happen 
consistently after Strimzi rolls out an upgrade to our cluster. Once the 
brokers are all rolled, if our stateful apps rebalance, we lose data. We do not 
have the extra disk space for standby replicas, so the acceptable.recovery.lag 
config and the related standby-replica settings are not at play for us. But the 
restore consumers fumbling data w/ EOS seems to be a big problem for us. 

> bug: State stores lose state when tasks are reassigned under EOS wit…
> -
>
> Key: KAFKA-14172
> URL: https://issues.apache.org/jira/browse/KAFKA-14172
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.1.1
>Reporter: Martin Hørslev
>Priority: Major
>
> h1. State stores lose state when tasks are reassigned under EOS with standby 
> replicas and default acceptable lag.
> I have observed that state stores used in a transform step under Exactly Once 
> semantics end up losing state after a rebalancing event that includes 
> reassignment of tasks to a previous standby task within the acceptable standby 
> lag.
>  
> The problem is reproducible, and an integration test has been created to 
> showcase the [issue|https://github.com/apache/kafka/pull/12540]. 
> A detailed description of the observed issue is provided 
> [here|https://github.com/apache/kafka/pull/12540/files?short_path=3ca480e#diff-3ca480ef093a1faa18912e1ebc679be492b341147b96d7a85bda59911228ef45]
> Similar issues have been observed and reported to StackOverflow for example 
> [here|https://stackoverflow.com/questions/69038181/kafka-streams-aggregation-data-loss-between-instance-restarts-and-rebalances].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14172) bug: State stores lose state when tasks are reassigned under EOS wit…

2022-09-01 Thread John Gray (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598956#comment-17598956
 ] 

John Gray commented on KAFKA-14172:
---

My stateful/EOS Kafka apps also seem to be struggling on 3.0.0+, with a similar 
theme: it appears the restore consumers are not consuming all of their messages 
for a full restore before processing begins. This sad situation seems to happen 
consistently after Strimzi rolls out an upgrade to our cluster. Once the 
brokers are all rolled, if our stateful apps rebalance, we lose data. We do not 
have the extra disk space for standby replicas, so the acceptable.recovery.lag 
config and the related standby-replica settings are not at play for us. But the 
restore consumers fumbling data w/ EOS seems to be a big problem for us. 

> bug: State stores lose state when tasks are reassigned under EOS wit…
> -
>
> Key: KAFKA-14172
> URL: https://issues.apache.org/jira/browse/KAFKA-14172
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.1.1
>Reporter: Martin Hørslev
>Priority: Major
>
> h1. State stores lose state when tasks are reassigned under EOS with standby 
> replicas and default acceptable lag.
> I have observed that state stores used in a transform step under Exactly Once 
> semantics end up losing state after a rebalancing event that includes 
> reassignment of tasks to a previous standby task within the acceptable standby 
> lag.
>  
> The problem is reproducible, and an integration test has been created to 
> showcase the [issue|https://github.com/apache/kafka/pull/12540]. 
> A detailed description of the observed issue is provided 
> [here|https://github.com/apache/kafka/pull/12540/files?short_path=3ca480e#diff-3ca480ef093a1faa18912e1ebc679be492b341147b96d7a85bda59911228ef45]
> Similar issues have been observed and reported to StackOverflow for example 
> [here|https://stackoverflow.com/questions/69038181/kafka-streams-aggregation-data-loss-between-instance-restarts-and-rebalances].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-13335) Upgrading connect from 2.7.0 to 2.8.0 causes worker instability

2022-06-07 Thread John Gray (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Gray resolved KAFKA-13335.
---
Resolution: Not A Problem

Finally got back to this after a long time. This is no bug or fault of Kafka 
Connect. We have a lot of connectors, so it takes a while to rebalance all of 
them. We were simply constantly hitting the rebalance.timeout.ms, leaving us in 
an endless loop of rebalancing. Not sure what changed between 2.7.0 and 2.8.0 
to enforce this timeout or to lengthen the time to rebalance, but something 
did. Bumped the timeout to 3 minutes from 1 minute and we are good to go! 
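
For reference, the change described above as a distributed worker configuration 
excerpt; the property name and the one-minute-to-three-minutes bump come from 
this comment, while the file name and comments are illustrative.

{code}
# connect-distributed.properties (excerpt, illustrative)
# Give ~1000 connectors enough time to finish a rebalance instead of looping.
# Previously 60000 ms (1 minute).
rebalance.timeout.ms=180000
{code}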

> Upgrading connect from 2.7.0 to 2.8.0 causes worker instability
> ---
>
> Key: KAFKA-13335
> URL: https://issues.apache.org/jira/browse/KAFKA-13335
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 2.8.0
>Reporter: John Gray
>Priority: Major
> Attachments: image-2021-09-29-09-15-18-172.png
>
>
> After recently upgrading our connect cluster to 2.8.0 (via 
> strimzi+Kubernetes, brokers are still on 2.7.0), I am noticing that the 
> cluster is struggling to stabilize. Connectors are being 
> unassigned/reassigned/duplicated continuously, and never settling back down. 
> A downgrade back to 2.7.0 fixes things immediately. I have attached a picture 
> of our Grafana dashboards showing some metrics. We have a connect cluster 
> with 4 nodes, trying to maintain about 1000 connectors, each connector with a 
> maxTask of 1. 
> We are noticing a slow increase in memory usage with big random peaks of task 
> counts and thread counts.
> I also notice, over the course of letting 2.8.0 run, a huge increase in logs 
> stating that {code}ERROR Graceful stop of task (task name here) 
> failed.{code}, but the logs do not seem to indicate a reason. The connector 
> appears to be stopped only seconds after its creation. It appears to only 
> affect our source connectors. These logs stop after downgrading back to 2.7.0.
> I am also seeing an increase in logs stating that {code}Couldn't instantiate 
> task (task name) because it has an invalid task configuration. This task will 
> not execute until reconfigured. 
> (org.apache.kafka.connect.runtime.distributed.DistributedHerder) 
> [StartAndStopExecutor-connect-1-1]
> org.apache.kafka.connect.errors.ConnectException: Task already exists in this 
> worker: (task name)
>   at org.apache.kafka.connect.runtime.Worker.startTask(Worker.java:512)
>   at 
> org.apache.kafka.connect.runtime.distributed.DistributedHerder.startTask(DistributedHerder.java:1251)
>   at 
> org.apache.kafka.connect.runtime.distributed.DistributedHerder.access$1700(DistributedHerder.java:127)
>   at 
> org.apache.kafka.connect.runtime.distributed.DistributedHerder$10.call(DistributedHerder.java:1266)
>   at 
> org.apache.kafka.connect.runtime.distributed.DistributedHerder$10.call(DistributedHerder.java:1262)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:834){code}
> I am not sure what could be causing this; any insight would be appreciated! 
> I do notice Kafka 2.7.1/2.8.0 contains a bugfix related to connect rebalances 
> (KAFKA-10413). Is that fix potentially causing instability? 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (KAFKA-13335) Upgrading connect from 2.7.0 to 2.8.0 causes worker instability

2021-09-29 Thread John Gray (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Gray updated KAFKA-13335:
--
Description: 
After recently upgrading our connect cluster to 2.8.0 (via strimzi+Kubernetes, 
brokers are still on 2.7.0), I am noticing that the cluster is struggling to 
stabilize. Connectors are being unassigned/reassigned/duplicated continuously, 
and never settling back down. A downgrade back to 2.7.0 fixes things 
immediately. I have attached a picture of our Grafana dashboards showing some 
metrics. We have a connect cluster with 4 nodes, trying to maintain about 1000 
connectors, each connector with a maxTask of 1. 

We are noticing a slow increase in memory usage with big random peaks of task 
counts and thread counts.

I also notice, over the course of letting 2.8.0 run, a huge increase in logs 
stating that {code}ERROR Graceful stop of task (task name here) failed.{code}, 
but the logs do not seem to indicate a reason. The connector appears to be 
stopped only seconds after its creation. It appears to only affect our source 
connectors. These logs stop after downgrading back to 2.7.0.

I am also seeing an increase in logs stating that {code}Couldn't instantiate 
task (task name) because it has an invalid task configuration. This task will 
not execute until reconfigured. 
(org.apache.kafka.connect.runtime.distributed.DistributedHerder) 
[StartAndStopExecutor-connect-1-1]
org.apache.kafka.connect.errors.ConnectException: Task already exists in this 
worker: (task name)
at org.apache.kafka.connect.runtime.Worker.startTask(Worker.java:512)
at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.startTask(DistributedHerder.java:1251)
at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.access$1700(DistributedHerder.java:127)
at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder$10.call(DistributedHerder.java:1266)
at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder$10.call(DistributedHerder.java:1262)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834){code}

I am not sure what could be causing this; any insight would be appreciated! 
I do notice Kafka 2.7.1/2.8.0 contains a bugfix related to connect rebalances 
(KAFKA-10413). Is that fix potentially causing instability? 

  was:
After recently upgrading our connect cluster to 2.8.0 (via strimzi+Kubernetes, 
brokers are still on 2.7.0), I am noticing that the cluster is struggling to 
stabilize. Connectors are being unassigned/reassigned/duplicated continuously, 
and never settling back down. A downgrade back to 2.7.0 fixes things 
immediately. I have attached a picture of our Grafana dashboards showing some 
metrics. We have a connect cluster with 4 nodes, trying to maintain about 1000 
connectors, each connector with a maxTask of 1. 

We are noticing a slow increase in memory usage with big random peaks of task 
counts and thread counts.

I also notice, over the course of letting 2.8.0 run, a huge increase in logs 
stating that {code}ERROR Graceful stop of task (task name here) failed.{code}, 
but the logs do not seem to indicate a reason. The connector appears to be 
stopped only seconds after its creation. It appears to only affect our source 
connectors. These logs stop after downgrading back to 2.7.0.

I am also seeing an increase in logs stating that {code}Couldn't instantiate 
task (source task name) because it has an invalid task configuration. This task 
will not execute until reconfigured. 
(org.apache.kafka.connect.runtime.distributed.DistributedHerder) 
[StartAndStopExecutor-connect-1-1]
org.apache.kafka.connect.errors.ConnectException: Task already exists in this 
worker: (source task name)
at org.apache.kafka.connect.runtime.Worker.startTask(Worker.java:512)
at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.startTask(DistributedHerder.java:1251)
at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.access$1700(DistributedHerder.java:127)
at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder$10.call(DistributedHerder.java:1266)
at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder$10.call(DistributedHerder.java:1262)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834){code}

I am not 

[jira] [Updated] (KAFKA-13335) Upgrading connect from 2.7.0 to 2.8.0 causes worker instability

2021-09-29 Thread John Gray (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Gray updated KAFKA-13335:
--
Description: 
After recently upgrading our connect cluster to 2.8.0 (via strimzi+Kubernetes, 
brokers are still on 2.7.0), I am noticing that the cluster is struggling to 
stabilize. Connectors are being unassigned/reassigned/duplicated continuously, 
and never settling back down. A downgrade back to 2.7.0 fixes things 
immediately. I have attached a picture of our Grafana dashboards showing some 
metrics. We have a connect cluster with 4 nodes, trying to maintain about 1000 
connectors, each connector with a maxTask of 1. 

We are noticing a slow increase in memory usage with big random peaks of task 
counts and thread counts.

I also notice, over the course of letting 2.8.0 run, a huge increase in logs 
stating that {code}ERROR Graceful stop of task (task name here) failed.{code}, 
but the logs do not seem to indicate a reason. The connector appears to be 
stopped only seconds after its creation. It appears to only affect our source 
connectors. These logs stop after downgrading back to 2.7.0.

I am also seeing an increase in logs stating that {code}Couldn't instantiate 
task (source task name) because it has an invalid task configuration. This task 
will not execute until reconfigured. 
(org.apache.kafka.connect.runtime.distributed.DistributedHerder) 
[StartAndStopExecutor-connect-1-1]
org.apache.kafka.connect.errors.ConnectException: Task already exists in this 
worker: (source task name)
at org.apache.kafka.connect.runtime.Worker.startTask(Worker.java:512)
at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.startTask(DistributedHerder.java:1251)
at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.access$1700(DistributedHerder.java:127)
at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder$10.call(DistributedHerder.java:1266)
at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder$10.call(DistributedHerder.java:1262)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834){code}

I am not sure what could be causing this; any insight would be appreciated! 
I do notice Kafka 2.7.1/2.8.0 contains a bugfix related to connect rebalances 
(KAFKA-10413). Is that fix potentially causing instability? 

  was:
After recently upgrading our connect cluster to 2.8.0 (via strimzi+Kubernetes, 
brokers are still on 2.7.0), I am noticing that the cluster is struggling to 
stabilize. Connectors are being unassigned/reassigned/duplicated continuously, 
and never settling back down. A downgrade back to 2.7.0 fixes things 
immediately. I have attached a picture of our Grafana dashboards showing some 
metrics. We have a connect cluster with 4 nodes, trying to maintain about 1000 
connectors, each connector with a maxTask of 1. 

We are noticing a slow increase in memory usage with big random peaks of task 
counts and thread counts.

I also notice, over the course of letting 2.8.0 run, a huge increase in logs 
stating that {code}ERROR Graceful stop of task (task name here) failed.{code}, 
but the logs do not seem to indicate a reason. The connector appears to be 
stopped only seconds after its creation. It appears to only affect our source 
connectors. These logs stop after downgrading back to 2.7.0.

I am not sure what could be causing this; any insight would be appreciated! 
I do notice Kafka 2.7.1/2.8.0 contains a bugfix related to connect rebalances 
(KAFKA-10413). Is that fix potentially causing instability? 


> Upgrading connect from 2.7.0 to 2.8.0 causes worker instability
> ---
>
> Key: KAFKA-13335
> URL: https://issues.apache.org/jira/browse/KAFKA-13335
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 2.8.0
>Reporter: John Gray
>Priority: Major
> Attachments: image-2021-09-29-09-15-18-172.png
>
>
> After recently upgrading our connect cluster to 2.8.0 (via 
> strimzi+Kubernetes, brokers are still on 2.7.0), I am noticing that the 
> cluster is struggling to stabilize. Connectors are being 
> unassigned/reassigned/duplicated continuously, and never settling back down. 
> A downgrade back to 2.7.0 fixes things immediately. I have attached a picture 
> of our Grafana dashboards showing some metrics. We have a connect cluster 
> with 4 nodes, trying to maintain about 1000 connectors, each connector with a 
> maxTask of 1. 
> We are noticing a slow increase in 

[jira] [Created] (KAFKA-13335) Upgrading connect from 2.7.0 to 2.8.0 causes worker instability

2021-09-29 Thread John Gray (Jira)
John Gray created KAFKA-13335:
-

 Summary: Upgrading connect from 2.7.0 to 2.8.0 causes worker 
instability
 Key: KAFKA-13335
 URL: https://issues.apache.org/jira/browse/KAFKA-13335
 Project: Kafka
  Issue Type: Bug
  Components: KafkaConnect
Affects Versions: 2.8.0
Reporter: John Gray
 Attachments: image-2021-09-29-09-15-18-172.png

After recently upgrading our connect cluster to 2.8.0 (via strimzi+Kubernetes, 
brokers are still on 2.7.0), I am noticing that the cluster is struggling to 
stabilize. Connectors are being unassigned/reassigned/duplicated continuously, 
and never settling back down. A downgrade back to 2.7.0 fixes things 
immediately. I have attached a picture of our Grafana dashboards showing some 
metrics. We have a connect cluster with 4 nodes, trying to maintain about 1000 
connectors, each connector with a maxTask of 1. 

We are noticing a slow increase in memory usage with big random peaks of task 
counts and thread counts.

I also notice, over the course of letting 2.8.0 run, a huge increase in logs 
stating that {code}ERROR Graceful stop of task (task name here) failed.{code}, 
but the logs do not seem to indicate a reason. The connector appears to be 
stopped only seconds after its creation. It appears to only affect our source 
connectors. These logs stop after downgrading back to 2.7.0.

I am not sure what could be causing this; any insight would be appreciated! 
I do notice Kafka 2.7.1/2.8.0 contains a bugfix related to connect rebalances 
(KAFKA-10413). Is that fix potentially causing instability? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-10643) Static membership - repetitive PreparingRebalance with updating metadata for member reason

2021-09-09 Thread John Gray (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412602#comment-17412602
 ] 

John Gray commented on KAFKA-10643:
---

[~maatdeamon] hey, your understanding appears about as good as mine: if we 
restore for more time than metadata.max.age.ms, the broker forces a metadata 
update, which appears to cause a rebalance for static consumers, which in turn 
causes the consumers to restart the restore, and the cycle repeats. I am not 
sure there is any bug here, just a sneaky lil' config that needs to be noticed. 
Another issue we ran into was that after the consumer restores, it would 
immediately be kicked out of the group because of how long it went without 
polling, which is why we also had to bump max.poll.interval.ms up.

> Static membership - repetitive PreparingRebalance with updating metadata for 
> member reason
> --
>
> Key: KAFKA-10643
> URL: https://issues.apache.org/jira/browse/KAFKA-10643
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.6.0
>Reporter: Eran Levy
>Priority: Major
> Attachments: broker-4-11.csv, client-4-11.csv, 
> client-d-9-11-11-2020.csv
>
>
> Kafka streams 2.6.0, brokers version 2.6.0. Kafka nodes are healthy, kafka 
> streams app is healthy. 
> Configured with static membership. 
> Every 10 minutes (I assume because of topic.metadata.refresh.interval.ms), I 
> see the following group coordinator log for different stream consumers: 
> INFO [GroupCoordinator 2]: Preparing to rebalance group **--**-stream in 
> state PreparingRebalance with old generation 12244 (__consumer_offsets-45) 
> (reason: Updating metadata for member 
> -stream-11-1-013edd56-ed93-4370-b07c-1c29fbe72c9a) 
> (kafka.coordinator.group.GroupCoordinator)
> and right after that the following log: 
> INFO [GroupCoordinator 2]: Assignment received from leader for group 
> **-**-stream for generation 12246 (kafka.coordinator.group.GroupCoordinator)
>  
> I looked a bit at the kafka code and I'm not sure I get why such a thing is 
> happening - does this line describe the situation that happens here re the 
> "reason:"?[https://github.com/apache/kafka/blob/7ca299b8c0f2f3256c40b694078e422350c20d19/core/src/main/scala/kafka/coordinator/group/GroupCoordinator.scala#L311]
> I also don't see it happening too often in other kafka streams applications 
> that we have. 
> The only suspicious thing I see is that around every hour different pods of 
> that kafka streams application throw this exception: 
> {"timestamp":"2020-10-25T06:44:20.414Z","level":"INFO","thread":"**-**-stream-94561945-4191-4a07-ac1b-07b27e044402-StreamThread-1","logger":"org.apache.kafka.clients.FetchSessionHandler","message":"[Consumer
>  
> clientId=**-**-stream-94561945-4191-4a07-ac1b-07b27e044402-StreamThread-1-restore-consumer,
>  groupId=null] Error sending fetch request (sessionId=34683236, epoch=2872) 
> to node 
> 3:","context":"default","exception":"org.apache.kafka.common.errors.DisconnectException:
>  null\n"}
> I came across this strange behaviour after I started to investigate a strange 
> stuck rebalancing state: after one of the members left the group, the 
> rebalance got stuck - the only thing that I found is that maybe, because of 
> these too-frequent PreparingRebalance states, the app might be affected by 
> this bug - KAFKA-9752?
> I don't understand why it happens; it wasn't happening before I applied static 
> membership to that kafka streams application (around 2 weeks ago). 
> I will be happy if you can help me.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KAFKA-10643) Static membership - repetitive PreparingRebalance with updating metadata for member reason

2021-09-08 Thread John Gray (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411926#comment-17411926
 ] 

John Gray edited comment on KAFKA-10643 at 9/8/21, 2:53 PM:


We were having this same issue with our new static consumers once their 
changelog topics got large enough: the group would never stabilize because of 
these looping metadata updates. We ended up stabilizing our groups by 
increasing max.poll.interval.ms and metadata.max.age.ms in our streams apps to 
values longer than we expected our restore consumers to take to restore our 
large stores; 30 minutes ended up working for us. I am not sure whether a 
metadata update is expected to trigger a rebalance for a static consumer group 
with lots of restoring threads, but it certainly sent our groups with large 
state into a frenzy. It has been a while, so you may have moved on from this, 
but I would be curious to see whether these configs help your group, 
[~maatdeamon].
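
For reference, a minimal sketch of what those two overrides can look like in a 
Streams config (the application id and bootstrap servers below are placeholders, 
and 30 minutes is only the value that happened to work for us):

{code:java}
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class RestoreFriendlyConfig {

    public static Properties streamsProps() {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");  // placeholder

        final int thirtyMinutesMs = (int) Duration.ofMinutes(30).toMillis();

        // Give the consumers enough time between polls to sit through a long
        // restore without being kicked out of the group...
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG),
                  thirtyMinutesMs);

        // ...and push the periodic metadata refresh (the "Updating metadata for
        // member ..." rebalance trigger) out past the expected restore time too.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.METADATA_MAX_AGE_CONFIG),
                  thirtyMinutesMs);

        return props;
    }
}
{code}

StreamsConfig.mainConsumerPrefix(...) could be used instead of consumerPrefix(...) 
to scope the overrides to just the main consumer.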


was (Author: gray.john):
We were having this same issue with our new static consumers once their 
changelog topics got large enough. The group would never stabilize because of 
these looping metadata updates. We ended up stabilizing our groups by 
increasing max.poll.record.ms and metadata.max.age.ms in our streams apps to 
longer than however long we expected our restore consumer to take restoring our 
large stores. 30 minutes ended up working for us. I am not sure if it is 
expected that a metadata update should trigger a rebalance for a static 
consumer group with lots of restoring threads, but it certainly sent our groups 
with large state into a frenzy. It has been a while so you may have moved on 
from this, but I would be curious to see if these configs help your group, 
[~maatdeamon].






[jira] [Commented] (KAFKA-10643) Static membership - repetitive PreparingRebalance with updating metadata for member reason

2021-09-08 Thread John Gray (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411982#comment-17411982
 ] 

John Gray commented on KAFKA-10643:
---

[~maatdeamon] we did set both max.poll.interval.ms and metadata.max.age.ms to 
the same value, 30 minutes, but I think the key is that the value be larger than 
however long your state stores take to restore - assuming you have state stores 
to restore. We could certainly be running into similar but different problems.
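
If it is unclear how long the restores actually take (and therefore how large 
these values need to be), one rough way to measure it is a global state restore 
listener - a minimal sketch, assuming a KafkaStreams instance named streams:

{code:java}
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.streams.processor.StateRestoreListener;

// Logs how long each store/partition takes to restore, to help pick
// max.poll.interval.ms and metadata.max.age.ms values that comfortably exceed it.
public class RestoreTimingListener implements StateRestoreListener {

    private final Map<TopicPartition, Instant> startTimes = new ConcurrentHashMap<>();

    @Override
    public void onRestoreStart(final TopicPartition partition, final String storeName,
                               final long startingOffset, final long endingOffset) {
        startTimes.put(partition, Instant.now());
    }

    @Override
    public void onBatchRestored(final TopicPartition partition, final String storeName,
                                final long batchEndOffset, final long numRestored) {
        // per-batch progress is not needed for a rough timing
    }

    @Override
    public void onRestoreEnd(final TopicPartition partition, final String storeName,
                             final long totalRestored) {
        final Instant start = startTimes.remove(partition);
        if (start != null) {
            System.out.printf("Restored %d records for %s %s in %s%n",
                    totalRestored, storeName, partition,
                    Duration.between(start, Instant.now()));
        }
    }
}
{code}

The listener has to be registered before the app starts, e.g. 
streams.setGlobalStateRestoreListener(new RestoreTimingListener()); the printed 
durations give a floor for the two configs above.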






[jira] [Comment Edited] (KAFKA-10643) Static membership - repetitive PreparingRebalance with updating metadata for member reason

2021-09-08 Thread John Gray (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411926#comment-17411926
 ] 

John Gray edited comment on KAFKA-10643 at 9/8/21, 1:23 PM:


We were having this same issue with our new static consumers once their 
changelog topics got large enough. The group would never stabilize because of 
these looping metadata updates. We ended up stabilizing our groups by 
increasing max.poll.interval.ms and metadata.max.age.ms in our streams apps to 
longer than however long we expected our restore consumer to take restoring our 
large stores. 30 minutes ended up working for us. I am not sure if it is 
expected that a metadata update should trigger a rebalance for a static 
consumer group with lots of restoring threads, but it certainly sent our groups 
with large state into a frenzy. It has been a while so you may have moved on 
from this, but I would be curious to see if these configs help your group, 
[~maatdeamon].


was (Author: gray.john):
We were having this same issue with our new static consumers once their 
changelog topics got large enough. The group would never stabilize because of 
these looping metadata updates. We ended up stabilizing our groups by 
increasing max.poll.record.ms and metadata.max.age.ms in our streams apps to 
longer than however long we expected our restore consumer to take restoring our 
large stores. 30 minutes ended up working for us. I am not sure if this is 
expected that a metadata update should trigger a rebalance for a static 
consumer group with lots of restoring threads, but it certainly sent our groups 
with large state into a frenzy. It has been a while so you may have moved on 
from this, but I would be curious to see if these configs help your group, 
[~maatdeamon].






[jira] [Commented] (KAFKA-10643) Static membership - repetitive PreparingRebalance with updating metadata for member reason

2021-09-08 Thread John Gray (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411926#comment-17411926
 ] 

John Gray commented on KAFKA-10643:
---

We were having this same issue with our new static consumers once their 
changelog topics got large enough. The group would never stabilize because of 
these looping metadata updates. We ended up stabilizing our groups by 
increasing max.poll.interval.ms and metadata.max.age.ms in our streams apps to 
longer than however long we expected our restore consumer to take restoring our 
large stores. 30 minutes ended up working for us. I am not sure if this is 
expected that a metadata update should trigger a rebalance for a static 
consumer group with lots of restoring threads, but it certainly sent our groups 
with large state into a frenzy. It has been a while so you may have moved on 
from this, but I would be curious to see if these configs help your group, 
[~maatdeamon].






[jira] [Commented] (KAFKA-13037) "Thread state is already PENDING_SHUTDOWN" log spam

2021-07-14 Thread John Gray (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380543#comment-17380543
 ] 

John Gray commented on KAFKA-13037:
---

Awesome! Thank you, Sophie.

> "Thread state is already PENDING_SHUTDOWN" log spam
> ---
>
> Key: KAFKA-13037
> URL: https://issues.apache.org/jira/browse/KAFKA-13037
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.8.0, 2.7.1
>Reporter: John Gray
>Assignee: John Gray
>Priority: Major
> Fix For: 3.0.0, 2.8.1
>
>
> KAFKA-12462 introduced a 
> [change|https://github.com/apache/kafka/commit/4fe4cdc4a61cbac8e070a8b5514403235194015b#diff-76f629d0df8bd30b2593cbcf4a2dc80de3167ebf55ef8b5558e6e6285a057496R722]
>  that raised this "Thread state is already {}" log line from debug to info. 
> We are running into a problem with our streams apps: when they hit an 
> unrecoverable exception that shuts down the stream thread, we get this log 
> printed about 50,000 times per second per thread. I am guessing it is once 
> per record we have queued up when the exception happens? We have temporarily 
> raised the StreamThread logger to WARN instead of INFO to avoid the spam, but 
> we do miss the other good logs we get at INFO in that class. Could this log 
> be reverted back to debug? Thank you! 
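
A sketch of that temporary workaround, assuming a plain log4j.properties setup 
(the logger name is the Streams StreamThread class; a log4j2 or logback setup 
would need the equivalent override):

{code}
# Raise only the StreamThread logger to WARN so the repeated
# "Thread state is already PENDING_SHUTDOWN" INFO line is suppressed;
# the rest of Kafka Streams keeps logging at INFO.
log4j.logger.org.apache.kafka.streams.processor.internals.StreamThread=WARN
{code}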





[jira] [Updated] (KAFKA-13037) "Thread state is already PENDING_SHUTDOWN" log spam

2021-07-06 Thread John Gray (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Gray updated KAFKA-13037:
--
Description: KAFKA-12462 introduced a 
[change|https://github.com/apache/kafka/commit/4fe4cdc4a61cbac8e070a8b5514403235194015b#diff-76f629d0df8bd30b2593cbcf4a2dc80de3167ebf55ef8b5558e6e6285a057496R722]
 that increased this "Thread state is already {}" logger to info instead of 
debug. We are running into a problem with our streams apps when they hit an 
unrecoverable exception that shuts down the streams thread, we get this log 
printed about 50,000 times per second per thread. I am guessing it is once per 
record we have queued up when the exception happens? We have temporarily raised 
the StreamThread logger to WARN instead of INFO to avoid the spam, but we do 
miss the other good logs we get on INFO in that class. Could this log be 
reverted back to debug? Thank you!   (was: KAFKA-12462 introduced a 
[change|https://github.com/apache/kafka/commit/4fe4cdc4a61cbac8e070a8b5514403235194015b#diff-76f629d0df8bd30b2593cbcf4a2dc80de3167ebf55ef8b5558e6e6285a057496R722]
 that increased this "Thread state is already {}" logger to info instead of 
debug. We are running into a problem with our streams apps when they hit an 
unrecoverable exception that shuts down the streams thread, we get this log 
printed about 50,000 times per second per thread. I am guessing it is once per 
record we have queued up when the exception happens. We have temporarily raised 
the StreamThread logger to WARN instead of INFO to avoid the spam, but we do 
miss the other good logs we get on INFO in that class. Could this log be 
reverted back to debug? Thank you! )

> "Thread state is already PENDING_SHUTDOWN" log spam
> ---
>
> Key: KAFKA-13037
> URL: https://issues.apache.org/jira/browse/KAFKA-13037
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.8.0, 2.7.1
>Reporter: John Gray
>Priority: Major
>
> KAFKA-12462 introduced a 
> [change|https://github.com/apache/kafka/commit/4fe4cdc4a61cbac8e070a8b5514403235194015b#diff-76f629d0df8bd30b2593cbcf4a2dc80de3167ebf55ef8b5558e6e6285a057496R722]
>  that increased this "Thread state is already {}" logger to info instead of 
> debug. We are running into a problem with our streams apps when they hit an 
> unrecoverable exception that shuts down the streams thread, we get this log 
> printed about 50,000 times per second per thread. I am guessing it is once 
> per record we have queued up when the exception happens? We have temporarily 
> raised the StreamThread logger to WARN instead of INFO to avoid the spam, but 
> we do miss the other good logs we get on INFO in that class. Could this log 
> be reverted back to debug? Thank you! 





[jira] [Updated] (KAFKA-13037) "Thread state is already PENDING_SHUTDOWN" log spam

2021-07-06 Thread John Gray (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Gray updated KAFKA-13037:
--
Description: KAFKA-12462 introduced a 
[change|https://github.com/apache/kafka/commit/4fe4cdc4a61cbac8e070a8b5514403235194015b#diff-76f629d0df8bd30b2593cbcf4a2dc80de3167ebf55ef8b5558e6e6285a057496R722]
 that increased this "Thread state is already {}" logger to info instead of 
debug. We are running into a problem with our streams apps when they hit an 
unrecoverable exception that shuts down the streams thread, we get this log 
printed about 50,000 times per second per thread. I am guessing it is once per 
record we have queued up when the exception happens. We have temporarily raised 
the StreamThread logger to WARN instead of INFO to avoid the spam, but we do 
miss the other good logs we get on INFO in that class. Could this log be 
reverted back to debug? Thank you!   (was: KAFKA-12462 introduced a 
[change|https://github.com/apache/kafka/commit/4fe4cdc4a61cbac8e070a8b5514403235194015b#diff-76f629d0df8bd30b2593cbcf4a2dc80de3167ebf55ef8b5558e6e6285a057496R722]
 that increased this "Thread state is already {}" logger to info instead of 
debug. We are running into a problem with our streams apps that when they hit 
an unrecoverable exception that shuts down the streams thread, we get this log 
printed about 50,000 times per second per thread. I am guessing it is once per 
record we have queued up when the exception happens. We have temporarily raised 
the StreamThread logger to WARN instead of INFO to avoid the spam, but we do 
miss the other good logs we get on INFO in that class. Could this log be 
reverted back to debug? Thank you! )

> "Thread state is already PENDING_SHUTDOWN" log spam
> ---
>
> Key: KAFKA-13037
> URL: https://issues.apache.org/jira/browse/KAFKA-13037
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.8.0, 2.7.1
>Reporter: John Gray
>Priority: Major
>
> KAFKA-12462 introduced a 
> [change|https://github.com/apache/kafka/commit/4fe4cdc4a61cbac8e070a8b5514403235194015b#diff-76f629d0df8bd30b2593cbcf4a2dc80de3167ebf55ef8b5558e6e6285a057496R722]
>  that increased this "Thread state is already {}" logger to info instead of 
> debug. We are running into a problem with our streams apps when they hit an 
> unrecoverable exception that shuts down the streams thread, we get this log 
> printed about 50,000 times per second per thread. I am guessing it is once 
> per record we have queued up when the exception happens. We have temporarily 
> raised the StreamThread logger to WARN instead of INFO to avoid the spam, but 
> we do miss the other good logs we get on INFO in that class. Could this log 
> be reverted back to debug? Thank you! 





[jira] [Created] (KAFKA-13037) "Thread state is already PENDING_SHUTDOWN" log spam

2021-07-06 Thread John Gray (Jira)
John Gray created KAFKA-13037:
-

 Summary: "Thread state is already PENDING_SHUTDOWN" log spam
 Key: KAFKA-13037
 URL: https://issues.apache.org/jira/browse/KAFKA-13037
 Project: Kafka
  Issue Type: Bug
  Components: streams
Affects Versions: 2.7.1, 2.8.0
Reporter: John Gray


KAFKA-12462 introduced a 
[change|https://github.com/apache/kafka/commit/4fe4cdc4a61cbac8e070a8b5514403235194015b#diff-76f629d0df8bd30b2593cbcf4a2dc80de3167ebf55ef8b5558e6e6285a057496R722]
 that increased this "Thread state is already {}" logger to info instead of 
debug. We are running into a problem with our streams apps that when they hit 
an unrecoverable exception that shuts down the streams thread, we get this log 
printed about 50,000 times per second per thread. I am guessing it is once per 
record we have queued up when the exception happens. We have temporarily raised 
the StreamThread logger to WARN instead of INFO to avoid the spam, but we do 
miss the other good logs we get on INFO in that class. Could this log be 
reverted back to debug? Thank you! 


