[jira] [Commented] (KAFKA-14548) Stable streams applications stall due to infrequent restoreConsumer polls

2022-12-27 Thread Matthias J. Sax (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-14548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17652327#comment-17652327 ]

Matthias J. Sax commented on KAFKA-14548:
-

Thanks!

> Stable streams applications stall due to infrequent restoreConsumer polls
> -
>
> Key: KAFKA-14548
> URL: https://issues.apache.org/jira/browse/KAFKA-14548
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Greg Harris
>Priority: Major
>
> We have observed behavior with Streams where otherwise healthy applications 
> stall and become unable to process data after a rebalance 
> (https://issues.apache.org/jira/browse/KAFKA-13405). The root cause is that a 
> restoreConsumer can be partitioned from the Kafka cluster with stale 
> metadata, while the mainConsumer remains healthy with up-to-date metadata. 
> This is due to both an issue in Streams and an issue in the consumer logic.
> In StoreChangelogReader, a long-lived restoreConsumer is kept instantiated 
> while the streams app is running. This consumer is only `poll()`ed when the 
> ChangelogReader::restore method is called and at least one changelog is in 
> the RESTORING state. This may be very infrequent if the streams app is stable.
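> For illustration, a minimal sketch of this call pattern (hypothetical names, 
> not the actual StoreChangelogReader code):
> {code:java}
> import java.time.Duration;
> import org.apache.kafka.clients.consumer.Consumer;
> import org.apache.kafka.clients.consumer.ConsumerRecords;
> 
> class RestoreLoopSketch {
>     private final Consumer<byte[], byte[]> restoreConsumer;
> 
>     RestoreLoopSketch(Consumer<byte[], byte[]> restoreConsumer) {
>         this.restoreConsumer = restoreConsumer;
>     }
> 
>     // Called on every pass of the main loop; the flag is only true while
>     // some changelog is in the RESTORING state.
>     void restore(boolean anyChangelogRestoring) {
>         if (!anyChangelogRestoring) {
>             return; // steady state: no poll(), and so no network activity
>         }
>         ConsumerRecords<byte[], byte[]> records = restoreConsumer.poll(Duration.ZERO);
>         // ... apply records to the state stores ...
>     }
> }
> {code}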
> This is an anti-pattern, as frequent poll()s are expected to keep Kafka 
> consumers in contact with the Kafka cluster. Infrequent polls are considered 
> failures from the perspective of the consumer API. From the [official Kafka 
> Consumer documentation|https://kafka.apache.org/33/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html]:
> {noformat}
> The poll API is designed to ensure consumer liveness.
> ...
> So to stay in the group, you must continue to call poll.
> ...
> The recommended way to handle these cases [where the main thread is not ready 
> for more data] is to move message processing to another thread, which allows 
> the consumer to continue calling poll while the processor is still working.
> ...
> Note also that you will need to pause the partition so that no new records 
> are received from poll until after thread has finished handling those 
> previously returned.{noformat}
> With the current behavior, it is expected that the restoreConsumer will fall 
> out of the group regularly and be considered failed, even when the rest of 
> the application is running exactly as intended.
> This is not normally an issue, as falling out of the group is easily repaired 
> by rejoining the group during the next poll. It does mean slightly higher 
> latency when performing a restore, but that does not appear to be a major 
> concern at this time.
> This does become an issue when other, deeper assumptions about the usage of 
> Kafka clients are violated. Relevant to this issue, the client metadata 
> management logic assumes that regular polling will take place, and that the 
> regular poll call can be piggy-backed to initiate a metadata update. Without 
> a regular poll, the regular metadata update cannot be performed, and the 
> consumer violates its own `metadata.max.age.ms` configuration. This leaves 
> the restoreConsumer with much older metadata, containing none of the 
> currently live brokers, partitioning it from the cluster.
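> A config sketch to make the violated contract concrete (the broker address 
> is a placeholder):
> {code:java}
> import java.util.Properties;
> import org.apache.kafka.clients.consumer.ConsumerConfig;
> 
> Properties props = new Properties();
> props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");
> // The contract: refresh cluster metadata at least this often (the default
> // is 5 minutes). In practice the refresh is piggy-backed on blocking API
> // calls such as poll(), so a consumer that is never polled never refreshes,
> // regardless of this setting.
> props.put(ConsumerConfig.METADATA_MAX_AGE_CONFIG, "300000");
> {code}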
> Alleviating this failure mode does not _require_ the Streams polling 
> behavior to change, as solutions for all clients have been considered 
> (https://issues.apache.org/jira/browse/KAFKA-3068 and that family of 
> duplicate issues).
> However, as a tactical fix for the issue, and one which does not require a 
> KIP changing the behavior of _every Kafka client_, we should consider 
> changing the restoreConsumer poll behavior to bring it closer to the expected 
> happy path of at least one poll() every poll.interval.ms.
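> A hypothetical sketch of that keep-alive behavior (simplified names; note 
> that poll() is only legal while the consumer has a subscription or 
> assignment, so the idle poll is guarded):
> {code:java}
> import java.time.Duration;
> import org.apache.kafka.clients.consumer.Consumer;
> 
> class KeepAliveSketch {
>     private final Consumer<byte[], byte[]> restoreConsumer;
>     private final long pollIntervalMs; // target: at least one poll() per interval
>     private long lastPollMs = System.currentTimeMillis();
> 
>     KeepAliveSketch(Consumer<byte[], byte[]> restoreConsumer, long pollIntervalMs) {
>         this.restoreConsumer = restoreConsumer;
>         this.pollIntervalMs = pollIntervalMs;
>     }
> 
>     // Called on every pass of the main loop, even when nothing is restoring.
>     void maybeKeepAlive(long nowMs) {
>         if (nowMs - lastPollMs < pollIntervalMs || restoreConsumer.assignment().isEmpty()) {
>             return;
>         }
>         // Partitions that are assigned but not actively restoring would have
>         // to be pause()d first, so this idle poll cannot return records.
>         restoreConsumer.poll(Duration.ZERO); // piggy-backs a metadata refresh if one is due
>         lastPollMs = nowMs;
>     }
> }
> {code}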
> If there is another hidden assumption of the clients that relies on regular 
> polling, then this tactical fix may prevent users of the Streams library from 
> being affected, reducing the impact of that hidden assumption through 
> defense-in-depth.
> This would also be a backportable fix for Streams users, whereas a fix to the 
> consumers would only apply to new consumer versions.





[jira] [Commented] (KAFKA-14548) Stable streams applications stall due to infrequent restoreConsumer polls

2022-12-27 Thread Greg Harris (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-14548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17652313#comment-17652313 ]

Greg Harris commented on KAFKA-14548:
-

I've linked all of the tickets that I know about, including:
 * https://issues.apache.org/jira/browse/KAFKA-1843 (the earliest)
 * https://issues.apache.org/jira/browse/KAFKA-3068 (the most discussion)
 * https://issues.apache.org/jira/browse/KAFKA-8206 (where a KIP was discussed 
but never opened)
 * https://issues.apache.org/jira/browse/KAFKA-12480 (the most accurate title)

As of now, the KIP has not been drafted or published.






[jira] [Commented] (KAFKA-14548) Stable streams applications stall due to infrequent restoreConsumer polls

2022-12-27 Thread Matthias J. Sax (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-14548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17652292#comment-17652292 ]

Matthias J. Sax commented on KAFKA-14548:
-

Thanks! I of course have an interest in getting this addressed. Which client 
tickets would need to be tackled? Are they all linked to this ticket? If we 
understand what needs to be done, I am happy to make a case for getting this 
prioritized. Also, which KIP did you refer to?






[jira] [Commented] (KAFKA-14548) Stable streams applications stall due to infrequent restoreConsumer polls

2022-12-27 Thread Greg Harris (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-14548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17652289#comment-17652289 ]

Greg Harris commented on KAFKA-14548:
-

[~mjsax] Thanks for your patience on this issue. I will no longer be pursuing 
this specific change.

I explored the above proposed fix more deeply, and it appears that it is not 
reasonable to add to Streams. This is because it is illegal to call poll() on a 
consumer with no active subscriptions, which is very often the case for the 
restoreConsumer. In order to implement the workaround in Streams, we would 
have to make use of the `Consumer::partitionsFor` call to force a metadata 
update. The StoreChangelogReader is given its TopicPartitions by the 
ProcessorStateManager, and it does not make sense to change that control flow. 
I also will not suggest a throw-away call to partitionsFor in order to resolve 
this issue, and will instead pursue the clients-based fix.
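For concreteness, a sketch of the throw-away call being rejected here (the 
changelog topic name is hypothetical):
{code:java}
import java.util.List;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.common.PartitionInfo;

class MetadataNudgeSketch {
    // The rejected workaround: the result is discarded, and the call exists
    // only to make the consumer fetch topic metadata. Unlike poll(), it is
    // legal even when the consumer has no subscription or assignment.
    static void nudgeMetadata(Consumer<byte[], byte[]> restoreConsumer) {
        List<PartitionInfo> ignored =
            restoreConsumer.partitionsFor("app-id-store-changelog");
    }
}
{code}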

Closing as duplicate of https://issues.apache.org/jira/browse/KAFKA-13405


[jira] [Commented] (KAFKA-14548) Stable streams applications stall due to infrequent restoreConsumer polls

2022-12-23 Thread Greg Harris (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-14548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17651700#comment-17651700 ]

Greg Harris commented on KAFKA-14548:
-

> Note that the JavaDoc you quote is about a consumer that is part of a 
> consumer group. However, the restore consumer is a "stand-alone" consumer, 
> not part of any group, and thus periodic polling is not necessary. There is 
> no consumer group, group management, heartbeating, etc.

Yes, I understand that the javadoc is referencing consumer group membership, 
and thanks for pointing out that the restoreConsumer does not make use of a 
consumer group. However, as many, if not most, consumers do use consumer group 
membership, it is understandable that someone 'forgot' about the stand-alone 
consumers and their use case when implementing something. I do not expect that 
this metadata refresh bug is the only assumption that will ever be made that 
disproportionately affects stand-alone consumers. I think that the Streams 
library can protect itself from bugs of this sort, ones we don't even know 
about yet, by conforming to the dominant usage pattern of the consumers.

> I am not an expert on the consumer, but I would expect that the restore 
> consumer would refresh its metadata when we use it again if its cached 
> metadata aged out (for any API call, not just poll()) after a longer pause? 
> Thus, as long as its bootstrap servers are reachable, it should be able to 
> refresh its metadata.

This is not how the consumer has worked, as far as I know, since 2016, when the 
first tickets and discussions about falling back to bootstrap.servers were 
opened. What I described is the current behavior of the consumer, until a KIP 
is passed to change it. This has not received the attention it needs, maybe 
because people view it as more of a failure scenario rather than something that 
happens during normal, happy-path execution. When I was reading those issues 
originally, I thought that only a network partition could cause this behavior, 
until I realized it could also be caused by infrequent polling.

> if it's a client issue, we should not put a workaround into Streams to mask 
> the client issue, but rather fix the client.

If we were in an incident analysis using the Swiss cheese model, both the 
clients and Streams could be implicated, because either library could have 
taken steps to prevent the outage. Yes, it is sufficient to fix it only in the 
clients, and it is reasonable in the abstract for Streams to say it's "not my 
job" to fix it. But in practical terms, Streams is particularly affected by 
this bug because of its usage patterns, and users of Streams are disempowered 
to fix the poll behavior themselves, short of forking the project.


I think that we should pursue fixes for both the clients in general and Streams 
in particular. We've already requested the former (there are roughly eight 
tickets doing so), and this ticket requests the latter.


[jira] [Commented] (KAFKA-14548) Stable streams applications stall due to infrequent restoreConsumer polls

2022-12-22 Thread Matthias J. Sax (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-14548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17651440#comment-17651440 ]

Matthias J. Sax commented on KAFKA-14548:
-

{quote}This is an anti-pattern, as frequent poll()s are expected to keep Kafka 
consumers in contact with the Kafka cluster.
{quote}
Well, not really. Note that the JavaDoc you quote is about a consumer that is 
part of a consumer group. However, the restore consumer is a "stand-alone" 
consumer, not part of any group, and thus periodic polling is not necessary. 
There is no consumer group, group management, heartbeating, etc.
{quote}Without a regular poll, the regular metadata update cannot be performed, 
and the consumer violates its own `metadata.max.age.ms` configuration. This 
leaves the restoreConsumer with much older metadata, containing none of the 
currently live brokers, partitioning it from the cluster.
{quote}
I am not an expert on the consumer, but I would expect that the restore 
consumer would refresh its metadata when we use it again if its cached 
metadata aged out (for any API call, not just poll()) after a longer pause? 
Thus, as long as its bootstrap servers are reachable, it should be able to 
refresh its metadata. Back in the day I filed a follow-up ticket for the 
clients about cached IPs: https://issues.apache.org/jira/browse/KAFKA-13467

So far, I still think that there is nothing we can (i.e., should) do in 
Streams: if it's a client issue, we should not put a workaround into Streams to 
mask the client issue, but rather fix the client.

Thoughts?


[jira] [Commented] (KAFKA-14548) Stable streams applications stall due to infrequent restoreConsumer polls

2022-12-22 Thread Greg Harris (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-14548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17651401#comment-17651401 ]

Greg Harris commented on KAFKA-14548:
-

[~mjsax] As you had previously categorized 
https://issues.apache.org/jira/browse/KAFKA-13405 (which has the exact same 
cause and symptoms as this issue) as Not A Bug, do you think the reasoning for 
the above tactical fix makes sense?



