[ https://issues.apache.org/jira/browse/KAFKA-14548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17651440#comment-17651440 ]

Matthias J. Sax commented on KAFKA-14548:
-----------------------------------------

{quote}This is an anti-pattern, as frequent poll()s are expected to keep kafka 
consumers in contact with the kafka cluster.
{quote}
Well, not really. Note that the JavaDoc you quote is about a consumer that is 
part of a consumer group. However, the restore consumer is a "standalone" 
consumer that is not part of any group, so periodic polling is not necessary. 
There is no consumer group, no group management, and no heartbeating.
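To illustrate the distinction, a minimal sketch (topic and config names are 
hypothetical): a consumer that subscribe()s joins a group and must keep 
calling poll() within max.poll.interval.ms to stay in it, whereas a consumer 
that assign()s its partitions manually has no group to fall out of.
{noformat}
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class StandaloneVsGroupMember {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        // Group member: subscribe() requires group.id; failing to poll()
        // within max.poll.interval.ms gets the member kicked from the group.
        Properties groupProps = (Properties) props.clone();
        groupProps.put("group.id", "example-group"); // hypothetical
        try (KafkaConsumer<byte[], byte[]> member = new KafkaConsumer<>(groupProps)) {
            member.subscribe(List.of("example-topic"));
            member.poll(Duration.ofMillis(100)); // must recur to stay in the group
        }

        // Standalone (like the restore consumer): assign() involves no group
        // at all, so an arbitrarily long gap between poll()s evicts nothing.
        try (KafkaConsumer<byte[], byte[]> standalone = new KafkaConsumer<>(props)) {
            standalone.assign(List.of(new TopicPartition("example-changelog", 0)));
            standalone.poll(Duration.ofMillis(100)); // only when records are wanted
        }
    }
}
{noformat}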
{quote}Without a regular poll, the regular metadata update cannot be performed, 
and the consumer violates its own `metadata.max.age.ms` configuration. This 
leads to the restoreConsumer having a much older metadata containing none of 
the currently live brokers, partitioning it from the cluster.
{quote}
I am not an expert on the consumer, but I would expect that, after a longer 
pause, the restore consumer refreshes its metadata when we use it again and 
its cached metadata has aged out (for any API call, not just poll()). Thus, 
as long as its bootstrap servers are reachable, it should be able to refresh 
its metadata. Back in the day I filed a follow-up ticket for the clients 
about cached IPs: https://issues.apache.org/jira/browse/KAFKA-13467 
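If that expectation holds, something like the following should be enough to 
re-learn the live brokers after a long idle period, because partitionsFor() 
issues its own metadata request rather than relying on poll() (a sketch; 
names are hypothetical):
{noformat}
// A sketch: props holds the usual bootstrap.servers and deserializer configs;
// metadata.max.age.ms (default 300000 ms / 5 minutes) bounds how stale the
// cached metadata may get before a refresh is forced.
Properties props = new Properties();
props.put("bootstrap.servers", "broker-1:9092,broker-2:9092"); // hypothetical
props.put("metadata.max.age.ms", "300000");
props.put("key.deserializer",
    "org.apache.kafka.common.serialization.ByteArrayDeserializer");
props.put("value.deserializer",
    "org.apache.kafka.common.serialization.ByteArrayDeserializer");

try (KafkaConsumer<byte[], byte[]> restoreConsumer = new KafkaConsumer<>(props)) {
    // partitionsFor() blocks on a metadata fetch for the topic, so even a
    // consumer that has not poll()ed for hours re-learns the live brokers
    // here, provided some known or bootstrap address still resolves.
    List<PartitionInfo> partitions =
        restoreConsumer.partitionsFor("example-changelog"); // hypothetical topic
    System.out.println("partitions known after refresh: " + partitions.size());
}
{noformat}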

So far, I still think that there is nothing we can (i.e., should) do in 
Streams: if it's a client issue, we should not put a workaround into Streams 
to mask the client issue but rather fix the client.

Thoughts?

> Stable streams applications stall due to infrequent restoreConsumer polls
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-14548
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14548
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: Greg Harris
>            Priority: Major
>
> We have observed behavior with Streams where otherwise healthy applications 
> stall and become unable to process data after a rebalance 
> (https://issues.apache.org/jira/browse/KAFKA-13405). The root cause is that 
> a restoreConsumer can be partitioned from a Kafka cluster with stale 
> metadata, while the mainConsumer is healthy with up-to-date metadata. This 
> is due to both an issue in Streams and an issue in the consumer logic.
> In StoreChangelogReader, a long-lived restoreConsumer is kept instantiated 
> while the Streams app is running. This consumer is only `poll()`ed when the 
> ChangelogReader::restore method is called and at least one changelog is in 
> the RESTORING state. This may happen very infrequently if the Streams app 
> is stable.
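> The polling pattern, sketched (illustrative only, not the actual 
> StoreChangelogReader code; field and method names are hypothetical):
> {noformat}
> // The restore consumer is poll()ed only while some changelog is RESTORING;
> // on a stable app this branch may not execute for hours or days.
> while (appIsRunning) {
>     if (!restoringChangelogs.isEmpty()) {
>         ConsumerRecords<byte[], byte[]> records =
>             restoreConsumer.poll(Duration.ZERO);
>         applyToStateStores(records);
>     }
>     // Otherwise the restoreConsumer sits idle: no poll(), and (as described
>     // below) no piggy-backed metadata refresh either.
> }{noformat}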
> This is an anti-pattern, as frequent poll()s are expected to keep Kafka 
> consumers in contact with the Kafka cluster. Infrequent polls are considered 
> failures from the perspective of the consumer API. From the [official Kafka 
> Consumer 
> documentation|https://kafka.apache.org/33/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html]:
> {noformat}
> The poll API is designed to ensure consumer liveness.
> ...
> So to stay in the group, you must continue to call poll.
> ...
> The recommended way to handle these cases [where the main thread is not ready 
> for more data] is to move message processing to another thread, which allows 
> the consumer to continue calling poll while the processor is still working.
> ...
> Note also that you will need to pause the partition so that no new records 
> are received from poll until after thread has finished handling those 
> previously returned.{noformat}
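> The pattern that documentation describes looks roughly like this (a sketch; 
> the worker-thread handoff is hypothetical):
> {noformat}
> // Hand records to a worker thread, pause() fetching, and keep poll()ing so
> // the consumer stays live in its group without receiving new records.
> ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(100));
> records.forEach(worker::enqueue);        // hypothetical worker thread
> consumer.pause(consumer.assignment());   // stop fetching new records...
> while (!worker.isDone()) {
>     consumer.poll(Duration.ofMillis(100)); // ...but keep signalling liveness
> }
> consumer.resume(consumer.paused());      // caught up: fetch again
> {noformat}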
> With the current behavior, it is expected that the restoreConsumer will 
> regularly fall out of the group and be considered failed, even when the 
> rest of the application is running exactly as intended.
> This is not normally an issue, as falling out of the group is easily 
> repaired by rejoining during the next poll. It does mean slightly higher 
> latency when performing a restore, but that does not appear to be a major 
> concern at this time.
> This does become an issue when other, deeper assumptions about the usage of 
> Kafka clients are violated. Relevant to this issue, the client metadata 
> management logic assumes that regular polling takes place and that the 
> regular poll call can be piggy-backed on to initiate a metadata update. 
> Without a regular poll, the regular metadata update cannot be performed, 
> and the consumer violates its own `metadata.max.age.ms` configuration. This 
> leaves the restoreConsumer with much older metadata containing none of the 
> currently live brokers, partitioning it from the cluster.
> Alleviating this failure mode does not _require_ the Streams polling 
> behavior to change, as solutions for all clients have been considered 
> (https://issues.apache.org/jira/browse/KAFKA-3068 and that family of 
> duplicate issues).
> However, as a tactical fix for the issue, and one which does not require a 
> KIP changing the behavior of {_}every Kafka client{_}, we should consider 
> changing the restoreConsumer poll behavior to bring it closer to the 
> expected happy path of at least one poll() every max.poll.interval.ms, as 
> sketched below.
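> One shape such a fix could take (a sketch, not a committed design; field 
> names are hypothetical):
> {noformat}
> // If the restore consumer has been idle for close to max.poll.interval.ms,
> // issue an empty poll with every partition paused: it fetches nothing but
> // piggy-backs the metadata refresh that keeps the consumer connected.
> long now = System.currentTimeMillis();
> if (now - lastRestorePollMs >= maxPollIntervalMs) {
>     restoreConsumer.pause(restoreConsumer.assignment()); // fetch no records
>     restoreConsumer.poll(Duration.ZERO);                 // drives metadata update
>     lastRestorePollMs = now;
> }
> {noformat}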
> If there are other hidden assumptions in the clients that rely on regular 
> polling, this tactical fix may also prevent users of the Streams library 
> from being affected by them, reducing their impact through 
> defense-in-depth.
> This would also be a backportable fix for Streams users, whereas a fix in 
> the consumer would only apply to new versions of the consumer.


