[ https://issues.apache.org/jira/browse/KAFKA-14548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17651440#comment-17651440 ]
Matthias J. Sax commented on KAFKA-14548:
-----------------------------------------

{quote}This is an anti-pattern, as frequent poll()s are expected to keep kafka consumers in contact with the kafka cluster.{quote}
Well, not really. Note that the JavaDoc you quote is about a consumer that is part of a consumer group. However, the restore consumer is a stand-alone consumer that is not part of any group, so periodic polling is not necessary: there is no consumer group, group management, or heartbeating.

{quote}Without a regular poll, the regular metadata update cannot be performed, and the consumer violates its own `metadata.max.age.ms` configuration. This leads to the restoreConsumer having a much older metadata containing none of the currently live brokers, partitioning it from the cluster.{quote}
I am not an expert on the consumer, but I would expect the restore consumer to refresh its metadata the next time we use it (for any API call, not just poll()) if its cached metadata has aged out after a longer pause. Thus, as long as its bootstrap servers are reachable, it should be able to refresh its metadata. Back in the day I filed a follow-up ticket for the clients about cached IPs: https://issues.apache.org/jira/browse/KAFKA-13467

So far, I still think there is nothing we can (i.e., should) do in Streams: if it's a client issue, we should not put a workaround into Streams to mask the client issue but rather fix the client. Thoughts?

> Stable streams applications stall due to infrequent restoreConsumer polls
> -------------------------------------------------------------------------
>
> Key: KAFKA-14548
> URL: https://issues.apache.org/jira/browse/KAFKA-14548
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Reporter: Greg Harris
> Priority: Major
>
> We have observed behavior with Streams where otherwise healthy applications stall and become unable to process data after a rebalance (https://issues.apache.org/jira/browse/KAFKA-13405).
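The distinction above can be made concrete with a minimal configuration sketch. The property names (`bootstrap.servers`, `group.id`, `metadata.max.age.ms`) are real Kafka consumer configs; the broker address and class name are illustrative placeholders. A stand-alone consumer like the restore consumer sets no `group.id` and uses assign() rather than subscribe(), so no group liveness is at stake, while `metadata.max.age.ms` (Kafka's default is 300000 ms, i.e. 5 minutes) still governs how old its cached metadata may get:

```java
import java.util.Properties;

public class RestoreConsumerConfigSketch {
    // Builds the kind of configuration a stand-alone (group-less) consumer uses.
    public static Properties restoreConsumerProps() {
        Properties props = new Properties();
        // Placeholder broker list; any reachable bootstrap server works.
        props.setProperty("bootstrap.servers", "localhost:9092");
        // Note: no "group.id" is set. A stand-alone consumer uses assign()
        // instead of subscribe(), so there is no group membership, no
        // rebalancing, and no heartbeating to maintain via poll().
        // Metadata is still expected to be refreshed at least every
        // metadata.max.age.ms (shown here at Kafka's default of 5 minutes).
        props.setProperty("metadata.max.age.ms", "300000");
        return props;
    }

    public static void main(String[] args) {
        Properties props = restoreConsumerProps();
        System.out.println("group.id set: " + props.containsKey("group.id"));
        System.out.println("metadata.max.age.ms: " + props.getProperty("metadata.max.age.ms"));
    }
}
```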
> The root cause is that a restoreConsumer can be partitioned from a Kafka cluster with stale metadata, while the mainConsumer is healthy with up-to-date metadata. This is due to both an issue in Streams and an issue in the consumer logic.
> In StoreChangelogReader, a long-lived restoreConsumer is kept instantiated while the streams app is running. This consumer is only poll()ed when the ChangelogReader::restore method is called and at least one changelog is in the RESTORING state. This may be very infrequent if the streams app is stable.
> This is an anti-pattern, as frequent poll()s are expected to keep Kafka consumers in contact with the Kafka cluster. Infrequent polls are considered failures from the perspective of the consumer API. From the [official Kafka Consumer documentation|https://kafka.apache.org/33/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html]:
> {noformat}
> The poll API is designed to ensure consumer liveness.
> ...
> So to stay in the group, you must continue to call poll.
> ...
> The recommended way to handle these cases [where the main thread is not ready for more data] is to move message processing to another thread, which allows the consumer to continue calling poll while the processor is still working.
> ...
> Note also that you will need to pause the partition so that no new records are received from poll until after the thread has finished handling those previously returned.{noformat}
> With the current behavior, it is expected that the restoreConsumer will regularly fall out of the group and be considered failed, even when the rest of the application is running exactly as intended.
> This is not normally an issue, as falling out of the group is easily repaired by rejoining the group during the next poll. It does mean slightly higher latency when performing a restore, but that does not appear to be a major concern at this time.
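The liveness rule quoted above reduces to a simple check: a consumer that has not called poll() within `max.poll.interval.ms` (Kafka's default is 300000 ms) is considered failed. A minimal sketch of that check, with illustrative names rather than the client's actual internals:

```java
public class PollLiveness {
    // Kafka's default max.poll.interval.ms is 300000 ms (5 minutes).
    static final long DEFAULT_MAX_POLL_INTERVAL_MS = 300_000L;

    // Returns true when the consumer would be treated as failed because
    // poll() has not been called within the allowed interval.
    static boolean pollOverdue(long lastPollMs, long nowMs, long maxPollIntervalMs) {
        return nowMs - lastPollMs > maxPollIntervalMs;
    }

    public static void main(String[] args) {
        // A restoreConsumer idle for an hour is long past the deadline.
        System.out.println(pollOverdue(0L, 3_600_000L, DEFAULT_MAX_POLL_INTERVAL_MS)); // true
        // A consumer polled 10 seconds ago is fine.
        System.out.println(pollOverdue(0L, 10_000L, DEFAULT_MAX_POLL_INTERVAL_MS)); // false
    }
}
```

This also shows why the failure is invisible during restores: every restore() pass that polls resets `lastPollMs`, so a stable app with nothing to restore is exactly the case that trips the check.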
> This does become an issue when other, deeper assumptions about the usage of Kafka clients are violated. Relevant to this issue, the client metadata management logic assumes that regular polling will take place, and that the regular poll call can be piggy-backed to initiate a metadata update. Without a regular poll, the metadata update cannot be performed, and the consumer violates its own `metadata.max.age.ms` configuration. This leads to the restoreConsumer holding much older metadata containing none of the currently live brokers, partitioning it from the cluster.
> Alleviating this failure mode does not _require_ the streams polling behavior to change, as solutions for all clients have been considered (https://issues.apache.org/jira/browse/KAFKA-3068 and that family of duplicate issues).
> However, as a tactical fix for the issue, and one which does not require a KIP changing the behavior of {_}every kafka client{_}, we should consider changing the restoreConsumer poll behavior to bring it closer to the expected happy path of at least one poll() every poll.interval.ms.
> If there is another hidden assumption in the clients that relies on regular polling, then this tactical fix may prevent users of the streams library from being affected, reducing the impact of that hidden assumption through defense-in-depth.
> This would also be a backport-able fix for streams users, whereas a fix to the consumers would only apply to new versions of the consumers.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
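The tactical fix proposed in the description can be sketched as a small keep-alive tracker that StoreChangelogReader could consult on every restore() pass, issuing a no-op poll when the restore consumer has been idle too long. This is a hypothetical sketch under the assumptions stated in the comments; the class and method names are illustrative, not actual Kafka Streams internals:

```java
import java.util.function.LongSupplier;

// Hypothetical sketch of the proposed tactical fix: ensure the restore
// consumer is poll()ed at least once per interval even when no changelog
// is in the RESTORING state, so its metadata stays fresh.
public class RestoreKeepAlive {
    private final long maxIdleMs;
    private final LongSupplier clock; // injected clock, in milliseconds
    private long lastPollMs;

    RestoreKeepAlive(long maxIdleMs, LongSupplier clock) {
        this.maxIdleMs = maxIdleMs;
        this.clock = clock;
        this.lastPollMs = clock.getAsLong();
    }

    // Called on every restore() pass; returns true when a keep-alive poll
    // should be issued despite there being nothing to restore.
    boolean shouldKeepAlivePoll() {
        return clock.getAsLong() - lastPollMs >= maxIdleMs;
    }

    // Record that a poll (restoring or keep-alive) actually happened.
    void recordPoll() {
        lastPollMs = clock.getAsLong();
    }

    public static void main(String[] args) {
        long[] now = {0L};
        RestoreKeepAlive keepAlive = new RestoreKeepAlive(300_000L, () -> now[0]);
        now[0] = 100_000L; // 100 s idle: under the interval, nothing to do
        System.out.println(keepAlive.shouldKeepAlivePoll()); // false
        now[0] = 400_000L; // past the interval: poll now
        System.out.println(keepAlive.shouldKeepAlivePoll()); // true
        keepAlive.recordPoll();
        System.out.println(keepAlive.shouldKeepAlivePoll()); // false
    }
}
```

Because the tracker resets on every real restore poll, it only fires for the stable, nothing-to-restore case that the ticket describes, which is also what makes it backport-friendly: it changes no client behavior, only when Streams calls poll().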