Greg Harris created KAFKA-14548:
-----------------------------------

             Summary: Stable streams applications stall due to infrequent 
restoreConsumer polls
                 Key: KAFKA-14548
                 URL: https://issues.apache.org/jira/browse/KAFKA-14548
             Project: Kafka
          Issue Type: Bug
          Components: streams
            Reporter: Greg Harris


We have observed behavior with Streams where otherwise healthy applications 
stall and become unable to process data after a rebalance. The root cause is 
that a restoreConsumer can become partitioned from the Kafka cluster by its 
stale metadata, while the mainConsumer stays healthy with up-to-date metadata. 
This is caused by both an issue in streams and an issue in the consumer logic.

In StoreChangelogReader, a long-lived restoreConsumer is kept instantiated 
while the streams app is running. This consumer is only `poll()`ed when the 
ChangelogReader::restore method is called and at least one changelog is in the 
RESTORING state. This may be very infrequent if the streams app is stable.
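
To make the shape of the problem concrete, here is a heavily simplified, hypothetical sketch of that restore path. The class, field, and method names below are illustrative only and are not the actual StoreChangelogReader code:
{code:java}
import java.time.Duration;
import java.util.Set;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.common.TopicPartition;

// Hypothetical sketch; not the actual StoreChangelogReader code.
class ChangelogReaderSketch {
    private final Consumer<byte[], byte[]> restoreConsumer; // long-lived, kept for the app's lifetime
    private final Set<TopicPartition> restoringChangelogs;  // empty whenever the app is stable

    ChangelogReaderSketch(Consumer<byte[], byte[]> restoreConsumer,
                          Set<TopicPartition> restoringChangelogs) {
        this.restoreConsumer = restoreConsumer;
        this.restoringChangelogs = restoringChangelogs;
    }

    void restore() {
        // With no changelog in the RESTORING state the method returns without
        // polling, so on a stable app the restoreConsumer can go un-polled for
        // an arbitrarily long time.
        if (restoringChangelogs.isEmpty()) {
            return;
        }
        ConsumerRecords<byte[], byte[]> records = restoreConsumer.poll(Duration.ofMillis(100));
        // ... apply the polled records to the state stores ...
    }
}
{code}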

This is an anti-pattern, as frequent poll()s are expected to keep Kafka 
consumers in contact with the Kafka cluster. Infrequent polls are considered 
failures from the perspective of the consumer API. From the [official Kafka 
Consumer 
documentation|https://kafka.apache.org/33/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html]:
{noformat}
The poll API is designed to ensure consumer liveness.
...
So to stay in the group, you must continue to call poll.
...
The recommended way to handle these cases [where the main thread is not ready 
for more data] is to move message processing to another thread, which allows 
the consumer to continue calling poll while the processor is still working.
...
Note also that you will need to pause the partition so that no new records are 
received from poll until after thread has finished handling those previously 
returned.{noformat}
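
For contrast, here is a minimal, self-contained sketch of the pattern the documentation describes: the consumer thread keeps calling poll() for liveness while a worker thread handles the previously returned records, with the partitions paused in the meantime. This is a generic consumer example rather than Streams code, and the broker address, group id, and topic name are placeholders:
{code:java}
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class PausedPollExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder connection settings for illustration only.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        ExecutorService processor = Executors.newSingleThreadExecutor();
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("example-topic"));
            Future<?> inFlight = null;
            while (true) {
                // poll() runs on every iteration, so the consumer stays live
                // even while a previous batch is still being processed.
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(100));
                if (!records.isEmpty()) {
                    // Hand the batch to a worker thread and pause the assigned
                    // partitions so poll() returns no new records meanwhile.
                    consumer.pause(consumer.assignment());
                    inFlight = processor.submit(() -> process(records));
                }
                if (inFlight != null && inFlight.isDone()) {
                    consumer.resume(consumer.assignment());
                    inFlight = null;
                }
            }
        }
    }

    private static void process(ConsumerRecords<byte[], byte[]> records) {
        // Stand-in for slow message processing.
    }
}
{code}
The relevant point is only that poll() keeps being called on every iteration, regardless of how long the processing of the previous batch takes.
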
With the current behavior, it is expected that the restoreConsumer will 
regularly fall out of the group and be considered failed, even though the rest 
of the application is running exactly as intended.

This is not normally an issue, as falling out of the group is easily repaired 
by rejoining the group during the next poll. It does mean slightly higher 
latency when performing a restore, but that does not appear to be a major 
concern at this time.

This does become an issue when other, deeper assumptions about the usage of 
Kafka clients are violated. Relevant here, the client metadata management logic 
assumes that regular polling will take place, and that the regular poll call 
can be piggy-backed to initiate a metadata update. Without regular polls, the 
periodic metadata update cannot be performed, and the consumer violates its own 
`metadata.max.age.ms` configuration. This leaves the restoreConsumer with much 
older metadata that may contain none of the currently live brokers, 
partitioning it from the cluster.
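
For reference, `metadata.max.age.ms` is an ordinary consumer configuration; a minimal illustration of where it is set (the broker address is a placeholder and the value shown is the default):
{code:java}
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class MetadataAgeConfigExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");
        // metadata.max.age.ms defaults to 300000 (5 minutes). The refresh it
        // requests is only carried out while the client is doing network I/O
        // on behalf of a poll(), so a consumer that is never polled keeps its
        // stale metadata no matter how small this value is.
        props.put(ConsumerConfig.METADATA_MAX_AGE_CONFIG, "300000");
        System.out.println(props);
    }
}
{code}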

Alleviating this failure mode does not _require_ the Streams polling behavior 
to change, as solutions covering all clients have been considered 
(https://issues.apache.org/jira/browse/KAFKA-3068 and its family of duplicate 
issues).

However, as a tactical fix for the issue, and one which does not require a KIP 
changing the behavior of {_}every Kafka client{_}, we should consider changing 
the restoreConsumer poll behavior to bring it closer to the expected happy path 
of at least one poll() every max.poll.interval.ms.
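
Purely as a hypothetical sketch of what such a change could look like (an assumption for illustration, not an actual patch; the class and method names are invented), the changelog reader could poll the restoreConsumer from its normal maintenance path even when nothing is restoring:
{code:java}
import java.time.Duration;
import org.apache.kafka.clients.consumer.Consumer;

// Hypothetical keep-alive helper; not the actual fix.
class RestoreConsumerKeepAlive {
    private final Consumer<byte[], byte[]> restoreConsumer;
    private final long maxPollIntervalMs;
    private long lastPollMs;

    RestoreConsumerKeepAlive(Consumer<byte[], byte[]> restoreConsumer, long maxPollIntervalMs) {
        this.restoreConsumer = restoreConsumer;
        this.maxPollIntervalMs = maxPollIntervalMs;
    }

    // Invoked from the regular maintenance path, whether or not a restore is active.
    void maybePoll(long nowMs) {
        // poll() requires a subscription or assignment, so an empty assignment
        // would need separate handling in a real fix.
        if (restoreConsumer.assignment().isEmpty()) {
            return;
        }
        if (nowMs - lastPollMs >= maxPollIntervalMs / 2) {
            // A zero-timeout poll returns promptly but still lets the client
            // run its routine maintenance, including metadata updates.
            restoreConsumer.poll(Duration.ZERO);
            lastPollMs = nowMs;
        }
    }
}
{code}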

If there are other hidden assumptions in the clients that rely on regular 
polling, this tactical fix may also prevent users of the streams library from 
being affected by them, reducing their impact through defense-in-depth.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
