Greg Harris created KAFKA-14548:
-----------------------------------

             Summary: Stable streams applications stall due to infrequent 
restoreConsumer polls
                 Key: KAFKA-14548
                 URL: https://issues.apache.org/jira/browse/KAFKA-14548
             Project: Kafka
          Issue Type: Bug
          Components: streams
            Reporter: Greg Harris


We have observed behavior with Streams where otherwise healthy applications 
stall and become unable to process data after a rebalance. The root cause is 
that a restoreConsumer can become partitioned from the Kafka cluster by its 
stale metadata, while the mainConsumer stays healthy with up-to-date metadata. 
This is caused by both an issue in streams and an issue in the consumer logic.

In StoreChangelogReader, a long-lived restoreConsumer is kept instantiated 
while the streams app is running. This consumer is only `poll()`ed when the 
ChangelogReader::restore method is called and at least one changelog is in the 
RESTORING state. This may be very infrequent if the streams app is stable.
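
To make the shape of the problem concrete, here is a heavily simplified, hypothetical sketch of that restore path. The class, field, and method names below are illustrative only and are not the actual StoreChangelogReader code:
{code:java}
import java.time.Duration;
import java.util.Set;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.common.TopicPartition;

// Hypothetical sketch; not the actual StoreChangelogReader code.
class ChangelogReaderSketch {
    private final Consumer<byte[], byte[]> restoreConsumer; // long-lived, kept for the app's lifetime
    private final Set<TopicPartition> restoringChangelogs;  // empty whenever the app is stable

    ChangelogReaderSketch(Consumer<byte[], byte[]> restoreConsumer,
                          Set<TopicPartition> restoringChangelogs) {
        this.restoreConsumer = restoreConsumer;
        this.restoringChangelogs = restoringChangelogs;
    }

    void restore() {
        // With no changelog in the RESTORING state the method returns without
        // polling, so on a stable app the restoreConsumer can go un-polled for
        // an arbitrarily long time.
        if (restoringChangelogs.isEmpty()) {
            return;
        }
        ConsumerRecords<byte[], byte[]> records = restoreConsumer.poll(Duration.ofMillis(100));
        // ... apply the polled records to the state stores ...
    }
}
{code}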

This is an anti-pattern, as frequent poll()s are expected to keep Kafka 
consumers in contact with the Kafka cluster. Infrequent polls are considered 
failures from the perspective of the consumer API. From the [official Kafka 
Consumer 
documentation|https://kafka.apache.org/33/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html]:
{noformat}
The poll API is designed to ensure consumer liveness.
...
So to stay in the group, you must continue to call poll.
...
The recommended way to handle these cases [where the main thread is not ready 
for more data] is to move message processing to another thread, which allows 
the consumer to continue calling poll while the processor is still working.
...
Note also that you will need to pause the partition so that no new records are 
received from poll until after thread has finished handling those previously 
returned.{noformat}
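
For contrast, here is a minimal, self-contained sketch of the pattern the documentation describes: the consumer thread keeps calling poll() for liveness while a worker thread handles the previously returned records, with the partitions paused in the meantime. This is a generic consumer example rather than Streams code, and the broker address, group id, and topic name are placeholders:
{code:java}
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class PausedPollExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder connection settings for illustration only.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        ExecutorService processor = Executors.newSingleThreadExecutor();
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("example-topic"));
            Future<?> inFlight = null;
            while (true) {
                // poll() runs on every iteration, so the consumer stays live
                // even while a previous batch is still being processed.
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(100));
                if (!records.isEmpty()) {
                    // Hand the batch to a worker thread and pause the assigned
                    // partitions so poll() returns no new records meanwhile.
                    consumer.pause(consumer.assignment());
                    inFlight = processor.submit(() -> process(records));
                }
                if (inFlight != null && inFlight.isDone()) {
                    consumer.resume(consumer.assignment());
                    inFlight = null;
                }
            }
        }
    }

    private static void process(ConsumerRecords<byte[], byte[]> records) {
        // Stand-in for slow message processing.
    }
}
{code}
The relevant point is only that poll() keeps being called on every iteration, regardless of how long the processing of the previous batch takes.
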
With the current behavior, it is expected that the restoreConsumer will 
regularly fall out of the group and be considered failed, even though the rest 
of the application is running exactly as intended.

This is not normally an issue, as falling out of the group is easily repaired 
by rejoining the group during the next poll. It does mean slightly higher 
latency when performing a restore, but that does not appear to be a major 
concern at this time.

This does become an issue when other, deeper assumptions about the usage of 
Kafka clients are violated. Relevant here, the client metadata management logic 
assumes that regular polling will take place, and that the regular poll call 
can be piggy-backed to initiate a metadata update. Without regular polls, the 
periodic metadata update cannot be performed, and the consumer violates its own 
`metadata.max.age.ms` configuration. This leaves the restoreConsumer with much 
older metadata that may contain none of the currently live brokers, 
partitioning it from the cluster.
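
For reference, `metadata.max.age.ms` is an ordinary consumer configuration; a minimal illustration of where it is set (the broker address is a placeholder and the value shown is the default):
{code:java}
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class MetadataAgeConfigExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");
        // metadata.max.age.ms defaults to 300000 (5 minutes). The refresh it
        // requests is only carried out while the client is doing network I/O
        // on behalf of a poll(), so a consumer that is never polled keeps its
        // stale metadata no matter how small this value is.
        props.put(ConsumerConfig.METADATA_MAX_AGE_CONFIG, "300000");
        System.out.println(props);
    }
}
{code}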

Alleviating this failure mode does not _require_ the Streams polling behavior 
to change, as solutions covering all clients have been considered 
(https://issues.apache.org/jira/browse/KAFKA-3068 and its family of duplicate 
issues).

However, as a tactical fix for the issue, and one which does not require a KIP 
changing the behavior of {_}every Kafka client{_}, we should consider changing 
the restoreConsumer poll behavior to bring it closer to the expected happy path 
of at least one poll() every max.poll.interval.ms.
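
Purely as a hypothetical sketch of what such a change could look like (an assumption for illustration, not an actual patch; the class and method names are invented), the changelog reader could poll the restoreConsumer from its normal maintenance path even when nothing is restoring:
{code:java}
import java.time.Duration;
import org.apache.kafka.clients.consumer.Consumer;

// Hypothetical keep-alive helper; not the actual fix.
class RestoreConsumerKeepAlive {
    private final Consumer<byte[], byte[]> restoreConsumer;
    private final long maxPollIntervalMs;
    private long lastPollMs;

    RestoreConsumerKeepAlive(Consumer<byte[], byte[]> restoreConsumer, long maxPollIntervalMs) {
        this.restoreConsumer = restoreConsumer;
        this.maxPollIntervalMs = maxPollIntervalMs;
    }

    // Invoked from the regular maintenance path, whether or not a restore is active.
    void maybePoll(long nowMs) {
        // poll() requires a subscription or assignment, so an empty assignment
        // would need separate handling in a real fix.
        if (restoreConsumer.assignment().isEmpty()) {
            return;
        }
        if (nowMs - lastPollMs >= maxPollIntervalMs / 2) {
            // A zero-timeout poll returns promptly but still lets the client
            // run its routine maintenance, including metadata updates.
            restoreConsumer.poll(Duration.ZERO);
            lastPollMs = nowMs;
        }
    }
}
{code}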

If there are other hidden assumptions in the clients that rely on regular 
polling, this tactical fix may also prevent users of the streams library from 
being affected by them, reducing their impact through defense-in-depth.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
