Ryan Leslie created KAFKA-12256:
-----------------------------------
Summary: auto commit causes delays due to retriable UNKNOWN_TOPIC_OR_PARTITION
Key: KAFKA-12256
URL: https://issues.apache.org/jira/browse/KAFKA-12256
Project: Kafka
Issue Type: Bug
Components: consumer
Affects Versions: 2.0.0
Reporter: Ryan Leslie
In KAFKA-6829 a change was made to the consumer to internally retry commits
upon receiving UNKNOWN_TOPIC_OR_PARTITION.
Though this helped mitigate issues around stale broker metadata, there were
some valid concerns around the negative effects for routine topic deletion:
https://github.com/apache/kafka/pull/4948
In particular, if a commit is issued for a deleted topic, retries can block the
consumer for up to max.poll.interval.ms. This is tunable of course, but any
amount of stalling in a consumer can lead to unnecessary lag.
One of the assumptions made when accepting the change was that in practice it
should be rare for commits to occur for deleted topics, since that would imply
messages were being read or published at the time of deletion. It's fair to
expect users not to delete topics that are actively published to. However, this
assumption does not hold when auto commit is enabled.
With the current implementation of auto commit, the consumer will regularly
issue commits for all topics being fetched from, regardless of whether or not
messages were actually received. The fetch positions are simply flushed, even
when they are 0. This is simple and generally efficient, though it does mean
commits are often redundant. Besides the auto commit interval, commits are also
issued at the time of rebalance, which is often precisely at the time topics
are deleted.
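
To make the mechanism concrete, here is a minimal toy model of the behavior described above. It is not Kafka client code: the class, method names, and topic names are hypothetical, and it only assumes what this report states, namely that auto commit flushes every tracked fetch position whether or not records were received.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Toy model of the auto-commit behavior described above; not Kafka client code. */
public class AutoCommitSketch {

    /**
     * Builds the offsets an auto commit would send: every tracked position,
     * whether or not any records were fetched since the last commit.
     */
    public static Map<String, Long> offsetsToCommit(Map<String, Long> fetchPositions) {
        return new LinkedHashMap<>(fetchPositions);
    }

    /** Returns true if the commit includes a partition of a topic that no longer exists. */
    public static boolean hitsDeletedTopic(Map<String, Long> offsets, Set<String> existingTopics) {
        for (String partition : offsets.keySet()) {
            // Keys use the hypothetical form "<topic>-<partition number>".
            String topic = partition.substring(0, partition.lastIndexOf('-'));
            if (!existingTopics.contains(topic)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Long> positions = new LinkedHashMap<>();
        positions.put("audit.orders-0", 1500L);
        positions.put("audit.tmp-0", 0L); // idle topic, nothing was ever consumed

        // "audit.tmp" has just been deleted, but its position is still tracked.
        Set<String> existing = Collections.singleton("audit.orders");

        Map<String, Long> commit = offsetsToCommit(positions);
        System.out.println("commit includes idle partition: " + commit.containsKey("audit.tmp-0"));
        System.out.println("commit hits deleted topic: " + hitsDeletedTopic(commit, existing));
    }
}
```

In this model the idle partition is committed at position 0 just like the active one, so deleting an idle topic is enough for the next commit to reference a topic that no longer exists.
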
This means that in practice commits for deleted topics are not really rare.
This is particularly an issue when the consumer is subscribed to a multitude of
topics using a wildcard. For example, a consumer might subscribe to a
particular "flavor" of topic with the aim of auditing all such data, and these
topics might dynamically come and go. The consumer's metadata and rebalance
mechanisms are meant to handle this gracefully, but the end result is that such
groups are often blocked in a commit for several seconds or minutes (the
default max.poll.interval.ms is 5 minutes) whenever a delete occurs. This can
sometimes result in
significant lag.
Besides having users abandon auto commit in the face of topic deletes, there
are probably multiple ways to deal with this, including reconsidering if
commits still truly need to be retried here, or if this behavior should be more
configurable; e.g. having a separate commit timeout or policy. In some cases
the loss of a commit and subsequent message duplication is still preferred to
processing delays. And having an artificially low max.poll.interval.ms or
rebalance.timeout.ms comes with its own set of concerns.
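
For reference, the "abandon auto commit" workaround might be configured roughly as follows. This is only a sketch: the class name, bootstrap address, and group id are hypothetical placeholders, and it uses plain java.util.Properties with the standard consumer config keys so that it runs without the Kafka client on the classpath.

```java
import java.util.Properties;

/** Sketch of a consumer configuration that avoids auto commit; values are illustrative. */
public class ManualCommitConfigSketch {

    public static Properties manualCommitProps(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        // Turn off the periodic and rebalance-time auto commits.
        props.setProperty("enable.auto.commit", "false");
        // Keep max.poll.interval.ms at its default (5 minutes) rather than
        // lowering it artificially just to shorten commit retries.
        props.setProperty("max.poll.interval.ms", "300000");
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical address and group id, for illustration only.
        Properties props = manualCommitProps("localhost:9092", "audit-group");
        props.list(System.out);
    }
}
```

With auto commit disabled, the application has to commit explicitly. Clients from 2.0 onward offer a commitSync overload that takes a Duration, which should bound how long a single commit can keep retrying, and commitAsync, which does not block; whether occasional lost commits are acceptable depends on the application.
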
At the very least, the current behavior and the pitfalls around deleting topics
with active consumers should be documented.