Ryan Leslie created KAFKA-12256:
-----------------------------------

             Summary: auto commit causes delays due to retriable 
UNKNOWN_TOPIC_OR_PARTITION
                 Key: KAFKA-12256
                 URL: https://issues.apache.org/jira/browse/KAFKA-12256
             Project: Kafka
          Issue Type: Bug
          Components: consumer
    Affects Versions: 2.0.0
            Reporter: Ryan Leslie


In KAFKA-6829 a change was made to the consumer to internally retry commits 
upon receiving UNKNOWN_TOPIC_OR_PARTITION.

Though this helped mitigate issues around stale broker metadata, there were 
some valid concerns around the negative effects for routine topic deletion:

https://github.com/apache/kafka/pull/4948

In particular, if a commit is issued for a deleted topic, retries can block the 
consumer for up to max.poll.interval.ms. This is tunable of course, but any 
amount of stalling in a consumer can lead to unnecessary lag.

One of the assumptions while permitting the change was that in practice it 
should be rare for commits to occur for deleted topics, since that would imply 
messages were being read or published at the time of deletion. It's fair to 
expect users to not delete topics that are actively published to. But this 
assumption is false in cases where auto commit is enabled.

With the current implementation of auto commit, the consumer will regularly 
issue commits for all topics being fetched from, regardless of whether or not 
messages were actually received. The fetch positions are simply flushed, even 
when they are 0. This is simple and generally efficient, though it does mean 
commits are often redundant. Besides the auto commit interval, commits are also 
issued at the time of rebalance, which is often precisely at the time topics 
are deleted.
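
As a rough illustration of the behavior described above, the following sketch 
models two commit policies with plain maps. It is not the consumer's actual 
code: the class name, method names, and topic names are hypothetical, and 
partitions are represented as simple strings. The "only advanced" policy is an 
assumption added for contrast, not something the client does today.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model only; this is not code from the Kafka client.
final class CommitSelectionSketch {

    // Models auto commit as described in this report: every fetch position is
    // flushed, whether or not any message was received for that partition.
    static Map<String, Long> flushAll(Map<String, Long> positions) {
        return new TreeMap<>(positions);
    }

    // Alternative policy (an assumption, not existing behavior): commit only
    // the positions that advanced past the last committed offset.
    static Map<String, Long> onlyAdvanced(Map<String, Long> positions,
                                          Map<String, Long> lastCommitted) {
        Map<String, Long> toCommit = new TreeMap<>();
        for (Map.Entry<String, Long> e : positions.entrySet()) {
            long previous = lastCommitted.getOrDefault(e.getKey(), 0L);
            if (e.getValue() > previous) {
                toCommit.put(e.getKey(), e.getValue());
            }
        }
        return toCommit;
    }

    public static void main(String[] args) {
        Map<String, Long> positions = new TreeMap<>();
        positions.put("audit.orders-0", 42L);  // partition that received messages
        positions.put("audit.retired-0", 0L);  // idle topic, about to be deleted

        Map<String, Long> lastCommitted = new TreeMap<>();
        lastCommitted.put("audit.orders-0", 40L);

        System.out.println("flushAll:     " + flushAll(positions).keySet());
        System.out.println("onlyAdvanced: "
                + onlyAdvanced(positions, lastCommitted).keySet());
    }
}
```

Under the first policy the idle partition of the soon-to-be-deleted topic is 
still part of the commit request, which is what exposes the consumer to the 
retriable error.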

This means that in practice commits for deleted topics are not really rare. 
This is particularly an issue when the consumer is subscribed to a multitude of 
topics using a wildcard. For example, a consumer might subscribe to a 
particular "flavor" of topic with the aim of auditing all such data, and these 
topics might dynamically come and go. The consumer's metadata and rebalance 
mechanisms are meant to handle this gracefully, but the end result is that such 
groups are often blocked in a commit for several seconds or minutes (the 
default max.poll.interval.ms is 5 minutes) whenever a topic is deleted. This 
can sometimes result in significant lag.
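
To make the wildcard scenario concrete, here is a minimal sketch of how the 
set of matched topics shrinks when one topic is deleted. The Java consumer's 
subscribe method accepts a java.util.regex.Pattern; the pattern, topic names, 
and helper class below are hypothetical, and the sketch only exercises the 
regular expression, not a real consumer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; it only demonstrates the pattern matching involved.
final class WildcardSubscriptionSketch {

    // Returns the topics whose full name matches the pattern, in input order.
    static List<String> matching(Pattern pattern, List<String> topics) {
        List<String> matched = new ArrayList<>();
        for (String topic : topics) {
            if (pattern.matcher(topic).matches()) {
                matched.add(topic);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Assumed naming scheme: every audited topic starts with "audit.".
        Pattern flavor = Pattern.compile("audit\\..*");

        List<String> before = Arrays.asList("audit.orders", "audit.retired", "billing");
        List<String> after = Arrays.asList("audit.orders", "billing");

        System.out.println("before delete: " + matching(flavor, before));
        System.out.println("after delete:  " + matching(flavor, after));
    }
}
```

A commit issued between those two states can still carry a position for the 
deleted topic, which is the case this report is about.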

Besides having users abandon auto commit when topics may be deleted, there are 
probably multiple ways to deal with this, including reconsidering whether 
commits truly still need to be retried here, or whether this behavior should 
be more configurable, e.g. through a separate commit timeout or policy. In 
some cases the loss of a commit and the subsequent message duplication is 
still preferable to processing delays. And setting an artificially low 
max.poll.interval.ms or rebalance.timeout.ms comes with its own set of 
concerns.
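
For reference, "abandoning auto commit" amounts to a configuration change 
along these lines. This is a minimal sketch: the class and method names are 
hypothetical, only the two property keys are real consumer settings, and the 
value passed for max.poll.interval.ms is simply the documented default.

```java
import java.util.Properties;

// Hypothetical helper that builds consumer overrides; not part of Kafka.
final class ManualCommitConfigSketch {

    static Properties overrides(long maxPollIntervalMs) {
        Properties props = new Properties();
        // Turn off periodic and rebalance-time auto commits.
        props.setProperty("enable.auto.commit", "false");
        // Left at the caller's chosen value rather than lowered artificially.
        props.setProperty("max.poll.interval.ms", Long.toString(maxPollIntervalMs));
        return props;
    }

    public static void main(String[] args) {
        // 300000 ms is the default of 5 minutes mentioned in this report.
        System.out.println(overrides(300_000L));
    }
}
```

With auto commit off, the application decides when and what to commit, for 
example by calling commitSync(offsets, timeout) with its own bound, at the 
cost of handling commits itself.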

At the very least, the current behavior and the pitfalls of deleting topics 
with active consumers should be documented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
