Nicholas Feinberg created KAFKA-16779:
-----------------------------------------
Summary: Kafka retains logs past specified retention
Key: KAFKA-16779
URL: https://issues.apache.org/jira/browse/KAFKA-16779
Project: Kafka
Issue Type: Bug
Affects Versions: 3.7.0
Reporter: Nicholas Feinberg
Attachments: OOM.txt, kafka-20240512.log.gz, kafka-20240514.log.gz,
kafka-ooms.png, server.log.2024-05-12.gz, server.log.2024-05-14.gz,
state-change.log.2024-05-12.gz, state-change.log.2024-05-14.gz
In a Kafka cluster where every topic is configured with four days of retention or
longer (retention.ms=345600000), most brokers seem to be retaining six days of data.
This is true even for topics with high throughput (500 MB/s, 50k msgs/s), which
therefore roll new log segments regularly. We observe this unexpectedly high
retention both via disk usage statistics and by requesting the oldest available
messages from Kafka.
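For reference, this is roughly how we check the oldest available messages: the stock tooling can report the earliest retained offset per partition. The topic name and bootstrap address below are placeholders, and the commands assume a Kafka 3.x installation's bin directory is on the PATH; they are a sketch, not our exact invocation.

```shell
# Earliest (oldest) retained offsets per partition; --time -2 means "earliest".
kafka-get-offsets.sh --bootstrap-server localhost:9092 \
  --topic example-topic --time -2

# Latest offsets for comparison; --time -1 means "latest".
kafka-get-offsets.sh --bootstrap-server localhost:9092 \
  --topic example-topic --time -1
```

Comparing the timestamp of the earliest retained record against retention.ms is how we confirmed brokers were holding roughly six days of data rather than four.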
Some of these brokers crashed with an 'mmap failed' error (attached). When
those brokers started up again, they returned to the expected four days of
retention.
Manually restarting a broker also seems to return it to four days of retention.
Demoting and then re-promoting a broker restores the expected retention for only
a small portion of the data it hosts.
These hosts had ~170GiB of free memory available. We saw no signs of pressure
on either system or JVM heap memory before or after they reported this error.
Committed memory seems to be around 10%, so this doesn't seem to be an
overcommit issue.
This Kafka cluster was upgraded to Kafka 3.7 two weeks ago (April 29th). Prior
to the upgrade, it was running on Kafka 2.4.
We last reduced retention temporarily for operational reasons on May 7th, after
which we restored it to our default of four days. This was the second time we had
temporarily reduced and restored retention since the upgrade. The problem did not
manifest the previous time we did so, nor has it manifested on our other Kafka
3.7 clusters.
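The reduce-and-restore step above was done with the standard dynamic topic config tooling; a minimal sketch follows, with placeholder topic name, bootstrap address, and temporary retention value (our actual values differed):

```shell
# Temporarily lower retention on a topic (placeholder value: one day).
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name example-topic \
  --alter --add-config retention.ms=86400000

# Restore the default of four days (345600000 ms).
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name example-topic \
  --alter --add-config retention.ms=345600000

# Verify the effective topic config.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name example-topic --describe
```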
We are running on AWS
[d3en.12xlarge|https://instances.vantage.sh/aws/ec2/d3en.12xlarge] hosts. We
have 23 brokers, each with 24 disks, running in a JBOD configuration
(i.e., no RAID).
Since this cluster was upgraded from Kafka 2.4 and since we're using JBOD,
we're still using Zookeeper.
Sample broker logs are attached. The 05-12 and 05-14 logs are from separate
hosts. Please let me know if I can provide any further information.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)