Hi Nicholas,

I'm not aware of any change in v3.7.0 that would cause this issue.
It would be good if you could open a JIRA for it.
Please include the following information:
1. You said "in the past"; which version of Kafka were you running then?
2. What is your broker configuration?
3. Are you running in KRaft mode? If so, is it combined mode (controller
and broker on the same node)?
4. There's not much info in the gist link. It would be great if you could
attach the broker logs for investigation.
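
On the question of how to confirm whether expired segments are still
sitting on disk, a rough check along the following lines might help
when gathering details for the ticket. This is only a sketch under
stated assumptions: `find_stale_segments` and its parameters are names
made up for illustration, and it uses file modification time as a
stand-in for segment age, whereas the broker decides based on record
timestamps. An idle partition's active segment can legitimately show up
in the output, so treat the results as hints, not proof.

```python
import time
from pathlib import Path

# Four days, the topic retention quoted in this thread.
RETENTION_MS = 345_600_000


def find_stale_segments(log_dir, retention_ms=RETENTION_MS, now=None):
    """Return (path, age_ms, size_bytes) for files older than retention.

    Assumption: file mtime approximates segment age. Kafka itself uses
    the largest record timestamp in each segment, so this is a hint only.
    """
    now = time.time() if now is None else now
    stale = []
    for path in Path(log_dir).rglob("*"):
        # ".log" files are segments; files ending in ".deleted" have been
        # renamed by the broker and are waiting for asynchronous deletion.
        if not path.is_file() or path.suffix not in (".log", ".deleted"):
            continue
        st = path.stat()
        age_ms = int((now - st.st_mtime) * 1000)
        if age_ms > retention_ms:
            stale.append((str(path), age_ms, st.st_size))
    # Oldest first, so the most suspicious files are at the top.
    return sorted(stale, key=lambda item: item[1], reverse=True)
```

Calling `find_stale_segments(...)` with one of the directories listed in
the broker's `log.dirs` setting returns the matching files; summing the
third field of each tuple gives a rough figure for how much disk the
overdue segments occupy.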

Thanks.
Luke


On Wed, May 15, 2024 at 2:46 AM Nicholas Feinberg <nicho...@liftoff.io>
wrote:

> Hello!
>
> We recently upgraded our Kafka cluster to 3.7. This cluster's topics are
> set to have four days of retention (345600000 ms).
>
> In the past, when we've temporarily lowered retention for ops, we've seen
> disk usage return to normal four days later, as expected.
>
> [image: image.png]
>
> However, after our latest round of ops, we're now seeing disk usage
> *continue* to grow on most brokers after those four days pass, despite a
> *decrease* in incoming data. This usage increased until day six.
>
> [image: kafka-ooms.png]
> On day *six* after 4d retention was restored, several brokers began to
> crash, with the following error:
>
>> # There is insufficient memory for the Java Runtime Environment to continue.
>> # Native memory allocation (mmap) failed to map 16384 bytes for committing reserved memory.
>
>
> (Details:
> https://gist.github.com/PleasingFungus/3e0cf6b58a4f3eee2171ff91b1aff42a .)
>
> These hosts had ~170GiB of free memory available. We saw no signs of
> pressure on either system or JVM heap memory before or after they reported
> this error. Committed memory seems to be around 10%, so this doesn't seem
> to be an overcommit issue.
>
> The hosts which crashed in this fashion freed large amounts of disk after
> they came back up. This returned them to the usage that we'd expect.
>
> Manually restarting Kafka on a broker likewise resulted in its disk usage
> dropping to the 4d retention level.
>
> Other brokers' disk usage seems to have stabilized.
>
> I've spent some time searching for bugs in the Jira or other posts which
> describe this behavior, but have come up empty.
>
> *Questions*:
>
>    - Has anyone else seen an issue similar to this?
>    - What are some ways that we could confirm whether Kafka is failing to
>    clear expired logs from disk?
>    - What could cause the mmap failures that we saw?
>    - Would it be helpful for us to file a Jira issue or issues for this,
>    and what details should we include if so?
>
> Cheers,
> Nicholas Feinberg
>
