[jira] [Comment Edited] (KAFKA-16779) Kafka retains logs past specified retention

2024-05-21 Thread Nicholas Feinberg (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847997#comment-17847997
 ] 

Nicholas Feinberg edited comment on KAFKA-16779 at 5/21/24 9:41 PM:


When we explicitly set topics' retention to 4d (345600000ms), our brokers 
immediately expired the surprisingly old logs. When we removed the setting, 
they began accumulating old logs again.

I've confirmed that the same setting is present in the brokers' 
`server.properties` file - that is, they have `log.retention.hours=96`. I've 
also checked and confirmed that topics do not have an explicitly set retention 
that would override this.
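
For anyone who wants to repeat that check, here's a rough sketch using the Java 
AdminClient (the bootstrap address and topic name are placeholders, not our real 
ones):

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collections;
import java.util.Properties;

public class CheckTopicRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic =
                new ConfigResource(ConfigResource.Type.TOPIC, "example-topic"); // placeholder
            Config config = admin.describeConfigs(Collections.singleton(topic))
                                 .all().get().get(topic);

            ConfigEntry retention = config.get("retention.ms");
            // source() distinguishes a per-topic override (DYNAMIC_TOPIC_CONFIG)
            // from a value inherited from server.properties or the built-in default.
            System.out.printf("retention.ms=%s (source=%s)%n",
                              retention.value(), retention.source());
        }
    }
}
```

A topic with no explicit override should report a source of STATIC_BROKER_CONFIG 
or DEFAULT_CONFIG rather than DYNAMIC_TOPIC_CONFIG; the same check can also be 
done with the `kafka-configs.sh` tool that ships with the broker.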


was (Author: nfeinberg):
When we explicitly set topics' retention to 4d (345600000ms), our brokers 
immediately expired the surprisingly old logs.

I've confirmed that the same setting is present in the brokers' 
`server.properties` file - that is, they have `log.retention.hours=96`. I've 
also checked and confirmed that topics do not have an explicitly set retention 
that would override this.

> Kafka retains logs past specified retention
> ---
>
> Key: KAFKA-16779
> URL: https://issues.apache.org/jira/browse/KAFKA-16779
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Nicholas Feinberg
>Priority: Major
>  Labels: expiration, retention
> Attachments: OOM.txt, kafka-20240512.log.gz, kafka-20240514.log.gz, 
> kafka-ooms.png, server.log.2024-05-12.gz, server.log.2024-05-14.gz, 
> state-change.log.2024-05-12.gz, state-change.log.2024-05-14.gz
>
>
> In a Kafka cluster with all topics set to four days of retention or longer 
> (345600000ms), most brokers seem to be retaining six days of data.
> This is true even for topics which have high throughput (500MB/s, 50k msgs/s) 
> and thus are regularly rolling new log segments. We observe this unexpectedly 
> high retention both via disk usage statistics and by requesting the oldest 
> available messages from Kafka.
> Some of these brokers crashed with an 'mmap failed' error (attached). When 
> those brokers started up again, they returned to the expected four days of 
> retention.
> Manually restarting brokers also seems to cause them to return to four days 
> of retention. Demoting and promoting brokers only has this effect on a small 
> part of the data hosted on a broker.
> These hosts had ~170GiB of free memory available. We saw no signs of pressure 
> on either system or JVM heap memory before or after they reported this error. 
> Committed memory seems to be around 10%, so this doesn't seem to be an 
> overcommit issue.
> This Kafka cluster was upgraded to Kafka 3.7 two weeks ago (April 29th). 
> Prior to the upgrade, it was running on Kafka 2.4.
> We last reduced retention for ops on May 7th, after which we restored 
> retention to our default of four days. This was the second time we've 
> temporarily reduced and restored retention since the upgrade. This problem 
> did not manifest the previous time we did so, nor did it manifest on our 
> other Kafka 3.7 clusters.
> We are running on AWS 
> [d3en.12xlarge|https://instances.vantage.sh/aws/ec2/d3en.12xlarge] hosts. We 
> have 23 brokers, each with 24 disks. We're running in a JBOD configuration 
> (i.e. unraided).
> Since this cluster was upgraded from Kafka 2.4 and since we're using JBOD, 
> we're still using Zookeeper.
> Sample broker logs are attached. The 05-12 and 05-14 logs are from separate 
> hosts. Please let me know if I can provide any further information.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16779) Kafka retains logs past specified retention

2024-05-20 Thread Nicholas Feinberg (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847997#comment-17847997
 ] 

Nicholas Feinberg commented on KAFKA-16779:
---

When we explicitly set topics' retention to 4d (345600000ms), our brokers 
immediately expired the surprisingly old logs.

I've confirmed that the same setting is present in the brokers' 
`server.properties` file - that is, they have `log.retention.hours=96`. I've 
also checked and confirmed that topics do not have an explicitly set retention 
that would override this.

> Kafka retains logs past specified retention
> ---
>
> Key: KAFKA-16779
> URL: https://issues.apache.org/jira/browse/KAFKA-16779
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Nicholas Feinberg
>Priority: Major
>  Labels: expiration, retention
> Attachments: OOM.txt, kafka-20240512.log.gz, kafka-20240514.log.gz, 
> kafka-ooms.png, server.log.2024-05-12.gz, server.log.2024-05-14.gz, 
> state-change.log.2024-05-12.gz, state-change.log.2024-05-14.gz
>
>
> In a Kafka cluster with all topics set to four days of retention or longer 
> (345600000ms), most brokers seem to be retaining six days of data.
> This is true even for topics which have high throughput (500MB/s, 50k msgs/s) 
> and thus are regularly rolling new log segments. We observe this unexpectedly 
> high retention both via disk usage statistics and by requesting the oldest 
> available messages from Kafka.
> Some of these brokers crashed with an 'mmap failed' error (attached). When 
> those brokers started up again, they returned to the expected four days of 
> retention.
> Manually restarting brokers also seems to cause them to return to four days 
> of retention. Demoting and promoting brokers only has this effect on a small 
> part of the data hosted on a broker.
> These hosts had ~170GiB of free memory available. We saw no signs of pressure 
> on either system or JVM heap memory before or after they reported this error. 
> Committed memory seems to be around 10%, so this doesn't seem to be an 
> overcommit issue.
> This Kafka cluster was upgraded to Kafka 3.7 two weeks ago (April 29th). 
> Prior to the upgrade, it was running on Kafka 2.4.
> We last reduced retention for ops on May 7th, after which we restored 
> retention to our default of four days. This was the second time we've 
> temporarily reduced and restored retention since the upgrade. This problem 
> did not manifest the previous time we did so, nor did it manifest on our 
> other Kafka 3.7 clusters.
> We are running on AWS 
> [d3en.12xlarge|https://instances.vantage.sh/aws/ec2/d3en.12xlarge] hosts. We 
> have 23 brokers, each with 24 disks. We're running in a JBOD configuration 
> (i.e. unraided).
> Since this cluster was upgraded from Kafka 2.4 and since we're using JBOD, 
> we're still using Zookeeper.
> Sample broker logs are attached. The 05-12 and 05-14 logs are from separate 
> hosts. Please let me know if I can provide any further information.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16779) Kafka retains logs past specified retention

2024-05-16 Thread Nicholas Feinberg (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847001#comment-17847001
 ] 

Nicholas Feinberg commented on KAFKA-16779:
---

No problem.

> Kafka retains logs past specified retention
> ---
>
> Key: KAFKA-16779
> URL: https://issues.apache.org/jira/browse/KAFKA-16779
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Nicholas Feinberg
>Priority: Major
>  Labels: expiration, retention
> Attachments: OOM.txt, kafka-20240512.log.gz, kafka-20240514.log.gz, 
> kafka-ooms.png, server.log.2024-05-12.gz, server.log.2024-05-14.gz, 
> state-change.log.2024-05-12.gz, state-change.log.2024-05-14.gz
>
>
> In a Kafka cluster with all topics set to four days of retention or longer 
> (345600000ms), most brokers seem to be retaining six days of data.
> This is true even for topics which have high throughput (500MB/s, 50k msgs/s) 
> and thus are regularly rolling new log segments. We observe this unexpectedly 
> high retention both via disk usage statistics and by requesting the oldest 
> available messages from Kafka.
> Some of these brokers crashed with an 'mmap failed' error (attached). When 
> those brokers started up again, they returned to the expected four days of 
> retention.
> Manually restarting brokers also seems to cause them to return to four days 
> of retention. Demoting and promoting brokers only has this effect on a small 
> part of the data hosted on a broker.
> These hosts had ~170GiB of free memory available. We saw no signs of pressure 
> on either system or JVM heap memory before or after they reported this error. 
> Committed memory seems to be around 10%, so this doesn't seem to be an 
> overcommit issue.
> This Kafka cluster was upgraded to Kafka 3.7 two weeks ago (April 29th). 
> Prior to the upgrade, it was running on Kafka 2.4.
> We last reduced retention for ops on May 7th, after which we restored 
> retention to our default of four days. This was the second time we've 
> temporarily reduced and restored retention since the upgrade. This problem 
> did not manifest the previous time we did so, nor did it manifest on our 
> other Kafka 3.7 clusters.
> We are running on AWS 
> [d3en.12xlarge|https://instances.vantage.sh/aws/ec2/d3en.12xlarge] hosts. We 
> have 23 brokers, each with 24 disks. We're running in a JBOD configuration 
> (i.e. unraided).
> Since this cluster was upgraded from Kafka 2.4 and since we're using JBOD, 
> we're still using Zookeeper.
> Sample broker logs are attached. The 05-12 and 05-14 logs are from separate 
> hosts. Please let me know if I can provide any further information.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16779) Kafka retains logs past specified retention

2024-05-15 Thread Nicholas Feinberg (Jira)
Nicholas Feinberg created KAFKA-16779:
-

 Summary: Kafka retains logs past specified retention
 Key: KAFKA-16779
 URL: https://issues.apache.org/jira/browse/KAFKA-16779
 Project: Kafka
  Issue Type: Bug
Affects Versions: 3.7.0
Reporter: Nicholas Feinberg
 Attachments: OOM.txt, kafka-20240512.log.gz, kafka-20240514.log.gz, 
kafka-ooms.png, server.log.2024-05-12.gz, server.log.2024-05-14.gz, 
state-change.log.2024-05-12.gz, state-change.log.2024-05-14.gz

In a Kafka cluster with all topics set to four days of retention or longer 
(345600000ms), most brokers seem to be retaining six days of data.

This is true even for topics which have high throughput (500MB/s, 50k msgs/s) 
and thus are regularly rolling new log segments. We observe this unexpectedly 
high retention both via disk usage statistics and by requesting the oldest 
available messages from Kafka.
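
(As a concrete example of the second check: the sketch below reads the oldest 
available record in one partition and prints its age. Broker address, topic, and 
partition are placeholders.)

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class OldestMessageAge {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        TopicPartition tp = new TopicPartition("example-topic", 0); // placeholder
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            consumer.seekToBeginning(Collections.singletonList(tp));

            ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(10));
            for (ConsumerRecord<byte[], byte[]> record : records.records(tp)) {
                double ageDays =
                    (System.currentTimeMillis() - record.timestamp()) / 86_400_000.0;
                // With four days of retention this should never be much above 4.0;
                // on the affected brokers it reports roughly six days.
                System.out.printf("oldest offset=%d, age=%.1f days%n", record.offset(), ageDays);
                break;
            }
        }
    }
}
```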

Some of these brokers crashed with an 'mmap failed' error (attached). When 
those brokers started up again, they returned to the expected four days of 
retention.

Manually restarting brokers also seems to cause them to return to four days of 
retention. Demoting and promoting brokers only has this effect on a small part 
of the data hosted on a broker.

These hosts had ~170GiB of free memory available. We saw no signs of pressure 
on either system or JVM heap memory before or after they reported this error. 
Committed memory seems to be around 10%, so this doesn't seem to be an 
overcommit issue.

This Kafka cluster was upgraded to Kafka 3.7 two weeks ago (April 29th). Prior 
to the upgrade, it was running on Kafka 2.4.

We last reduced retention for ops on May 7th, after which we restored retention 
to our default of four days. This was the second time we've temporarily reduced 
and restored retention since the upgrade. This problem did not manifest the 
previous time we did so, nor did it manifest on our other Kafka 3.7 clusters.

We are running on AWS 
[d3en.12xlarge|https://instances.vantage.sh/aws/ec2/d3en.12xlarge] hosts. We 
have 23 brokers, each with 24 disks. We're running in a JBOD configuration 
(i.e. unraided).

Since this cluster was upgraded from Kafka 2.4 and since we're using JBOD, 
we're still using Zookeeper.

Sample broker logs are attached. The 05-12 and 05-14 logs are from separate 
hosts. Please let me know if I can provide any further information.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-9227) Broker restart/snapshot times increase after upgrade from 1.1.0 to 2.3.1

2019-11-22 Thread Nicholas Feinberg (Jira)
Nicholas Feinberg created KAFKA-9227:


 Summary: Broker restart/snapshot times increase after upgrade from 
1.1.0 to 2.3.1
 Key: KAFKA-9227
 URL: https://issues.apache.org/jira/browse/KAFKA-9227
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 2.3.1
 Environment: Ubuntu 18, EC2, d2.8xlarge
Reporter: Nicholas Feinberg


I've been looking at upgrading my cluster from 1.1.0 to 2.3.1. While testing, 
I've noticed that shutting brokers down seems to take consistently longer on 
2.3.1. Specifically, the process of 'creating snapshots' seems to take several 
times longer than it did on 1.1.0. On a small testing setup, the time needed to 
create snapshots and shut down goes from ~20s to ~120s; with production-scale 
data, it goes from ~2min to ~30min.

The test hosts run about 384 partitions each (7 topics * 128 partitions each * 
3x replication / 7 brokers). The largest prod cluster has about 1344 
partitions per broker; the smallest cluster, which is also the slowest to 
restart, has about 2560.

In our largest prod cluster (16 d2.8xlarge brokers, 200k msg/s, 300 
MB/s), our restart cycles take about 3 minutes on 1.1.0 (counting ISR-rejoin 
time) and about 30 minutes on 2.3.1. The only other change we made between 
versions was increasing heap size from 8G to 16G.

To allow myself to roll back, I'm still using the 1.1 versions of the 
inter-broker protocol and the message format - is it possible that those could 
slow things down in 2.3.1? If not, any ideas what else could be at fault, or 
what I could do to narrow down the issue further?
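
(Concretely, the pins in question are the standard compatibility settings in 
`server.properties`, i.e. `inter.broker.protocol.version=1.1` and 
`log.message.format.version=1.1`.)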



--
This message was sent by Atlassian Jira
(v8.3.4#803005)