[
https://issues.apache.org/jira/browse/KAFKA-19895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
zhangtongr updated KAFKA-19895:
-------------------------------
Description:
We encountered an abnormal situation with the Kafka internal topic
`__consumer_offsets`, where two specific partition replicas grew to extremely
large sizes and did not respond to cleanup policies.
Details:
1. Two replicas under `__consumer_offsets` unexpectedly reached extremely large
sizes:
- Partition replica size: **8.4 TB**
- Partition replica size: **3.9 TB**
Other replicas of the same topic are only about **100 MB**, which is normal.
2. We attempted to force cleanup by applying the topic-level overrides
cleanup.policy = compact,delete
retention.ms = <short value>
(a minimal AdminClient sketch of applying such overrides is shown after this list).
This worked normally for all other partitions in the cluster.
However, **the two abnormal partitions did not reduce in size at all**, even
after hours of waiting.
3. After restarting the Kafka brokers, cleanup and compaction finally resumed,
and the partitions returned to normal size.
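For reference, the overrides from point 2 can also be applied programmatically. Below is a minimal sketch using the Kafka AdminClient's `incrementalAlterConfigs`; the bootstrap address and the 10-minute `retention.ms` value are placeholder assumptions, not our actual settings:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class ForceOffsetsCleanup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address; replace with the real cluster endpoint.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            ConfigResource offsetsTopic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "__consumer_offsets");

            // Topic-level overrides: compact+delete plus a short retention.
            // 600000 ms (10 minutes) is purely illustrative.
            AlterConfigOp setPolicy = new AlterConfigOp(
                    new ConfigEntry("cleanup.policy", "compact,delete"), AlterConfigOp.OpType.SET);
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "600000"), AlterConfigOp.OpType.SET);

            Map<ConfigResource, Collection<AlterConfigOp>> updates =
                    Collections.singletonMap(offsetsTopic, Arrays.asList(setPolicy, setRetention));

            // Apply the overrides and wait for the broker to acknowledge them.
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}
```

Note that this only changes the per-topic override; whether the log cleaner actually acts on it is exactly what appears to have failed here until the brokers were restarted.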
Questions:
1. What could cause only two `__consumer_offsets` partitions to grow to
multi-TB sizes,
while others remain at ~100 MB?
2. Why did modifying the cleanup policy not take effect on these abnormal
partitions until a broker restart?
3. Could this be a known issue or bug in Kafka **2.7.1**, especially related to
log cleanup or compaction?
4. Are there scenarios in which compaction for an offset partition can stall or
freeze indefinitely?
5. What mitigation or preventive steps are recommended to avoid this problem in
long-running clusters? (A size-monitoring sketch is shown after this list.)
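On question 5, one preventive step is simply to watch per-replica sizes of `__consumer_offsets` so runaway growth is caught long before it reaches terabytes. Below is a minimal sketch, assuming the `describeLogDirs`/`LogDirDescription` accessors present in the 2.7+ AdminClient; the bootstrap address, broker IDs, and the 1 GB alert threshold are placeholder assumptions:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.LogDirDescription;
import org.apache.kafka.clients.admin.ReplicaInfo;
import org.apache.kafka.common.TopicPartition;

import java.util.Arrays;
import java.util.Map;
import java.util.Properties;

public class OffsetsPartitionSizeCheck {
    private static final long ALERT_BYTES = 1024L * 1024 * 1024; // 1 GB, purely illustrative

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // Placeholder broker IDs; in practice they can be listed via describeCluster().
            Map<Integer, Map<String, LogDirDescription>> byBroker =
                    admin.describeLogDirs(Arrays.asList(0, 1, 2)).allDescriptions().get();

            // Walk every log dir on every broker and flag oversized __consumer_offsets replicas.
            for (Map.Entry<Integer, Map<String, LogDirDescription>> broker : byBroker.entrySet()) {
                for (Map.Entry<String, LogDirDescription> dir : broker.getValue().entrySet()) {
                    for (Map.Entry<TopicPartition, ReplicaInfo> replica
                            : dir.getValue().replicaInfos().entrySet()) {
                        TopicPartition tp = replica.getKey();
                        long size = replica.getValue().size();
                        if (tp.topic().equals("__consumer_offsets") && size > ALERT_BYTES) {
                            System.out.printf("broker %d, %s: %s is %d bytes%n",
                                    broker.getKey(), dir.getKey(), tp, size);
                        }
                    }
                }
            }
        }
    }
}
```

The broker-side LogCleanerManager metric `uncleanable-partitions-count` is another signal worth alerting on, since it reports partitions the log cleaner has marked as uncleanable.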
Environment:
- Kafka version: **2.7.1**
- Deployment: Kubernetes
- Cluster has been running for a long period without restart
- Affected topic: `__consumer_offsets`
- Affected partitions: 2 (replicas reached 8.4 TB and 3.9 TB)
was:
Description:
We encountered an abnormal situation with the Kafka system topic
`__consumer_offsets` where
two specific partition replicas grew to extremely large sizes and did not
respond to cleanup policies.
Details:
1. Two replicas under `__consumer_offsets` unexpectedly reached extremely large
sizes:
- Partition replica size: **8.4 TB**
- Partition replica size: **3.9 TB**
Other replicas of the same topic are only about **100 MB**, which is normal.
2. We attempted to force cleanup by applying:
> __consumer_offsets Partitions Growing to TB Size and Not Cleaning Until
> Broker Restart
> --------------------------------------------------------------------------------------
>
> Key: KAFKA-19895
> URL: https://issues.apache.org/jira/browse/KAFKA-19895
> Project: Kafka
> Issue Type: Bug
> Components: log cleaner
> Affects Versions: 2.7.1
> Reporter: zhangtongr
> Priority: Critical
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)