[jira] [Commented] (KAFKA-8335) Log cleaner skips Transactional mark and batch record, causing unlimited growth of __consumer_offsets
[ https://issues.apache.org/jira/browse/KAFKA-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360656#comment-17360656 ] Francisco Juan commented on KAFKA-8335: --- We solved the issue by updating to version 2.5.1, then the [offsets.retention.minutes|https://kafka.apache.org/documentation/#brokerconfigs_offsets.retention.minutes] kicked in and the problem disappeared. > Log cleaner skips Transactional mark and batch record, causing unlimited > growth of __consumer_offsets > - > > Key: KAFKA-8335 > URL: https://issues.apache.org/jira/browse/KAFKA-8335 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: Boquan Tang >Assignee: Jason Gustafson >Priority: Major > Fix For: 2.0.2, 2.1.2, 2.2.1 > > Attachments: seg_april_25.zip, segment.zip > > > My Colleague Weichu already sent out a mail to kafka user mailing list > regarding this issue, but we think it's worth having a ticket tracking it. > We are using Kafka Streams with exactly-once enabled on a Kafka cluster for > a while. > Recently we found that the size of __consumer_offsets partitions grew huge. > Some partition went over 30G. This caused Kafka to take quite long to load > "__consumer_offsets" topic on startup (it loads the topic in order to > become group coordinator). > We dumped the __consumer_offsets segments and found that while normal > offset commits are nicely compacted, transaction records (COMMIT, etc) are > all preserved. Looks like that since these messages don't have a key, the > LogCleaner is keeping them all: > -- > $ bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files > /003484332061.log --key-decoder-class > kafka.serializer.StringDecoder 2>/dev/null | cat -v | head > Dumping 003484332061.log > Starting offset: 3484332061 > offset: 3484332089 position: 549 CreateTime: 1556003706952 isvalid: true > keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 1006 > producerEpoch: 2530 sequence: -1 isTransactional: true headerKeys: [] > endTxnMarker: COMMIT coordinatorEpoch: 81 > offset: 3484332090 position: 627 CreateTime: 1556003706952 isvalid: true > keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 4005 > producerEpoch: 2520 sequence: -1 isTransactional: true headerKeys: [] > endTxnMarker: COMMIT coordinatorEpoch: 84 > ... > -- > Streams is doing transaction commits per 100ms (commit.interval.ms=100 when > exactly-once) so the __consumer_offsets is growing really fast. > Is this (to keep all transactions) by design, or is that a bug for > LogCleaner? What would be the way to clean up the topic? -- This message was sent by Atlassian Jira (v8.3.4#803005)
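For reference, the broker-side retention mentioned above can be verified on each broker before relying on it to shrink __consumer_offsets. A minimal sketch, assuming a conventional server.properties location (the path is a placeholder, not from the thread):

{code:bash}
# Placeholder path; use the properties file your broker actually starts with.
grep -E "offsets\.retention\.(minutes|check)" /etc/kafka/server.properties

# If the properties are absent, the broker defaults apply
# (offsets.retention.minutes=10080, i.e. 7 days, since Kafka 2.0).
{code}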
[jira] [Commented] (KAFKA-7447) Consumer offsets lost during leadership rebalance after bringing node back from clean shutdown
[ https://issues.apache.org/jira/browse/KAFKA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050315#comment-17050315 ] Francisco Juan commented on KAFKA-7447: --- Hello all, after upgrading our cluster from version 2.2.1 to version 2.2.2 we haven't seen this error again. PS. As per my own investigation, this issue did correlate with KAFKA-8896 > Consumer offsets lost during leadership rebalance after bringing node back > from clean shutdown > -- > > Key: KAFKA-7447 > URL: https://issues.apache.org/jira/browse/KAFKA-7447 > Project: Kafka > Issue Type: Bug >Affects Versions: 1.1.1, 2.0.0 >Reporter: Ben Isaacs >Priority: Major > > *Summary:* > * When 1 of my 3 brokers is cleanly shut down, consumption and production > continues as normal due to replication. (Consumers are rebalanced to the > replicas, and producers are rebalanced to the remaining brokers). However, > when the cleanly-shut-down broker comes back, after about 10 minutes, a > flurry of production errors occur and my consumers suddenly go back in time 2 > weeks, causing a long outage (12 hours+) as all messages are replayed on some > topics. > * The hypothesis is that the auto-leadership-rebalance is happening too > quickly after the downed broker returns, before it has had a chance to become > fully synchronised on all partitions. In particular, it seems that having > consumer offets ahead of the most recent data on the topic that consumer was > following causes the consumer to be reset to 0. > *Expected:* > * bringing a node back from a clean shut down does not cause any consumers > to reset to 0. > *Actual:* > * I experience approximately 12 hours of partial outage triggered at the > point that auto leadership rebalance occurs, after a cleanly shut down node > returns. > *Workaround:* > * disable auto leadership rebalance entirely. > * manually rebalance it from time to time when all nodes and all partitions > are fully replicated. > *My Setup:* > * Kafka deployment with 3 brokers and 2 topics. > * Replication factor is 3, for all topics. > * min.isr is 2, for all topics. > * Zookeeper deployment with 3 instances. > * In the region of 10 to 15 consumers, with 2 user topics (and, of course, > the system topics such as consumer offsets). Consumer offsets has the > standard 50 partitions. The user topics have about 3000 partitions in total. > * Offset retention time of 7 days, and topic retention time of 14 days. > * Input rate ~1000 messages/sec. > * Deployment happens to be on Google compute engine. > *Related Stack Overflow Post:* > https://stackoverflow.com/questions/52367825/apache-kafka-loses-some-consumer-offsets-when-when-i-bounce-a-broker > It was suggested I open a ticket by "Muir" who says he they have also > experienced this. > *Transcription of logs, showing the problem:* > Below, you can see chronologically sorted, interleaved, logs from the 3 > brokers. prod-kafka-2 is the node which was cleanly shut down and then > restarted. I filtered the messages only to those regardling > __consumer_offsets-29 because it's just too much to paste, otherwise. > ||Broker host||Broker ID|| > |prod-kafka-1|0| > |prod-kafka-2|1 (this one was restarted)| > |prod-kafka-3|2| > prod-kafka-2: (just starting up) > {code} > [2018-09-17 09:21:46,246] WARN [ReplicaFetcher replicaId=1, leaderId=2, > fetcherId=0] Based on follower's leader epoch, leader replied with an unknown > offset in __consumer_offsets-29. The initial fetch offset 0 will be used for > truncation. 
(kafka.server.ReplicaFetcherThread) > {code} > prod-kafka-3: (sees replica1 come back) > {code} > [2018-09-17 09:22:02,027] INFO [Partition __consumer_offsets-29 broker=2] > Expanding ISR from 0,2 to 0,2,1 (kafka.cluster.Partition) > {code} > prod-kafka-2: > {code} > [2018-09-17 09:22:33,892] INFO [GroupMetadataManager brokerId=1] Scheduling > unloading of offsets and group metadata from __consumer_offsets-29 > (kafka.coordinator.group.GroupMetadataManager) > [2018-09-17 09:22:33,902] INFO [GroupMetadataManager brokerId=1] Finished > unloading __consumer_offsets-29. Removed 0 cached offsets and 0 cached > groups. (kafka.coordinator.group.GroupMetadataManager) > [2018-09-17 09:24:03,287] INFO [ReplicaFetcherManager on broker 1] Removed > fetcher for partitions __consumer_offsets-29 > (kafka.server.ReplicaFetcherManager) > [2018-09-17 09:24:03,287] INFO [Partition __consumer_offsets-29 broker=1] > __consumer_offsets-29 starts at Leader Epoch 78 from offset 0. Previous > Leader Epoch was: 77 (kafka.cluster.Partition) > [2018-09-17 09:24:03,287] INFO [GroupMetadataManager brokerId=1] Scheduling > loading of offsets and group metadata from __consumer_offsets-29 > (kafka.coordinator.g
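The workaround described in the quoted report (disable automatic leader rebalancing, then move leadership back by hand once replicas have caught up) maps onto standard broker settings and tools. A hedged sketch, with placeholder ZooKeeper/broker addresses:

{code:bash}
# Broker-side setting (server.properties), per the quoted workaround:
#   auto.leader.rebalance.enable=false

# Once all partitions are fully back in the ISR, trigger preferred-leader election manually.
# Releases before 2.4:
bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181
# 2.4 and newer:
bin/kafka-leader-election.sh --bootstrap-server broker1:9092 \
  --election-type preferred --all-topic-partitions
{code}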
[jira] [Commented] (KAFKA-8896) NoSuchElementException after coordinator move
[ https://issues.apache.org/jira/browse/KAFKA-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050312#comment-17050312 ] Francisco Juan commented on KAFKA-8896: --- Upgrading the cluster to version 2.2.2 has fixed the issue. Thanks Jason for checking > NoSuchElementException after coordinator move > - > > Key: KAFKA-8896 > URL: https://issues.apache.org/jira/browse/KAFKA-8896 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.2.0, 2.3.0, 2.2.1 >Reporter: Jason Gustafson >Assignee: Boyang Chen >Priority: Major > Fix For: 2.2.2, 2.3.1 > > > Caught this exception in the wild: > {code:java} > java.util.NoSuchElementException: key not found: > consumer-group-38981ebe-4361-44e7-b710-7d11f5d35639 > at scala.collection.MapLike.default(MapLike.scala:235) > at scala.collection.MapLike.default$(MapLike.scala:234) > at scala.collection.AbstractMap.default(Map.scala:63) > at scala.collection.mutable.HashMap.apply(HashMap.scala:69) > at kafka.coordinator.group.GroupMetadata.get(GroupMetadata.scala:214) > at > kafka.coordinator.group.GroupCoordinator.$anonfun$tryCompleteHeartbeat$1(GroupCoordinator.scala:1008) > at > scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:253) > at kafka.coordinator.group.GroupMetadata.inLock(GroupMetadata.scala:209) > at > kafka.coordinator.group.GroupCoordinator.tryCompleteHeartbeat(GroupCoordinator.scala:1001) > at > kafka.coordinator.group.DelayedHeartbeat.tryComplete(DelayedHeartbeat.scala:34) > at > kafka.server.DelayedOperation.maybeTryComplete(DelayedOperation.scala:122) > at > kafka.server.DelayedOperationPurgatory$Watchers.tryCompleteWatched(DelayedOperation.scala:391) > at > kafka.server.DelayedOperationPurgatory.checkAndComplete(DelayedOperation.scala:295) > at > kafka.coordinator.group.GroupCoordinator.completeAndScheduleNextExpiration(GroupCoordinator.scala:802) > at > kafka.coordinator.group.GroupCoordinator.completeAndScheduleNextHeartbeatExpiration(GroupCoordinator.scala:795) > at > kafka.coordinator.group.GroupCoordinator.$anonfun$handleHeartbeat$2(GroupCoordinator.scala:543) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:253) > at kafka.coordinator.group.GroupMetadata.inLock(GroupMetadata.scala:209) > at > kafka.coordinator.group.GroupCoordinator.handleHeartbeat(GroupCoordinator.scala:516) > at kafka.server.KafkaApis.handleHeartbeatRequest(KafkaApis.scala:1617) > at kafka.server.KafkaApis.handle(KafkaApis.scala:155) {code} > > Looking at the logs, I see a coordinator change just prior to this exception. > The group was first unloaded as the coordinator moved to another broker and > then was loaded again as the coordinator was moved back. I am guessing that > somehow the delayed heartbeat is retaining the reference to the old > GroupMetadata instance. Not sure exactly how this can happen though. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
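For clusters still on an affected version, the two log lines that the reporters correlate (the heartbeat NoSuchElementException above and the "Error loading offsets" message that follows a coordinator move) are easy to scan for. A minimal sketch; the log location is a placeholder:

{code:bash}
LOG_DIR=/var/log/kafka   # placeholder; wherever server.log is written

# Heartbeat completion failing on a group member that is no longer in the group metadata:
grep -n "java.util.NoSuchElementException: key not found" "$LOG_DIR"/server.log*

# The correlated error reported shortly after a broker restart / coordinator move:
grep -n "Error loading offsets from __consumer_offsets" "$LOG_DIR"/server.log*
{code}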
[jira] [Commented] (KAFKA-7447) Consumer offsets lost during leadership rebalance after bringing node back from clean shutdown
[ https://issues.apache.org/jira/browse/KAFKA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988680#comment-16988680 ] Francisco Juan commented on KAFKA-7447: --- Thanks [~Karolis]. Would it be possible for you to describe an scenario where this issue is reproduced? I'd really like to confirm that updating the cluster version would fix this. Best regards > Consumer offsets lost during leadership rebalance after bringing node back > from clean shutdown > -- > > Key: KAFKA-7447 > URL: https://issues.apache.org/jira/browse/KAFKA-7447 > Project: Kafka > Issue Type: Bug >Affects Versions: 1.1.1, 2.0.0 >Reporter: Ben Isaacs >Priority: Major > > *Summary:* > * When 1 of my 3 brokers is cleanly shut down, consumption and production > continues as normal due to replication. (Consumers are rebalanced to the > replicas, and producers are rebalanced to the remaining brokers). However, > when the cleanly-shut-down broker comes back, after about 10 minutes, a > flurry of production errors occur and my consumers suddenly go back in time 2 > weeks, causing a long outage (12 hours+) as all messages are replayed on some > topics. > * The hypothesis is that the auto-leadership-rebalance is happening too > quickly after the downed broker returns, before it has had a chance to become > fully synchronised on all partitions. In particular, it seems that having > consumer offets ahead of the most recent data on the topic that consumer was > following causes the consumer to be reset to 0. > *Expected:* > * bringing a node back from a clean shut down does not cause any consumers > to reset to 0. > *Actual:* > * I experience approximately 12 hours of partial outage triggered at the > point that auto leadership rebalance occurs, after a cleanly shut down node > returns. > *Workaround:* > * disable auto leadership rebalance entirely. > * manually rebalance it from time to time when all nodes and all partitions > are fully replicated. > *My Setup:* > * Kafka deployment with 3 brokers and 2 topics. > * Replication factor is 3, for all topics. > * min.isr is 2, for all topics. > * Zookeeper deployment with 3 instances. > * In the region of 10 to 15 consumers, with 2 user topics (and, of course, > the system topics such as consumer offsets). Consumer offsets has the > standard 50 partitions. The user topics have about 3000 partitions in total. > * Offset retention time of 7 days, and topic retention time of 14 days. > * Input rate ~1000 messages/sec. > * Deployment happens to be on Google compute engine. > *Related Stack Overflow Post:* > https://stackoverflow.com/questions/52367825/apache-kafka-loses-some-consumer-offsets-when-when-i-bounce-a-broker > It was suggested I open a ticket by "Muir" who says he they have also > experienced this. > *Transcription of logs, showing the problem:* > Below, you can see chronologically sorted, interleaved, logs from the 3 > brokers. prod-kafka-2 is the node which was cleanly shut down and then > restarted. I filtered the messages only to those regardling > __consumer_offsets-29 because it's just too much to paste, otherwise. > ||Broker host||Broker ID|| > |prod-kafka-1|0| > |prod-kafka-2|1 (this one was restarted)| > |prod-kafka-3|2| > prod-kafka-2: (just starting up) > {code} > [2018-09-17 09:21:46,246] WARN [ReplicaFetcher replicaId=1, leaderId=2, > fetcherId=0] Based on follower's leader epoch, leader replied with an unknown > offset in __consumer_offsets-29. The initial fetch offset 0 will be used for > truncation. 
(kafka.server.ReplicaFetcherThread) > {code} > prod-kafka-3: (sees replica1 come back) > {code} > [2018-09-17 09:22:02,027] INFO [Partition __consumer_offsets-29 broker=2] > Expanding ISR from 0,2 to 0,2,1 (kafka.cluster.Partition) > {code} > prod-kafka-2: > {code} > [2018-09-17 09:22:33,892] INFO [GroupMetadataManager brokerId=1] Scheduling > unloading of offsets and group metadata from __consumer_offsets-29 > (kafka.coordinator.group.GroupMetadataManager) > [2018-09-17 09:22:33,902] INFO [GroupMetadataManager brokerId=1] Finished > unloading __consumer_offsets-29. Removed 0 cached offsets and 0 cached > groups. (kafka.coordinator.group.GroupMetadataManager) > [2018-09-17 09:24:03,287] INFO [ReplicaFetcherManager on broker 1] Removed > fetcher for partitions __consumer_offsets-29 > (kafka.server.ReplicaFetcherManager) > [2018-09-17 09:24:03,287] INFO [Partition __consumer_offsets-29 broker=1] > __consumer_offsets-29 starts at Leader Epoch 78 from offset 0. Previous > Leader Epoch was: 77 (kafka.cluster.Partition) > [2018-09-17 09:24:03,287] INFO [GroupMetadataManager brokerId=1] Scheduling > loading of offsets and group metadata from __consumer_offsets-29 > (kafka.
[jira] [Commented] (KAFKA-8896) NoSuchElementException after coordinator move
[ https://issues.apache.org/jira/browse/KAFKA-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988654#comment-16988654 ] Francisco Juan commented on KAFKA-8896: --- Hello [~hachikuji], could you please provide the steps or scenario that triggers this error? We have a cluster running version 2.2.1 and it is throwing this same error +sometimes+ when there's a broker restart. We can correlate the error that you saw with an earlier error looking like "ERROR [GroupMetadataManager brokerId=2] Error loading offsets from __consumer_offsets". This happen short after a broker restarts. We can't reproduce this error in a test environment, which we would like to do to verify if an upgrade would actually fix our issue (some consumer-groups losing their offsets) Our cluster setup looks like this: {code:java} Kafka version: 2.2.1 Number of brokers: 30 Number of leader partitions: 15785 Number of consumer-groups: 1150 inter.broker.protocol.version=1.1 min.insync.replicas=2{code} Errors stack trace detail: {code:java} [2019-11-28 08:13:22,603] ERROR [KafkaApi-42] Error when handling request: clientId=enrichment-worker-kafka, correlationId=92, api=HEARTBEAT, body={group_id=enrichment-worker-importio-webhook-consumer-eu,generation_id=9877,member_id=enrichment-worker-kafka-25821b62-f36b-4e64-905b-92019e4a5493} (kafka.server.KafkaApis) java.util.NoSuchElementException: key not found: consumer-name-25821b62-f36b-4e64-905b-92019e4a5493 at scala.collection.MapLike.default(MapLike.scala:235) at scala.collection.MapLike.default$(MapLike.scala:234) at scala.collection.AbstractMap.default(Map.scala:63) at scala.collection.mutable.HashMap.apply(HashMap.scala:69) at kafka.coordinator.group.GroupMetadata.get(GroupMetadata.scala:203) at kafka.coordinator.group.GroupCoordinator.$anonfun$tryCompleteHeartbeat$1(GroupCoordinator.scala:927) at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251) at kafka.coordinator.group.GroupMetadata.inLock(GroupMetadata.scala:198) at kafka.coordinator.group.GroupCoordinator.tryCompleteHeartbeat(GroupCoordinator.scala:920) at kafka.coordinator.group.DelayedHeartbeat.tryComplete(DelayedHeartbeat.scala:34) at kafka.server.DelayedOperation.maybeTryComplete(DelayedOperation.scala:121) at kafka.server.DelayedOperationPurgatory$Watchers.tryCompleteWatched(DelayedOperation.scala:388) at kafka.server.DelayedOperationPurgatory.checkAndComplete(DelayedOperation.scala:294) at kafka.coordinator.group.GroupCoordinator.completeAndScheduleNextExpiration(GroupCoordinator.scala:737) at kafka.coordinator.group.GroupCoordinator.completeAndScheduleNextHeartbeatExpiration(GroupCoordinator.scala:730) at kafka.coordinator.group.GroupCoordinator.$anonfun$handleHeartbeat$2(GroupCoordinator.scala:486) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251) at kafka.coordinator.group.GroupMetadata.inLock(GroupMetadata.scala:198) at kafka.coordinator.group.GroupCoordinator.handleHeartbeat(GroupCoordinator.scala:451) at kafka.server.KafkaApis.handleHeartbeatRequest(KafkaApis.scala:1336) at kafka.server.KafkaApis.handle(KafkaApis.scala:120) at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:69) at java.lang.Thread.run(Thread.java:748) [2019-11-28 08:13:18,175] ERROR [GroupMetadataManager brokerId=42] Error loading offsets from __consumer_offsets-24 (kafka.coordinator.group.GroupMetadataManager) 
java.util.NoSuchElementException: key not found: consumer-name-de868651-3166-46df-98c5-6196b9ade526 at scala.collection.MapLike.default(MapLike.scala:235) at scala.collection.MapLike.default(MapLike.scala:235) at scala.collection.MapLike.default$(MapLike.scala:234) at scala.collection.AbstractMap.default(Map.scala:63) at scala.collection.mutable.HashMap.apply(HashMap.scala:69) at kafka.coordinator.group.GroupMetadata.get(GroupMetadata.scala:203) at kafka.coordinator.group.GroupCoordinator.$anonfun$tryCompleteHeartbeat$1(GroupCoordinator.scala:927) at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251) at kafka.coordinator.group.GroupMetadata.inLock(GroupMetadata.scala:198) at kafka.coordinator.group.GroupCoordinator.tryCompleteHeartbeat(GroupCoordinator.scala:920) at kafka.coordinator.group.DelayedHeartbeat.tryComplete(DelayedHeartbeat.scala:34) at kafka.server.DelayedOperation.maybeTryComplete(DelayedOperation.scala:121) at kafka.server.DelayedOperationPurgatory$Watchers.tryCompleteWatched(DelayedOperation.scala:388) at kafka.server.DelayedOperationPurgatory.checkAndComplete(DelayedOperation.scala:294) at kafka.coordinator.group.GroupCoordinator.completeAndScheduleNextExpiration(GroupCoordinator.scala:737) at kafka.coordinator.group.GroupCoordina
[jira] [Commented] (KAFKA-7447) Consumer offsets lost during leadership rebalance after bringing node back from clean shutdown
[ https://issues.apache.org/jira/browse/KAFKA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988623#comment-16988623 ] Francisco Juan commented on KAFKA-7447: --- [~timvanlaer] we are seeing the same error that you saw with the message "Error loading offsets" message from the GroupMetadataManager class. We have two questions: 1. Were you able to reproduce the error? If so, could you give us a hint? 2. Did updating to version 2.2.2 fix the issue? Our setup looks like this: Kafka version: 2.2.1 Number of brokers: 30 Number of leader partitions: 15785 Number of consumer-groups: 1150 inter.broker.protocol.version=1.1 min.insync.replicas=2 > Consumer offsets lost during leadership rebalance after bringing node back > from clean shutdown > -- > > Key: KAFKA-7447 > URL: https://issues.apache.org/jira/browse/KAFKA-7447 > Project: Kafka > Issue Type: Bug >Affects Versions: 1.1.1, 2.0.0 >Reporter: Ben Isaacs >Priority: Major > > *Summary:* > * When 1 of my 3 brokers is cleanly shut down, consumption and production > continues as normal due to replication. (Consumers are rebalanced to the > replicas, and producers are rebalanced to the remaining brokers). However, > when the cleanly-shut-down broker comes back, after about 10 minutes, a > flurry of production errors occur and my consumers suddenly go back in time 2 > weeks, causing a long outage (12 hours+) as all messages are replayed on some > topics. > * The hypothesis is that the auto-leadership-rebalance is happening too > quickly after the downed broker returns, before it has had a chance to become > fully synchronised on all partitions. In particular, it seems that having > consumer offets ahead of the most recent data on the topic that consumer was > following causes the consumer to be reset to 0. > *Expected:* > * bringing a node back from a clean shut down does not cause any consumers > to reset to 0. > *Actual:* > * I experience approximately 12 hours of partial outage triggered at the > point that auto leadership rebalance occurs, after a cleanly shut down node > returns. > *Workaround:* > * disable auto leadership rebalance entirely. > * manually rebalance it from time to time when all nodes and all partitions > are fully replicated. > *My Setup:* > * Kafka deployment with 3 brokers and 2 topics. > * Replication factor is 3, for all topics. > * min.isr is 2, for all topics. > * Zookeeper deployment with 3 instances. > * In the region of 10 to 15 consumers, with 2 user topics (and, of course, > the system topics such as consumer offsets). Consumer offsets has the > standard 50 partitions. The user topics have about 3000 partitions in total. > * Offset retention time of 7 days, and topic retention time of 14 days. > * Input rate ~1000 messages/sec. > * Deployment happens to be on Google compute engine. > *Related Stack Overflow Post:* > https://stackoverflow.com/questions/52367825/apache-kafka-loses-some-consumer-offsets-when-when-i-bounce-a-broker > It was suggested I open a ticket by "Muir" who says he they have also > experienced this. > *Transcription of logs, showing the problem:* > Below, you can see chronologically sorted, interleaved, logs from the 3 > brokers. prod-kafka-2 is the node which was cleanly shut down and then > restarted. I filtered the messages only to those regardling > __consumer_offsets-29 because it's just too much to paste, otherwise. 
> ||Broker host||Broker ID|| > |prod-kafka-1|0| > |prod-kafka-2|1 (this one was restarted)| > |prod-kafka-3|2| > prod-kafka-2: (just starting up) > {code} > [2018-09-17 09:21:46,246] WARN [ReplicaFetcher replicaId=1, leaderId=2, > fetcherId=0] Based on follower's leader epoch, leader replied with an unknown > offset in __consumer_offsets-29. The initial fetch offset 0 will be used for > truncation. (kafka.server.ReplicaFetcherThread) > {code} > prod-kafka-3: (sees replica1 come back) > {code} > [2018-09-17 09:22:02,027] INFO [Partition __consumer_offsets-29 broker=2] > Expanding ISR from 0,2 to 0,2,1 (kafka.cluster.Partition) > {code} > prod-kafka-2: > {code} > [2018-09-17 09:22:33,892] INFO [GroupMetadataManager brokerId=1] Scheduling > unloading of offsets and group metadata from __consumer_offsets-29 > (kafka.coordinator.group.GroupMetadataManager) > [2018-09-17 09:22:33,902] INFO [GroupMetadataManager brokerId=1] Finished > unloading __consumer_offsets-29. Removed 0 cached offsets and 0 cached > groups. (kafka.coordinator.group.GroupMetadataManager) > [2018-09-17 09:24:03,287] INFO [ReplicaFetcherManager on broker 1] Removed > fetcher for partitions __consumer_offsets-29 > (kafka.server.ReplicaFetcherManager) > [2018-09-17 09:24:03,287] INFO [Partition __consumer_offsets-29 broke
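One way to confirm the symptom in the quoted report (committed offsets jumping backwards after the restarted broker takes leadership back) is to snapshot a group's offsets before and after the bounce. A sketch with placeholder group and broker names:

{code:bash}
GROUP=my-consumer-group   # placeholder
BROKER=broker1:9092       # placeholder

# Before restarting the broker:
bin/kafka-consumer-groups.sh --bootstrap-server "$BROKER" --describe --group "$GROUP" > before.txt
# After the leadership rebalance has run:
bin/kafka-consumer-groups.sh --bootstrap-server "$BROKER" --describe --group "$GROUP" > after.txt

# A CURRENT-OFFSET that falls back to 0 (or lag that jumps by days of data)
# reproduces what the report describes.
diff before.txt after.txt
{code}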
[jira] [Commented] (KAFKA-8335) Log cleaner skips Transactional mark and batch record, causing unlimited growth of __consumer_offsets
[ https://issues.apache.org/jira/browse/KAFKA-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872013#comment-16872013 ] Francisco Juan commented on KAFKA-8335: --- I'm seeing this problem only in some partitions, check below partitions sorted by disk size: {code:java} | Topic | P | Disk (MB) | | __consumer_offsets | 6 | 23599 | | __consumer_offsets | 8 | 12285 | | __consumer_offsets | 43 | 4725 | | __consumer_offsets | 23 | 875.86| | __consumer_offsets | 31 | 443.35| | __consumer_offsets | 16 | 433.02| | __consumer_offsets | 18 | 418.52| | __consumer_offsets | 11 | 385.40| | __consumer_offsets | 14 | 308.03| | __consumer_offsets | 47 | 268.80| | __consumer_offsets | 13 | 254.77| | __consumer_offsets | 29 | 250.02| | __consumer_offsets | 30 | 233.38| | __consumer_offsets | 12 | 148.13| | __consumer_offsets | 46 | 116.77| | __consumer_offsets | 26 | 114.36| | __consumer_offsets | 3 | 101.09| ... | __consumer_offsets | 24 | 7.09 |{code} Our topic is using 50 partitions. > Log cleaner skips Transactional mark and batch record, causing unlimited > growth of __consumer_offsets > - > > Key: KAFKA-8335 > URL: https://issues.apache.org/jira/browse/KAFKA-8335 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: Boquan Tang >Assignee: Jason Gustafson >Priority: Major > Fix For: 2.0.2, 2.1.2, 2.2.1 > > Attachments: seg_april_25.zip, segment.zip > > > My Colleague Weichu already sent out a mail to kafka user mailing list > regarding this issue, but we think it's worth having a ticket tracking it. > We are using Kafka Streams with exactly-once enabled on a Kafka cluster for > a while. > Recently we found that the size of __consumer_offsets partitions grew huge. > Some partition went over 30G. This caused Kafka to take quite long to load > "__consumer_offsets" topic on startup (it loads the topic in order to > become group coordinator). > We dumped the __consumer_offsets segments and found that while normal > offset commits are nicely compacted, transaction records (COMMIT, etc) are > all preserved. Looks like that since these messages don't have a key, the > LogCleaner is keeping them all: > -- > $ bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files > /003484332061.log --key-decoder-class > kafka.serializer.StringDecoder 2>/dev/null | cat -v | head > Dumping 003484332061.log > Starting offset: 3484332061 > offset: 3484332089 position: 549 CreateTime: 1556003706952 isvalid: true > keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 1006 > producerEpoch: 2530 sequence: -1 isTransactional: true headerKeys: [] > endTxnMarker: COMMIT coordinatorEpoch: 81 > offset: 3484332090 position: 627 CreateTime: 1556003706952 isvalid: true > keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 4005 > producerEpoch: 2520 sequence: -1 isTransactional: true headerKeys: [] > endTxnMarker: COMMIT coordinatorEpoch: 84 > ... > -- > Streams is doing transaction commits per 100ms (commit.interval.ms=100 when > exactly-once) so the __consumer_offsets is growing really fast. > Is this (to keep all transactions) by design, or is that a bug for > LogCleaner? What would be the way to clean up the topic? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
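The per-partition sizes in the table above can be pulled straight from the broker's data directory. A minimal sketch; the directory is a placeholder for whatever log.dirs points at:

{code:bash}
# Sizes in MB, largest partitions first (matches the units in the table above).
du -sm /var/kafka-logs/__consumer_offsets-* | sort -rn | head -20
{code}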
[jira] [Comment Edited] (KAFKA-8335) Log cleaner skips Transactional mark and batch record, causing unlimited growth of __consumer_offsets
[ https://issues.apache.org/jira/browse/KAFKA-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870892#comment-16870892 ] Francisco Juan edited comment on KAFKA-8335 at 6/24/19 7:55 AM: Hello [~boquan] I have applied your suggestion but seems to not be able to solve the issue for 2 out of 50 partitions. The biggest offender in my case would be partition 6 of __consumer_offsets which is now at 22GB. This is what the new state after I have cleaned up the cleaner-offset-checkpoint: {code:java} [2019-06-24 06:18:05,968] DEBUG Finding range of cleanable offsets for log=__consumer_offsets-6 topicPartition=__consumer_offsets-6. Last clean offset=Some(0) now=1561357085930 => firstDirtyOffset=0 firstUncleanableOffset=5149534913 activeSegment.baseOffset=5149534913 (kafka.log.LogCleanerManager$){code} and this is the outcome of the cleaner: {code:java} [2019-06-24 06:25:48,969] INFO [kafka-log-cleaner-thread-0]: Log cleaner thread 0 cleaned log __consumer_offsets-6 (dirty section = [0, 0]) 25,096.6 MB of log processed in 463.0 seconds (54.2 MB/sec). Indexed 25,096.6 MB in 177.6 seconds (141.3 Mb/sec, 38.4% of total time) Buffer utilization: 0.0% Cleaned 25,096.6 MB in 285.4 seconds (87.9 Mb/sec, 61.6% of total time) Start size: 25,096.6 MB (313,976,723 messages) End size: 21,674.0 MB (291,368,805 messages) 13.6% size reduction (7.2% fewer messages) (kafka.log.LogCleaner){code} The partition currently has 226 log files and takes 22G of disk space. When sampling one of the earliest file in the partition, these are my results: {code:java} $> kafka-dump-log.sh --files 004114035438.log | head -n 5 Dumping 004114035438.log Starting offset: 4114035438 baseOffset: 4114035438 lastOffset: 4114035438 count: 1 baseSequence: -1 lastSequence: -1 producerId: 547022 producerEpoch: 0 partitionLeaderEpoch: 490 isTransactional: true isControl: true position: 0 CreateTime: 1556176906520 size: 78 magic: 2 compresscodec: NONE crc: 3444550803 isvalid: true baseOffset: 4114035439 lastOffset: 4114035439 count: 1 baseSequence: -1 lastSequence: -1 producerId: 551003 producerEpoch: 0 partitionLeaderEpoch: 490 isTransactional: true isControl: true position: 78 CreateTime: 1556176906524 size: 78 magic: 2 compresscodec: NONE crc: 517398751 isvalid: true baseOffset: 4114035442 lastOffset: 4114035442 count: 1 baseSequence: -1 lastSequence: -1 producerId: 556020 producerEpoch: 0 partitionLeaderEpoch: 490 isTransactional: true isControl: true position: 156 CreateTime: 1556176906560 size: 78 magic: 2 compresscodec: NONE crc: 174729981 isvalid: true $> kafka-dump-log.sh --files 004114035438.log | grep "isTransactional: true" | wc -l 1340521 $> kafka-dump-log.sh --files 004114035438.log | grep -v "isTransactional: true" | wc -l 2 {code} The timestamp shown here is from {{Thu Apr 25 07:21:46 UTC 2019}} What I think are relevant settings for my cluster are: * kafka cluster version: 2.2.1 * inter.broker.protocol.version: 1.1 * offsets.retention.minutes=20160 * __consumer_offsets settings: segment.bytes=104857600,cleanup.policy=compact * I've applied temporarily the suggestion from above: retention.ms=120960 (two weeks) and min.cleanable.dirty.ratio=0.2 What additional information could be useful to try and dig up the issue? Thanks was (Author: francisco.juan): Hello [~boquan] I have applied your suggestion but seems to not be able to solve the issue for 2 out of 50 partitions. The biggest offender in my case would be partition 6 of __consumer_offsets which is now at 22GB. 
This is what the new state after I have cleaned up the cleaner-offset-checkpoint: {code:java} [2019-06-24 06:18:05,968] DEBUG Finding range of cleanable offsets for log=__consumer_offsets-6 topicPartition=__consumer_offsets-6. Last clean offset=Some(0) now=1561357085930 => firstDirtyOffset=0 firstUncleanableOffset=5149534913 activeSegment.baseOffset=5149534913 (kafka.log.LogCleanerManager$){code} and this is the outcome of the cleaner: {code:java} [2019-06-24 06:25:48,969] INFO [kafka-log-cleaner-thread-0]: Log cleaner thread 0 cleaned log __consumer_offsets-6 (dirty section = [0, 0]) 25,096.6 MB of log processed in 463.0 seconds (54.2 MB/sec). Indexed 25,096.6 MB in 177.6 seconds (141.3 Mb/sec, 38.4% of total time) Buffer utilization: 0.0% Cleaned 25,096.6 MB in 285.4 seconds (87.9 Mb/sec, 61.6% of total time) Start size: 25,096.6 MB (313,976,723 messages) End size: 21,674.0 MB (291,368,805 messages) 13.6% size reduction (7.2% fewer messages) (kafka.log.LogCleaner){code} The partition currently has 226 log files and takes 22G of disk space. When sampling one of the earliest file in the partition, these are my results: {code:java} $> kafka-dump-log.sh --files 004114035438.log | head -n 5 Dumping 004114035438.log Starting offset: 4114035438 baseOffset: 4114035438 l
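The temporary overrides mentioned in the comment (retention.ms plus min.cleanable.dirty.ratio=0.2 on __consumer_offsets) can be applied and later removed with kafka-configs.sh. A hedged sketch with a placeholder ZooKeeper address; note that two weeks is 1209600000 ms, and that retention.ms only removes segments when cleanup.policy also includes delete:

{code:bash}
ZK=zk1:2181   # placeholder

# Apply the temporary overrides (two weeks expressed in milliseconds):
bin/kafka-configs.sh --zookeeper "$ZK" --entity-type topics --entity-name __consumer_offsets \
  --alter --add-config retention.ms=1209600000,min.cleanable.dirty.ratio=0.2

# Remove them again once the partition has compacted back down:
bin/kafka-configs.sh --zookeeper "$ZK" --entity-type topics --entity-name __consumer_offsets \
  --alter --delete-config retention.ms,min.cleanable.dirty.ratio
{code}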
[jira] [Commented] (KAFKA-8335) Log cleaner skips Transactional mark and batch record, causing unlimited growth of __consumer_offsets
[ https://issues.apache.org/jira/browse/KAFKA-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870892#comment-16870892 ] Francisco Juan commented on KAFKA-8335: --- Hello [~boquan] I have applied your suggestion but seems to not be able to solve the issue for 2 out of 50 partitions. The biggest offender in my case would be partition 6 of __consumer_offsets which is now at 22GB. This is what the new state after I have cleaned up the cleaner-offset-checkpoint: {code:java} [2019-06-24 06:18:05,968] DEBUG Finding range of cleanable offsets for log=__consumer_offsets-6 topicPartition=__consumer_offsets-6. Last clean offset=Some(0) now=1561357085930 => firstDirtyOffset=0 firstUncleanableOffset=5149534913 activeSegment.baseOffset=5149534913 (kafka.log.LogCleanerManager$){code} and this is the outcome of the cleaner: {code:java} [2019-06-24 06:25:48,969] INFO [kafka-log-cleaner-thread-0]: Log cleaner thread 0 cleaned log __consumer_offsets-6 (dirty section = [0, 0]) 25,096.6 MB of log processed in 463.0 seconds (54.2 MB/sec). Indexed 25,096.6 MB in 177.6 seconds (141.3 Mb/sec, 38.4% of total time) Buffer utilization: 0.0% Cleaned 25,096.6 MB in 285.4 seconds (87.9 Mb/sec, 61.6% of total time) Start size: 25,096.6 MB (313,976,723 messages) End size: 21,674.0 MB (291,368,805 messages) 13.6% size reduction (7.2% fewer messages) (kafka.log.LogCleaner){code} The partition currently has 226 log files and takes 22G of disk space. When sampling one of the earliest file in the partition, these are my results: {code:java} $> kafka-dump-log.sh --files 004114035438.log | head -n 5 Dumping 004114035438.log Starting offset: 4114035438 baseOffset: 4114035438 lastOffset: 4114035438 count: 1 baseSequence: -1 lastSequence: -1 producerId: 547022 producerEpoch: 0 partitionLeaderEpoch: 490 isTransactional: true isControl: true position: 0 CreateTime: 1556176906520 size: 78 magic: 2 compresscodec: NONE crc: 3444550803 isvalid: true baseOffset: 4114035439 lastOffset: 4114035439 count: 1 baseSequence: -1 lastSequence: -1 producerId: 551003 producerEpoch: 0 partitionLeaderEpoch: 490 isTransactional: true isControl: true position: 78 CreateTime: 1556176906524 size: 78 magic: 2 compresscodec: NONE crc: 517398751 isvalid: true baseOffset: 4114035442 lastOffset: 4114035442 count: 1 baseSequence: -1 lastSequence: -1 producerId: 556020 producerEpoch: 0 partitionLeaderEpoch: 490 isTransactional: true isControl: true position: 156 CreateTime: 1556176906560 size: 78 magic: 2 compresscodec: NONE crc: 174729981 isvalid: true $> kafka-dump-log.sh --files 004114035438.log | grep "isTransactional: true" | wc -l 1340521 $> kafka-dump-log.sh --files 004114035438.log | grep -v "isTransactional: true" | wc -l 2 {code} The timestamp shown here is from {{Thu Apr 25 07:21:46 UTC 2019}} What I think are relevant settings for my cluster are: * offsets.retention.minutes=20160 * __consumer_offsets settings: segment.bytes=104857600,cleanup.policy=compact * I've applied temporarily the suggestion from above: retention.ms=120960 (two weeks) and min.cleanable.dirty.ratio=0.2 What additional information could be useful to try and dig up the issue? 
Thanks > Log cleaner skips Transactional mark and batch record, causing unlimited > growth of __consumer_offsets > - > > Key: KAFKA-8335 > URL: https://issues.apache.org/jira/browse/KAFKA-8335 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: Boquan Tang >Assignee: Jason Gustafson >Priority: Major > Fix For: 2.0.2, 2.1.2, 2.2.1 > > Attachments: seg_april_25.zip, segment.zip > > > My Colleague Weichu already sent out a mail to kafka user mailing list > regarding this issue, but we think it's worth having a ticket tracking it. > We are using Kafka Streams with exactly-once enabled on a Kafka cluster for > a while. > Recently we found that the size of __consumer_offsets partitions grew huge. > Some partition went over 30G. This caused Kafka to take quite long to load > "__consumer_offsets" topic on startup (it loads the topic in order to > become group coordinator). > We dumped the __consumer_offsets segments and found that while normal > offset commits are nicely compacted, transaction records (COMMIT, etc) are > all preserved. Looks like that since these messages don't have a key, the > LogCleaner is keeping them all: > -- > $ bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files > /003484332061.log --key-decoder-class > kafka.serializer.StringDecoder 2>/dev/null | cat -v | head > Dumping 003484332061.log > Starting offset: 3484332061 > offset: 3484332089 position: 549 CreateTime: 15
[jira] [Comment Edited] (KAFKA-8335) Log cleaner skips Transactional mark and batch record, causing unlimited growth of __consumer_offsets
[ https://issues.apache.org/jira/browse/KAFKA-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870892#comment-16870892 ] Francisco Juan edited comment on KAFKA-8335 at 6/24/19 7:44 AM: Hello [~boquan] I have applied your suggestion but seems to not be able to solve the issue for 2 out of 50 partitions. The biggest offender in my case would be partition 6 of __consumer_offsets which is now at 22GB. This is what the new state after I have cleaned up the cleaner-offset-checkpoint: {code:java} [2019-06-24 06:18:05,968] DEBUG Finding range of cleanable offsets for log=__consumer_offsets-6 topicPartition=__consumer_offsets-6. Last clean offset=Some(0) now=1561357085930 => firstDirtyOffset=0 firstUncleanableOffset=5149534913 activeSegment.baseOffset=5149534913 (kafka.log.LogCleanerManager$){code} and this is the outcome of the cleaner: {code:java} [2019-06-24 06:25:48,969] INFO [kafka-log-cleaner-thread-0]: Log cleaner thread 0 cleaned log __consumer_offsets-6 (dirty section = [0, 0]) 25,096.6 MB of log processed in 463.0 seconds (54.2 MB/sec). Indexed 25,096.6 MB in 177.6 seconds (141.3 Mb/sec, 38.4% of total time) Buffer utilization: 0.0% Cleaned 25,096.6 MB in 285.4 seconds (87.9 Mb/sec, 61.6% of total time) Start size: 25,096.6 MB (313,976,723 messages) End size: 21,674.0 MB (291,368,805 messages) 13.6% size reduction (7.2% fewer messages) (kafka.log.LogCleaner){code} The partition currently has 226 log files and takes 22G of disk space. When sampling one of the earliest file in the partition, these are my results: {code:java} $> kafka-dump-log.sh --files 004114035438.log | head -n 5 Dumping 004114035438.log Starting offset: 4114035438 baseOffset: 4114035438 lastOffset: 4114035438 count: 1 baseSequence: -1 lastSequence: -1 producerId: 547022 producerEpoch: 0 partitionLeaderEpoch: 490 isTransactional: true isControl: true position: 0 CreateTime: 1556176906520 size: 78 magic: 2 compresscodec: NONE crc: 3444550803 isvalid: true baseOffset: 4114035439 lastOffset: 4114035439 count: 1 baseSequence: -1 lastSequence: -1 producerId: 551003 producerEpoch: 0 partitionLeaderEpoch: 490 isTransactional: true isControl: true position: 78 CreateTime: 1556176906524 size: 78 magic: 2 compresscodec: NONE crc: 517398751 isvalid: true baseOffset: 4114035442 lastOffset: 4114035442 count: 1 baseSequence: -1 lastSequence: -1 producerId: 556020 producerEpoch: 0 partitionLeaderEpoch: 490 isTransactional: true isControl: true position: 156 CreateTime: 1556176906560 size: 78 magic: 2 compresscodec: NONE crc: 174729981 isvalid: true $> kafka-dump-log.sh --files 004114035438.log | grep "isTransactional: true" | wc -l 1340521 $> kafka-dump-log.sh --files 004114035438.log | grep -v "isTransactional: true" | wc -l 2 {code} The timestamp shown here is from {{Thu Apr 25 07:21:46 UTC 2019}} What I think are relevant settings for my cluster are: * offsets.retention.minutes=20160 * __consumer_offsets settings: segment.bytes=104857600,cleanup.policy=compact * I've applied temporarily the suggestion from above: retention.ms=120960 (two weeks) and min.cleanable.dirty.ratio=0.2 What additional information could be useful to try and dig up the issue? Thanks was (Author: francisco.juan): Hello [~boquan] I have applied your suggestion but seems to not be able to solve the issue for 2 out of 50 partitions. The biggest offender in my case would be partition 6 of __consumer_offsets which is now at 22GB. 
This is what the new state after I have cleaned up the cleaner-offset-checkpoint: {code:java} [2019-06-24 06:18:05,968] DEBUG Finding range of cleanable offsets for log=__consumer_offsets-6 topicPartition=__consumer_offsets-6. Last clean offset=Some(0) now=1561357085930 => firstDirtyOffset=0 firstUncleanableOffset=5149534913 activeSegment.baseOffset=5149534913 (kafka.log.LogCleanerManager$){code} and this is the outcome of the cleaner: {code:java} [2019-06-24 06:25:48,969] INFO [kafka-log-cleaner-thread-0]: Log cleaner thread 0 cleaned log __consumer_offsets-6 (dirty section = [0, 0]) 25,096.6 MB of log processed in 463.0 seconds (54.2 MB/sec). Indexed 25,096.6 MB in 177.6 seconds (141.3 Mb/sec, 38.4% of total time) Buffer utilization: 0.0% Cleaned 25,096.6 MB in 285.4 seconds (87.9 Mb/sec, 61.6% of total time) Start size: 25,096.6 MB (313,976,723 messages) End size: 21,674.0 MB (291,368,805 messages) 13.6% size reduction (7.2% fewer messages) (kafka.log.LogCleaner){code} The partition currently has 226 log files and takes 22G of disk space. When sampling one of the earliest file in the partition, these are my results: {code:java} $> kafka-dump-log.sh --files 004114035438.log | head -n 5 Dumping 004114035438.log Starting offset: 4114035438 baseOffset: 4114035438 lastOffset: 4114035438 count: 1 baseSequence: -1 lastS
[jira] [Issue Comment Deleted] (KAFKA-8335) Log cleaner skips Transactional mark and batch record, causing unlimited growth of __consumer_offsets
[ https://issues.apache.org/jira/browse/KAFKA-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francisco Juan updated KAFKA-8335: -- Comment: was deleted (was: Hello [~boquan]) > Log cleaner skips Transactional mark and batch record, causing unlimited > growth of __consumer_offsets > - > > Key: KAFKA-8335 > URL: https://issues.apache.org/jira/browse/KAFKA-8335 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: Boquan Tang >Assignee: Jason Gustafson >Priority: Major > Fix For: 2.0.2, 2.1.2, 2.2.1 > > Attachments: seg_april_25.zip, segment.zip > > > My Colleague Weichu already sent out a mail to kafka user mailing list > regarding this issue, but we think it's worth having a ticket tracking it. > We are using Kafka Streams with exactly-once enabled on a Kafka cluster for > a while. > Recently we found that the size of __consumer_offsets partitions grew huge. > Some partition went over 30G. This caused Kafka to take quite long to load > "__consumer_offsets" topic on startup (it loads the topic in order to > become group coordinator). > We dumped the __consumer_offsets segments and found that while normal > offset commits are nicely compacted, transaction records (COMMIT, etc) are > all preserved. Looks like that since these messages don't have a key, the > LogCleaner is keeping them all: > -- > $ bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files > /003484332061.log --key-decoder-class > kafka.serializer.StringDecoder 2>/dev/null | cat -v | head > Dumping 003484332061.log > Starting offset: 3484332061 > offset: 3484332089 position: 549 CreateTime: 1556003706952 isvalid: true > keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 1006 > producerEpoch: 2530 sequence: -1 isTransactional: true headerKeys: [] > endTxnMarker: COMMIT coordinatorEpoch: 81 > offset: 3484332090 position: 627 CreateTime: 1556003706952 isvalid: true > keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 4005 > producerEpoch: 2520 sequence: -1 isTransactional: true headerKeys: [] > endTxnMarker: COMMIT coordinatorEpoch: 84 > ... > -- > Streams is doing transaction commits per 100ms (commit.interval.ms=100 when > exactly-once) so the __consumer_offsets is growing really fast. > Is this (to keep all transactions) by design, or is that a bug for > LogCleaner? What would be the way to clean up the topic? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KAFKA-8335) Log cleaner skips Transactional mark and batch record, causing unlimited growth of __consumer_offsets
[ https://issues.apache.org/jira/browse/KAFKA-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870881#comment-16870881 ] Francisco Juan commented on KAFKA-8335: --- Hello [~boquan] > Log cleaner skips Transactional mark and batch record, causing unlimited > growth of __consumer_offsets > - > > Key: KAFKA-8335 > URL: https://issues.apache.org/jira/browse/KAFKA-8335 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: Boquan Tang >Assignee: Jason Gustafson >Priority: Major > Fix For: 2.0.2, 2.1.2, 2.2.1 > > Attachments: seg_april_25.zip, segment.zip > > > My Colleague Weichu already sent out a mail to kafka user mailing list > regarding this issue, but we think it's worth having a ticket tracking it. > We are using Kafka Streams with exactly-once enabled on a Kafka cluster for > a while. > Recently we found that the size of __consumer_offsets partitions grew huge. > Some partition went over 30G. This caused Kafka to take quite long to load > "__consumer_offsets" topic on startup (it loads the topic in order to > become group coordinator). > We dumped the __consumer_offsets segments and found that while normal > offset commits are nicely compacted, transaction records (COMMIT, etc) are > all preserved. Looks like that since these messages don't have a key, the > LogCleaner is keeping them all: > -- > $ bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files > /003484332061.log --key-decoder-class > kafka.serializer.StringDecoder 2>/dev/null | cat -v | head > Dumping 003484332061.log > Starting offset: 3484332061 > offset: 3484332089 position: 549 CreateTime: 1556003706952 isvalid: true > keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 1006 > producerEpoch: 2530 sequence: -1 isTransactional: true headerKeys: [] > endTxnMarker: COMMIT coordinatorEpoch: 81 > offset: 3484332090 position: 627 CreateTime: 1556003706952 isvalid: true > keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 4005 > producerEpoch: 2520 sequence: -1 isTransactional: true headerKeys: [] > endTxnMarker: COMMIT coordinatorEpoch: 84 > ... > -- > Streams is doing transaction commits per 100ms (commit.interval.ms=100 when > exactly-once) so the __consumer_offsets is growing really fast. > Is this (to keep all transactions) by design, or is that a bug for > LogCleaner? What would be the way to clean up the topic? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KAFKA-8335) Log cleaner skips Transactional mark and batch record, causing unlimited growth of __consumer_offsets
[ https://issues.apache.org/jira/browse/KAFKA-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868545#comment-16868545 ] Francisco Juan commented on KAFKA-8335: --- Hi Boquan Cleaning up the {{cleaner-offset-checkpoint}} seems to be solving the issue. Thanks a lot for the quick reply! > Log cleaner skips Transactional mark and batch record, causing unlimited > growth of __consumer_offsets > - > > Key: KAFKA-8335 > URL: https://issues.apache.org/jira/browse/KAFKA-8335 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: Boquan Tang >Assignee: Jason Gustafson >Priority: Major > Fix For: 2.0.2, 2.1.2, 2.2.1 > > Attachments: seg_april_25.zip, segment.zip > > > My Colleague Weichu already sent out a mail to kafka user mailing list > regarding this issue, but we think it's worth having a ticket tracking it. > We are using Kafka Streams with exactly-once enabled on a Kafka cluster for > a while. > Recently we found that the size of __consumer_offsets partitions grew huge. > Some partition went over 30G. This caused Kafka to take quite long to load > "__consumer_offsets" topic on startup (it loads the topic in order to > become group coordinator). > We dumped the __consumer_offsets segments and found that while normal > offset commits are nicely compacted, transaction records (COMMIT, etc) are > all preserved. Looks like that since these messages don't have a key, the > LogCleaner is keeping them all: > -- > $ bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files > /003484332061.log --key-decoder-class > kafka.serializer.StringDecoder 2>/dev/null | cat -v | head > Dumping 003484332061.log > Starting offset: 3484332061 > offset: 3484332089 position: 549 CreateTime: 1556003706952 isvalid: true > keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 1006 > producerEpoch: 2530 sequence: -1 isTransactional: true headerKeys: [] > endTxnMarker: COMMIT coordinatorEpoch: 81 > offset: 3484332090 position: 627 CreateTime: 1556003706952 isvalid: true > keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 4005 > producerEpoch: 2520 sequence: -1 isTransactional: true headerKeys: [] > endTxnMarker: COMMIT coordinatorEpoch: 84 > ... > -- > Streams is doing transaction commits per 100ms (commit.interval.ms=100 when > exactly-once) so the __consumer_offsets is growing really fast. > Is this (to keep all transactions) by design, or is that a bug for > LogCleaner? What would be the way to clean up the topic? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
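The "cleaning up" of {{cleaner-offset-checkpoint}} referred to here resets the log cleaner's progress marker so the affected partitions are re-cleaned from the beginning. One way to do what the commenter describes, as a hedged sketch (paths and the service name are placeholders, and the file covers every partition in that log directory):

{code:bash}
LOG_DIR=/var/kafka-logs          # placeholder; whatever log.dirs points at

# Stop the broker before touching cleaner state.
sudo systemctl stop kafka        # service name is an assumption

# Move the checkpoint aside; on restart the cleaner has no recorded "last clean offset"
# and re-cleans the compacted partitions in this log dir from their start offsets.
mv "$LOG_DIR/cleaner-offset-checkpoint" "$LOG_DIR/cleaner-offset-checkpoint.bak"

sudo systemctl start kafka
{code}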
[jira] [Comment Edited] (KAFKA-8335) Log cleaner skips Transactional mark and batch record, causing unlimited growth of __consumer_offsets
[ https://issues.apache.org/jira/browse/KAFKA-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1685#comment-1685 ] Francisco Juan edited comment on KAFKA-8335 at 6/18/19 2:51 PM: Hello, we have recently updated a Kafka cluster with this same problem from version 1.1 to version 2.2.1, without updating the inter.broker.protocol.version yet, still set as 1.1. We were expecting this update to reduce the size on some partitions of __consumer_offsets that keep growing. The observed behaviour is that there's still many segments with full of only "isTransactional: true" kind of messages. This is a sample of the kafka-dump-log.sh: {code:java} /usr/kafka/bin/kafka-dump-log.sh --files 004107011120.log --value-decoder-class "kafka.serializer.StringDecoder" | head -n 20 Dumping 004107011120.log Starting offset: 4107011120 baseOffset: 4107011154 lastOffset: 4107011154 count: 1 baseSequence: -1 lastSequence: -1 producerId: 558010 producerEpoch: 0 partitionLeaderEpoch: 490 isTransactional: true isControl: true position: 0 CreateTime: 1556123964832 size: 78 magic: 2 compresscodec: NONE crc: 1007341472 isvalid: true | offset: 4107011154 CreateTime: 1556123964832 keysize: 4 valuesize: 6 sequence: -1 headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 84 baseOffset: 4107011178 lastOffset: 4107011178 count: 1 baseSequence: -1 lastSequence: -1 producerId: 559002 producerEpoch: 0 partitionLeaderEpoch: 490 isTransactional: true isControl: true position: 78 CreateTime: 1556123964895 size: 78 magic: 2 compresscodec: NONE crc: 470005994 isvalid: true | offset: 4107011178 CreateTime: 1556123964895 keysize: 4 valuesize: 6 sequence: -1 headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 84 baseOffset: 4107011180 lastOffset: 4107011180 count: 1 baseSequence: -1 lastSequence: -1 producerId: 559002 producerEpoch: 0 partitionLeaderEpoch: 490 isTransactional: true isControl: true position: 156 CreateTime: 1556123964916 size: 78 magic: 2 compresscodec: NONE crc: 681157535 isvalid: true | offset: 4107011180 CreateTime: 1556123964916 keysize: 4 valuesize: 6 sequence: -1 headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 84{code} This command is executed on Jun 18th. The `offsets.retention.minutes` is set to 40 days. The timestamps shown on this dump are way beyond the retention period. The LogCleaner DEBUG log is next: {code:java} DEBUG Finding range of cleanable offsets for log=__consumer_offsets-6 topicPartition=__consumer_offsets-6. Last clean offset=Some(5006278217) now=1560855479531 => firstDirtyOffset=5006278217 firstUncleanableOffset=5069232666 activeSegment.baseOffset=5069232666 (kafka.log.LogCleanerManager$) {code} Offsets shown on the dump are not on the active segment and are way below the firstUncleanbleOffset was (Author: francisco.juan): Hello, we have recently updated a Kafka cluster with this same problem from version 1.1 to version 2.2.1, without updating the inter.broker.protocol.version yet, still set as 1.1. We were expecting this update to reduce the size on some partitions of __consumer_offsets that keep growing. The observed behaviour is that there's still many segments with full of only "isTransactional: true" kind of. 
This is a sample of the kafka-dump-log.sh: {code:java} /usr/kafka/bin/kafka-dump-log.sh --files 004107011120.log --value-decoder-class "kafka.serializer.StringDecoder" | head -n 20 Dumping 004107011120.log Starting offset: 4107011120 baseOffset: 4107011154 lastOffset: 4107011154 count: 1 baseSequence: -1 lastSequence: -1 producerId: 558010 producerEpoch: 0 partitionLeaderEpoch: 490 isTransactional: true isControl: true position: 0 CreateTime: 1556123964832 size: 78 magic: 2 compresscodec: NONE crc: 1007341472 isvalid: true | offset: 4107011154 CreateTime: 1556123964832 keysize: 4 valuesize: 6 sequence: -1 headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 84 baseOffset: 4107011178 lastOffset: 4107011178 count: 1 baseSequence: -1 lastSequence: -1 producerId: 559002 producerEpoch: 0 partitionLeaderEpoch: 490 isTransactional: true isControl: true position: 78 CreateTime: 1556123964895 size: 78 magic: 2 compresscodec: NONE crc: 470005994 isvalid: true | offset: 4107011178 CreateTime: 1556123964895 keysize: 4 valuesize: 6 sequence: -1 headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 84 baseOffset: 4107011180 lastOffset: 4107011180 count: 1 baseSequence: -1 lastSequence: -1 producerId: 559002 producerEpoch: 0 partitionLeaderEpoch: 490 isTransactional: true isControl: true position: 156 CreateTime: 1556123964916 size: 78 magic: 2 compresscodec: NONE crc: 681157535 isvalid: true | offset: 4107011180 CreateTime: 1556123964916 keysize: 4 valuesize: 6 sequence: -1 headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 84{code} This command is executed on Jun 18th. The `offsets.retention.minutes` is
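The LogCleanerManager line quoted above is logged at DEBUG, which the broker does not emit by default. A hedged sketch of surfacing it, assuming the stock config/log4j.properties that routes the cleaner logger to log-cleaner.log:

{code:bash}
# In config/log4j.properties, raise the cleaner logger and restart the broker:
#   log4j.logger.kafka.log.LogCleaner=DEBUG, cleanerAppender
#   log4j.additivity.kafka.log.LogCleaner=false

# Then watch for the "Finding range of cleanable offsets" lines (path is a placeholder):
tail -f /opt/kafka/logs/log-cleaner.log | grep "cleanable offsets"
{code}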
[jira] [Comment Edited] (KAFKA-8335) Log cleaner skips Transactional mark and batch record, causing unlimited growth of __consumer_offsets
[ https://issues.apache.org/jira/browse/KAFKA-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1685#comment-1685 ] Francisco Juan edited comment on KAFKA-8335 at 6/18/19 2:50 PM: Hello, we have recently updated a Kafka cluster with this same problem from version 1.1 to version 2.2.1, without updating the inter.broker.protocol.version yet, still set as 1.1. We were expecting this update to reduce the size on some partitions of __consumer_offsets that keep growing. The observed behaviour is that there's still many segments with full of only "isTransactional: true" kind of. This is a sample of the kafka-dump-log.sh: {code:java} /usr/kafka/bin/kafka-dump-log.sh --files 004107011120.log --value-decoder-class "kafka.serializer.StringDecoder" | head -n 20 Dumping 004107011120.log Starting offset: 4107011120 baseOffset: 4107011154 lastOffset: 4107011154 count: 1 baseSequence: -1 lastSequence: -1 producerId: 558010 producerEpoch: 0 partitionLeaderEpoch: 490 isTransactional: true isControl: true position: 0 CreateTime: 1556123964832 size: 78 magic: 2 compresscodec: NONE crc: 1007341472 isvalid: true | offset: 4107011154 CreateTime: 1556123964832 keysize: 4 valuesize: 6 sequence: -1 headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 84 baseOffset: 4107011178 lastOffset: 4107011178 count: 1 baseSequence: -1 lastSequence: -1 producerId: 559002 producerEpoch: 0 partitionLeaderEpoch: 490 isTransactional: true isControl: true position: 78 CreateTime: 1556123964895 size: 78 magic: 2 compresscodec: NONE crc: 470005994 isvalid: true | offset: 4107011178 CreateTime: 1556123964895 keysize: 4 valuesize: 6 sequence: -1 headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 84 baseOffset: 4107011180 lastOffset: 4107011180 count: 1 baseSequence: -1 lastSequence: -1 producerId: 559002 producerEpoch: 0 partitionLeaderEpoch: 490 isTransactional: true isControl: true position: 156 CreateTime: 1556123964916 size: 78 magic: 2 compresscodec: NONE crc: 681157535 isvalid: true | offset: 4107011180 CreateTime: 1556123964916 keysize: 4 valuesize: 6 sequence: -1 headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 84{code} This command is executed on Jun 18th. The `offsets.retention.minutes` is set to 40 days. The timestamps shown on this dump are way beyond the retention period. The LogCleaner DEBUG log is next: {code:java} DEBUG Finding range of cleanable offsets for log=__consumer_offsets-6 topicPartition=__consumer_offsets-6. Last clean offset=Some(5006278217) now=1560855479531 => firstDirtyOffset=5006278217 firstUncleanableOffset=5069232666 activeSegment.baseOffset=5069232666 (kafka.log.LogCleanerManager$) {code} Offsets shown on the dump are not on the active segment and are way below the firstUncleanbleOffset was (Author: francisco.juan): Hello, we have recently updated a Kafka cluster with this same problem from version 1.1 to version 2.2.1, without updating the inter.broker.protocol.version yet, still set as 1.1. We were expecting this update to reduce the size on some partitions of __consumer_offsets that keep growing. The observed behaviour is that there's still many segments with full of only "isTransactional: true" kind of. 
This is a sample of the kafka-dump-log.sh output:
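For context on the DEBUG line quoted in the edited comment above, here is a hedged sketch of how the cleaner state it reports can be inspected directly. The log directory (/var/lib/kafka/data), the Kafka install path (/usr/kafka), and partition 6 are placeholders taken from or assumed around this thread, not facts from it.
{code:java}
# The "Last clean offset" is persisted per partition in the cleaner-offset-checkpoint file
# inside each log directory; data lines have the form "topic partition offset".
awk '$1 == "__consumer_offsets" && $2 == 6' /var/lib/kafka/data/cleaner-offset-checkpoint

# Recent cleaner activity for the same partition, as written by the log cleaner's own
# appender in the standard distribution (path assumed under the install directory).
grep "__consumer_offsets-6" /usr/kafka/logs/log-cleaner.log | tail
{code}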
[jira] [Commented] (KAFKA-8335) Log cleaner skips Transactional mark and batch record, causing unlimited growth of __consumer_offsets
[ https://issues.apache.org/jira/browse/KAFKA-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1685#comment-1685 ] Francisco Juan commented on KAFKA-8335: --- Hello, we have recently updated a Kafka cluster that shows this same problem from version 1.1 to version 2.2.1, without updating inter.broker.protocol.version yet (it is still set to 1.1). We were expecting this update to reduce the size of some __consumer_offsets partitions that keep growing. The observed behaviour is that there are still many segments consisting only of "isTransactional: true" control batches.

This is a sample of the kafka-dump-log.sh output:
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
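As a final hedged sketch (not from the thread; the ZooKeeper connect string and data directory are placeholders): one way to confirm that the offsets topic is compacted yet still growing is to check its configuration overrides and rank its partition directories by on-disk size.
{code:java}
# Confirm cleanup.policy=compact and other overrides on the offsets topic
# (the --zookeeper form of kafka-topics.sh is still available on 2.2 brokers).
/usr/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic __consumer_offsets

# Rank the partition directories by size to spot the ones that keep growing.
du -sh /var/lib/kafka/data/__consumer_offsets-* | sort -h | tail
{code}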