David Arthur created KAFKA-13050:
------------------------------------

             Summary: Race between controller creating snapshot and snapshot cleaning
                 Key: KAFKA-13050
                 URL: https://issues.apache.org/jira/browse/KAFKA-13050
             Project: Kafka
          Issue Type: Bug
          Components: controller, kraft
    Affects Versions: 3.0.0
            Reporter: David Arthur

If the controller attempts to take a snapshot with its cached OffsetAndEpoch while snapshot cleaning is happening, it is possible for the OffsetAndEpoch to be invalidated due to truncation.

{code}
[2021-07-08 12:12:41,938] WARN [Controller 1] org.apache.kafka.controller.QuorumController@67e0d836: failed with unknown server exception IllegalArgumentException at epoch -1 in 3207460 us. Reverting to last committed offset 98. (org.apache.kafka.controller.QuorumController)
java.lang.IllegalArgumentException: Snapshot id (OffsetAndEpoch(offset=99, epoch=5)) is not valid according to the log: ValidOffsetAndEpoch(kind=SNAPSHOT, offsetAndEpoch=OffsetAndEpoch(offset=180, epoch=8))
	at kafka.raft.KafkaMetadataLog.createNewSnapshot(KafkaMetadataLog.scala:252)
	at org.apache.kafka.raft.KafkaRaftClient.lambda$createSnapshot$30(KafkaRaftClient.java:2334)
	at org.apache.kafka.snapshot.SnapshotWriter.createWithHeader(SnapshotWriter.java:134)
	at org.apache.kafka.raft.KafkaRaftClient.createSnapshot(KafkaRaftClient.java:2333)
	at org.apache.kafka.controller.QuorumController$SnapshotGeneratorManager.createSnapshotGenerator(QuorumController.java:351)
	at org.apache.kafka.controller.QuorumController.checkSnapshotGeneration(QuorumController.java:904)
	at org.apache.kafka.controller.QuorumController.access$3000(QuorumController.java:121)
	at org.apache.kafka.controller.QuorumController$QuorumMetaLogListener.lambda$handleCommit$0(QuorumController.java:681)
	at org.apache.kafka.controller.QuorumController$ControlEvent.run(QuorumController.java:311)
	at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:121)
	at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
	at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
	at java.lang.Thread.run(Thread.java:748)
[2021-07-08 12:12:41,941] INFO [BrokerMetadataListener id=1] Loading snapshot 180-8. (kafka.server.metadata.BrokerMetadataListener)
{code}

This was observed while running a broker in combined mode with artificially low values for snapshot generation and cleaning.

{code}
metadata.log.max.record.bytes.between.snapshots=100
metadata.log.segment.bytes=1024
metadata.max.retention.bytes=4096
{code}
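
To make the interleaving easier to reason about, here is a toy model of the race using the offsets from the log above. It is only a sketch: ToyMetadataLog, cleanUpTo, createSnapshot and tryCreateSnapshot are hypothetical names, not the KafkaMetadataLog or RaftClient API, and the "checked" path is just one possible mitigation, not necessarily the fix that should be adopted.

```java
/**
 * Toy model of the race described in this issue. All names are hypothetical;
 * this is not Kafka's KafkaMetadataLog or RaftClient API.
 */
final class ToyMetadataLog {
    // Offset and epoch of the latest snapshot; cleaning drops log data below it.
    private long snapshotOffset = 0L;
    private int snapshotEpoch = 0;

    /** Simulates snapshot cleaning advancing the start of the log. */
    synchronized void cleanUpTo(long offset, int epoch) {
        if (offset > snapshotOffset) {
            snapshotOffset = offset;
            snapshotEpoch = epoch;
        }
    }

    /** Mirrors the failure in the trace: rejects a snapshot id older than the log start. */
    synchronized void createSnapshot(long offset, int epoch) {
        if (offset < snapshotOffset) {
            throw new IllegalArgumentException(
                "Snapshot id (offset=" + offset + ", epoch=" + epoch
                    + ") is not valid according to the log: snapshot at offset="
                    + snapshotOffset + ", epoch=" + snapshotEpoch);
        }
        // A real implementation would open a snapshot writer here.
    }

    /** Assumed mitigation: treat a stale id as "nothing to do" instead of an error. */
    synchronized boolean tryCreateSnapshot(long offset, int epoch) {
        if (offset < snapshotOffset) {
            return false; // the cached id was overtaken by cleaning; skip this round
        }
        createSnapshot(offset, epoch);
        return true;
    }

    public static void main(String[] args) {
        ToyMetadataLog log = new ToyMetadataLog();
        long cachedOffset = 99L;  // the controller caches the id it intends to snapshot at
        int cachedEpoch = 5;
        log.cleanUpTo(180L, 8);   // cleaning runs before the cached id is used

        try {
            log.createSnapshot(cachedOffset, cachedEpoch);
        } catch (IllegalArgumentException e) {
            System.out.println("Unchecked path failed: " + e.getMessage());
        }
        System.out.println("Checked path created snapshot: "
            + log.tryCreateSnapshot(cachedOffset, cachedEpoch));
    }
}
```

In this model the unchecked call fails the same way as the trace (id 99-5 against a log whose snapshot is 180-8), while the checked call skips the stale id so the controller can retry with a fresh offset on a later commit.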