junrao commented on code in PR #20285:
URL: https://github.com/apache/kafka/pull/20285#discussion_r2427402158


##########
clients/src/main/java/org/apache/kafka/clients/producer/internals/ProduceRequestResult.java:
##########
@@ -70,7 +75,29 @@ public void done() {
     }
 
     /**
-     * Await the completion of this request
+     * Add a dependent ProduceRequestResult that must complete before this result is considered complete.
+     * This is used when a batch is split into multiple batches - the original batch's result
+     * should not complete until all split batches have completed.

Review Comment:
   >  the original batch's result should not complete until all split batches have completed.
   
   In some cases like flush(), the original batch's result should not complete until all split batches have completed.
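
   For reference, a minimal sketch of the split-batch dependency pattern under discussion (names mirror the PR, but this is an illustration, not the PR's actual code):

   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.concurrent.CountDownLatch;

   public class ProduceRequestResult {
       private final CountDownLatch latch = new CountDownLatch(1);
       // Results of the batches this batch was split into (assumed guarded by `this`).
       private final List<ProduceRequestResult> dependentResults = new ArrayList<>();

       public synchronized void addDependentResult(ProduceRequestResult dependent) {
           dependentResults.add(dependent);
       }

       public void done() {
           latch.countDown();
       }

       // Blocks until this result and, transitively, all dependent results
       // have completed; the latch alone only covers this result.
       public void awaitAllDependents() throws InterruptedException {
           latch.await();
           final List<ProduceRequestResult> snapshot;
           synchronized (this) {
               snapshot = new ArrayList<>(dependentResults);
           }
           for (ProduceRequestResult dependent : snapshot) {
               dependent.awaitAllDependents();
           }
       }
   }
   ```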
   



##########
clients/src/test/java/org/apache/kafka/clients/producer/internals/RecordAccumulatorTest.java:
##########
@@ -1790,4 +1828,65 @@ public void testSplitAndReenqueuePreventInfiniteRecursion() throws InterruptedEx
         // Verify all original records are accounted for (no data loss)
         assertEquals(100, keyFoundMap.size(), "All original 100 records should be present after splitting");
     }
+
+    @Test
+    public void testProduceRequestResultawaitAllDependents() throws Exception {
+        ProduceRequestResult parent = new ProduceRequestResult(tp1);
+
+        // make two dependent ProduceRequestResults -- mimicking split batches
+        ProduceRequestResult dependent1 = new ProduceRequestResult(tp1);
+        ProduceRequestResult dependent2 = new ProduceRequestResult(tp1);
+
+        // add dependents
+        parent.addDependentResult(dependent1);
+        parent.addDependentResult(dependent2);
+
+        parent.set(0L, RecordBatch.NO_TIMESTAMP, null);
+        parent.done();
+
+        // parent.completed() should return true (only checks latch)
+        assertTrue(parent.completed(), "Parent should be completed after done()");
+
+        // awaitAllDependents() should block because dependents are not complete
+        final AtomicBoolean awaitCompleted = new AtomicBoolean(false);
+        final AtomicReference<Exception> awaitException = new AtomicReference<>();
+
+        // to prove awaitAllDependents() is blocking, we run it in a separate thread
+        Thread awaitThread = new Thread(() -> {
+            try {
+                parent.awaitAllDependents();
+                awaitCompleted.set(true);
+            } catch (Exception e) {
+                awaitException.set(e);
+            }
+        });
+        awaitThread.start();
+        Thread.sleep(100);

Review Comment:
   100 ms is too long for a unit test. How about 5 ms?
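
   A sketch of one way to make that bound explicit instead of a bare sleep (assertion wording is illustrative):

   ```java
   // join() with a short timeout; if the thread is still alive afterwards,
   // awaitAllDependents() is still blocked on the incomplete dependents.
   awaitThread.join(5);
   assertTrue(awaitThread.isAlive(), "awaitAllDependents() should still be blocked");
   assertFalse(awaitCompleted.get(), "await should not have completed yet");
   ```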



##########
clients/src/test/java/org/apache/kafka/clients/producer/internals/RecordAccumulatorTest.java:
##########
@@ -1790,4 +1828,65 @@ public void testSplitAndReenqueuePreventInfiniteRecursion() throws InterruptedEx
         // Verify all original records are accounted for (no data loss)
         assertEquals(100, keyFoundMap.size(), "All original 100 records should be present after splitting");
     }
+
+    @Test
+    public void testProduceRequestResultawaitAllDependents() throws Exception {

Review Comment:
   await => Await



##########
clients/src/main/java/org/apache/kafka/clients/producer/internals/ProduceRequestResult.java:
##########
@@ -33,6 +35,7 @@ public class ProduceRequestResult {
 
     private final CountDownLatch latch = new CountDownLatch(1);
     private final TopicPartition topicPartition;
+    private final List<ProduceRequestResult> dependentResults = new ArrayList<>();

Review Comment:
   Could we add a comment to explain what dependentResults is?
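
   For example, something along these lines (wording is only a suggestion):

   ```java
   // Results of the batches this batch was split into after a MESSAGE_TOO_LARGE
   // error. flush() must also wait for these before the original batch's result
   // is treated as fully complete; see awaitAllDependents().
   private final List<ProduceRequestResult> dependentResults = new ArrayList<>();
   ```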



##########
streams/integration-tests/src/test/java/org/apache/kafka/streams/integration/AtLeastOnceDeliveryMessageLossIntegrationTest.java:
##########
@@ -0,0 +1,273 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.kafka.streams.integration;
+
+import org.apache.kafka.clients.admin.Admin;
+import org.apache.kafka.clients.admin.AdminClientConfig;
+import org.apache.kafka.clients.producer.ProducerConfig;
+import org.apache.kafka.clients.producer.internals.Sender;
+import org.apache.kafka.common.TopicPartition;
+import org.apache.kafka.common.serialization.Serdes;
+import org.apache.kafka.common.serialization.StringDeserializer;
+import org.apache.kafka.common.serialization.StringSerializer;
+import org.apache.kafka.common.utils.LogCaptureAppender;
+import org.apache.kafka.streams.KafkaStreams;
+import org.apache.kafka.streams.KeyValue;
+import org.apache.kafka.streams.StreamsBuilder;
+import org.apache.kafka.streams.StreamsConfig;
+import org.apache.kafka.streams.integration.utils.EmbeddedKafkaCluster;
+import org.apache.kafka.streams.integration.utils.IntegrationTestUtils;
+import org.apache.kafka.streams.kstream.KStream;
+import org.apache.kafka.test.TestUtils;
+
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Tag;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.TestInfo;
+import org.junit.jupiter.api.Timeout;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.time.Duration;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import java.util.Properties;
+
+import static org.apache.kafka.common.utils.Utils.mkEntry;
+import static org.apache.kafka.common.utils.Utils.mkMap;
+import static org.apache.kafka.streams.integration.utils.IntegrationTestUtils.DEFAULT_TIMEOUT;
+import static org.apache.kafka.streams.integration.utils.IntegrationTestUtils.cleanStateBeforeTest;
+import static org.apache.kafka.streams.integration.utils.IntegrationTestUtils.purgeLocalStreamsState;
+import static org.apache.kafka.streams.integration.utils.IntegrationTestUtils.startApplicationAndWaitUntilRunning;
+import static org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived;
+import static org.apache.kafka.streams.utils.TestUtils.safeUniqueTestName;
+import static org.apache.kafka.test.TestUtils.consumerConfig;
+import static org.apache.kafka.test.TestUtils.producerConfig;
+import static org.apache.kafka.test.TestUtils.waitForCondition;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+
+@Timeout(600)
+@Tag("integration")
+public class AtLeastOnceDeliveryMessageLossIntegrationTest {
+    private static final Logger log = LoggerFactory.getLogger(
+        AtLeastOnceDeliveryMessageLossIntegrationTest.class);
+    
+    private static final int NUM_BROKERS = 1;
+    private static final int LARGE_RECORD_COUNT = 50000;
+    private static final int SMALL_RECORD_COUNT = 40000;
+    
+    public static final EmbeddedKafkaCluster CLUSTER = new EmbeddedKafkaCluster(NUM_BROKERS);
+    
+    @BeforeAll
+    public static void startCluster() throws IOException {
+        CLUSTER.start();
+    }
+    
+    @AfterAll
+    public static void closeCluster() {
+        CLUSTER.stop();
+    }
+    
+    private String applicationId;
+    private String inputTopic;
+    private String outputTopic;
+    private Properties streamsConfiguration;
+    private KafkaStreams kafkaStreams;
+    
+    @BeforeEach
+    public void setUp(final TestInfo testInfo) throws Exception {
+        final String testId = safeUniqueTestName(testInfo);
+        applicationId = "app-" + testId;
+        inputTopic = "input-" + testId;
+        outputTopic = "output-" + testId;
+        
+        cleanStateBeforeTest(CLUSTER, inputTopic, outputTopic);
+        CLUSTER.createTopics(inputTopic, outputTopic);
+        
+        setupStreamsConfiguration();
+    }
+    
+    @AfterEach
+    public void cleanUp() throws Exception {
+        if (kafkaStreams != null) {
+            kafkaStreams.close();
+        }
+        if (streamsConfiguration != null) {
+            purgeLocalStreamsState(streamsConfiguration);
+        }
+    }
+
+    // failing test
+    @Test
+    public void shouldNotCommitOffsetsAndNotProduceOutputRecordsWhenProducerFailsWithMessageTooLarge() throws Exception {
+        
+        try (final LogCaptureAppender appender = LogCaptureAppender.createAndRegister(Sender.class)) {
+            produceInputData(LARGE_RECORD_COUNT);

Review Comment:
   produceInputData() doesn't produce large records. 
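
   If the intent is for the downstream produce to hit MESSAGE_TOO_LARGE, the helper could pad the values, e.g. (a sketch; the padding size is an assumption and depends on the topic/broker limits configured for the test):

   ```java
   private void produceInputData(final int recordCount) throws Exception {
       final String padding = "x".repeat(10_000); // ~10 KB per value (illustrative)
       final List<KeyValue<String, String>> records = new ArrayList<>(recordCount);
       for (int i = 0; i < recordCount; i++) {
           records.add(KeyValue.pair("key-" + i, "value-" + i + padding));
       }
       IntegrationTestUtils.produceKeyValuesSynchronously(
           inputTopic,
           records,
           producerConfig(CLUSTER.bootstrapServers(), StringSerializer.class, StringSerializer.class),
           CLUSTER.time
       );
   }
   ```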



##########
clients/src/main/java/org/apache/kafka/clients/producer/internals/ProduceRequestResult.java:
##########
@@ -70,7 +75,29 @@ public void done() {
     }
 
     /**
-     * Await the completion of this request
+     * Add a dependent ProduceRequestResult that must complete before this result is considered complete.
+     * This is used when a batch is split into multiple batches - the original batch's result
+     * should not complete until all split batches have completed.
+     *
+     * @param dependentResult The dependent result to wait for
+     */
+    public void addDependentResult(ProduceRequestResult dependentResult) {

Review Comment:
   addDependentResult => addDependent ?



##########
streams/integration-tests/src/test/java/org/apache/kafka/streams/integration/AtLeastOnceDeliveryMessageLossIntegrationTest.java:
##########
@@ -0,0 +1,273 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.kafka.streams.integration;
+
+import org.apache.kafka.clients.admin.Admin;
+import org.apache.kafka.clients.admin.AdminClientConfig;
+import org.apache.kafka.clients.producer.ProducerConfig;
+import org.apache.kafka.clients.producer.internals.Sender;
+import org.apache.kafka.common.TopicPartition;
+import org.apache.kafka.common.serialization.Serdes;
+import org.apache.kafka.common.serialization.StringDeserializer;
+import org.apache.kafka.common.serialization.StringSerializer;
+import org.apache.kafka.common.utils.LogCaptureAppender;
+import org.apache.kafka.streams.KafkaStreams;
+import org.apache.kafka.streams.KeyValue;
+import org.apache.kafka.streams.StreamsBuilder;
+import org.apache.kafka.streams.StreamsConfig;
+import org.apache.kafka.streams.integration.utils.EmbeddedKafkaCluster;
+import org.apache.kafka.streams.integration.utils.IntegrationTestUtils;
+import org.apache.kafka.streams.kstream.KStream;
+import org.apache.kafka.test.TestUtils;
+
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Tag;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.TestInfo;
+import org.junit.jupiter.api.Timeout;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.time.Duration;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import java.util.Properties;
+
+import static org.apache.kafka.common.utils.Utils.mkEntry;
+import static org.apache.kafka.common.utils.Utils.mkMap;
+import static org.apache.kafka.streams.integration.utils.IntegrationTestUtils.DEFAULT_TIMEOUT;
+import static org.apache.kafka.streams.integration.utils.IntegrationTestUtils.cleanStateBeforeTest;
+import static org.apache.kafka.streams.integration.utils.IntegrationTestUtils.purgeLocalStreamsState;
+import static org.apache.kafka.streams.integration.utils.IntegrationTestUtils.startApplicationAndWaitUntilRunning;
+import static org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived;
+import static org.apache.kafka.streams.utils.TestUtils.safeUniqueTestName;
+import static org.apache.kafka.test.TestUtils.consumerConfig;
+import static org.apache.kafka.test.TestUtils.producerConfig;
+import static org.apache.kafka.test.TestUtils.waitForCondition;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+
+@Timeout(600)
+@Tag("integration")
+public class AtLeastOnceDeliveryMessageLossIntegrationTest {
+    private static final Logger log = LoggerFactory.getLogger(
+        AtLeastOnceDeliveryMessageLossIntegrationTest.class);
+    
+    private static final int NUM_BROKERS = 1;
+    private static final int LARGE_RECORD_COUNT = 50000;
+    private static final int SMALL_RECORD_COUNT = 40000;
+    
+    public static final EmbeddedKafkaCluster CLUSTER = new EmbeddedKafkaCluster(NUM_BROKERS);
+    
+    @BeforeAll
+    public static void startCluster() throws IOException {
+        CLUSTER.start();
+    }
+    
+    @AfterAll
+    public static void closeCluster() {
+        CLUSTER.stop();
+    }
+    
+    private String applicationId;
+    private String inputTopic;
+    private String outputTopic;
+    private Properties streamsConfiguration;
+    private KafkaStreams kafkaStreams;
+    
+    @BeforeEach
+    public void setUp(final TestInfo testInfo) throws Exception {
+        final String testId = safeUniqueTestName(testInfo);
+        applicationId = "app-" + testId;
+        inputTopic = "input-" + testId;
+        outputTopic = "output-" + testId;
+        
+        cleanStateBeforeTest(CLUSTER, inputTopic, outputTopic);
+        CLUSTER.createTopics(inputTopic, outputTopic);
+        
+        setupStreamsConfiguration();
+    }
+    
+    @AfterEach
+    public void cleanUp() throws Exception {
+        if (kafkaStreams != null) {
+            kafkaStreams.close();
+        }
+        if (streamsConfiguration != null) {
+            purgeLocalStreamsState(streamsConfiguration);
+        }
+    }
+
+    // failing test
+    @Test
+    public void shouldNotCommitOffsetsAndNotProduceOutputRecordsWhenProducerFailsWithMessageTooLarge() throws Exception {
+        
+        try (final LogCaptureAppender appender = LogCaptureAppender.createAndRegister(Sender.class)) {
+            produceInputData(LARGE_RECORD_COUNT);
+
+            kafkaStreams = createStreamsApplication();

Review Comment:
   Where does the stream app create large records?
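
   For illustration, if the oversized records are meant to come from the topology itself, createStreamsApplication() could inflate the values on the way to the output topic (a sketch; the inflation factor is an assumption):

   ```java
   final StreamsBuilder builder = new StreamsBuilder();
   final KStream<String, String> input = builder.stream(inputTopic);
   // Blow up each value so the producer's batches exceed the broker's limit.
   input.mapValues(value -> value.repeat(100))
        .to(outputTopic);
   ```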



##########
streams/integration-tests/src/test/java/org/apache/kafka/streams/integration/AtLeastOnceDeliveryMessageLossIntegrationTest.java:
##########
@@ -0,0 +1,273 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.kafka.streams.integration;
+
+import org.apache.kafka.clients.admin.Admin;
+import org.apache.kafka.clients.admin.AdminClientConfig;
+import org.apache.kafka.clients.producer.ProducerConfig;
+import org.apache.kafka.clients.producer.internals.Sender;
+import org.apache.kafka.common.TopicPartition;
+import org.apache.kafka.common.serialization.Serdes;
+import org.apache.kafka.common.serialization.StringDeserializer;
+import org.apache.kafka.common.serialization.StringSerializer;
+import org.apache.kafka.common.utils.LogCaptureAppender;
+import org.apache.kafka.streams.KafkaStreams;
+import org.apache.kafka.streams.KeyValue;
+import org.apache.kafka.streams.StreamsBuilder;
+import org.apache.kafka.streams.StreamsConfig;
+import org.apache.kafka.streams.integration.utils.EmbeddedKafkaCluster;
+import org.apache.kafka.streams.integration.utils.IntegrationTestUtils;
+import org.apache.kafka.streams.kstream.KStream;
+import org.apache.kafka.test.TestUtils;
+
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Tag;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.TestInfo;
+import org.junit.jupiter.api.Timeout;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.time.Duration;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import java.util.Properties;
+
+import static org.apache.kafka.common.utils.Utils.mkEntry;
+import static org.apache.kafka.common.utils.Utils.mkMap;
+import static org.apache.kafka.streams.integration.utils.IntegrationTestUtils.DEFAULT_TIMEOUT;
+import static org.apache.kafka.streams.integration.utils.IntegrationTestUtils.cleanStateBeforeTest;
+import static org.apache.kafka.streams.integration.utils.IntegrationTestUtils.purgeLocalStreamsState;
+import static org.apache.kafka.streams.integration.utils.IntegrationTestUtils.startApplicationAndWaitUntilRunning;
+import static org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived;
+import static org.apache.kafka.streams.utils.TestUtils.safeUniqueTestName;
+import static org.apache.kafka.test.TestUtils.consumerConfig;
+import static org.apache.kafka.test.TestUtils.producerConfig;
+import static org.apache.kafka.test.TestUtils.waitForCondition;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+
+@Timeout(600)
+@Tag("integration")
+public class AtLeastOnceDeliveryMessageLossIntegrationTest {
+    private static final Logger log = LoggerFactory.getLogger(
+        AtLeastOnceDeliveryMessageLossIntegrationTest.class);
+    
+    private static final int NUM_BROKERS = 1;
+    private static final int LARGE_RECORD_COUNT = 50000;
+    private static final int SMALL_RECORD_COUNT = 40000;
+    
+    public static final EmbeddedKafkaCluster CLUSTER = new EmbeddedKafkaCluster(NUM_BROKERS);
+    
+    @BeforeAll
+    public static void startCluster() throws IOException {
+        CLUSTER.start();
+    }
+    
+    @AfterAll
+    public static void closeCluster() {
+        CLUSTER.stop();
+    }
+    
+    private String applicationId;
+    private String inputTopic;
+    private String outputTopic;
+    private Properties streamsConfiguration;
+    private KafkaStreams kafkaStreams;
+    
+    @BeforeEach
+    public void setUp(final TestInfo testInfo) throws Exception {
+        final String testId = safeUniqueTestName(testInfo);
+        applicationId = "app-" + testId;
+        inputTopic = "input-" + testId;
+        outputTopic = "output-" + testId;
+        
+        cleanStateBeforeTest(CLUSTER, inputTopic, outputTopic);
+        CLUSTER.createTopics(inputTopic, outputTopic);
+        
+        setupStreamsConfiguration();
+    }
+    
+    @AfterEach
+    public void cleanUp() throws Exception {
+        if (kafkaStreams != null) {
+            kafkaStreams.close();
+        }
+        if (streamsConfiguration != null) {
+            purgeLocalStreamsState(streamsConfiguration);
+        }
+    }
+
+    // failing test
+    @Test
+    public void shouldNotCommitOffsetsAndNotProduceOutputRecordsWhenProducerFailsWithMessageTooLarge() throws Exception {
+        
+        try (final LogCaptureAppender appender = LogCaptureAppender.createAndRegister(Sender.class)) {
+            produceInputData(LARGE_RECORD_COUNT);
+
+            kafkaStreams = createStreamsApplication();
+            startApplicationAndWaitUntilRunning(Collections.singletonList(kafkaStreams), Duration.ofMillis(DEFAULT_TIMEOUT));

Review Comment:
   How long does this test run? We want to use a reasonable timeout to avoid long-running tests.
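
   For example, a tighter explicit bound instead of DEFAULT_TIMEOUT (value is illustrative):

   ```java
   startApplicationAndWaitUntilRunning(Collections.singletonList(kafkaStreams), Duration.ofSeconds(30));
   ```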



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
