scwhittle commented on code in PR #32456:
URL: https://github.com/apache/beam/pull/32456#discussion_r1760688103
##########
sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ReadFromKafkaDoFn.java:
##########
@@ -461,6 +461,11 @@ public ProcessContinuation processElement(
return ProcessContinuation.resume();
}
for (ConsumerRecord<byte[], byte[]> rawRecord : rawRecords) {
+ // If the Kafka consumer returns a record with an offset that is already processed
Review Comment:
Maybe add a comment noting that the seek above may not be entirely successful, and that this check is therefore needed?
##########
sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ReadFromKafkaDoFn.java:
##########
@@ -461,6 +461,11 @@ public ProcessContinuation processElement(
return ProcessContinuation.resume();
}
for (ConsumerRecord<byte[], byte[]> rawRecord : rawRecords) {
+ // If the Kafka consumer returns a record with an offset that is already processed
+ // the record can be safely skipped.
+ if (rawRecord.offset() < startOffset) {
+ continue;
Review Comment:
I'm concerned that if there is some case where seek stops working in general, this will silently scan through a large number of records and be hard to diagnose. Should we count and log how many records were skipped due to the seek failing? Alternatively, we could handle it differently, e.g. by attempting the seek again, or by returning ProcessContinuation.resume() to retry the seek later.
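As a rough illustration of the counting suggestion above, here is a minimal, self-contained sketch (not the PR's actual code; `Rec`, `countSkipped`, and the plain counter stand in for Kafka's `ConsumerRecord` and Beam's metrics API):

```java
import java.util.List;

public class SkipStaleOffsets {
    // Hypothetical stand-in for Kafka's ConsumerRecord<byte[], byte[]>.
    record Rec(long offset) {}

    // Counts records whose offset precedes startOffset, i.e. records the
    // earlier seek should have skipped past. The caller can log this count
    // (or bump a metric) and, if it is suspiciously large, choose to retry
    // the seek instead of continuing to scan.
    static long countSkipped(List<Rec> rawRecords, long startOffset) {
        long skipped = 0;
        for (Rec r : rawRecords) {
            if (r.offset() < startOffset) {
                skipped++; // already processed; safe to skip
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        List<Rec> batch =
            List.of(new Rec(5), new Rec(6), new Rec(10), new Rec(11));
        // Offsets 5 and 6 precede startOffset 10, so two records are skipped.
        System.out.println(countSkipped(batch, 10));
    }
}
```

In the real DoFn the count would more likely go to a Beam `Counter` metric plus a rate-limited log line, so that a seek that has stopped working shows up in monitoring rather than as silent extra scanning.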
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]