scwhittle commented on code in PR #32456:
URL: https://github.com/apache/beam/pull/32456#discussion_r1760688103
##########
sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ReadFromKafkaDoFn.java:
##########
@@ -461,6 +461,11 @@ public ProcessContinuation processElement(
return ProcessContinuation.resume();
}
for (ConsumerRecord<byte[], byte[]> rawRecord : rawRecords) {
+ // If the Kafka consumer returns a record with an offset that is already processed
Review Comment:
Maybe add a comment noting that the seek above may not be entirely successful, and that this check is therefore needed?
##########
sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ReadFromKafkaDoFn.java:
##########
@@ -461,6 +461,11 @@ public ProcessContinuation processElement(
return ProcessContinuation.resume();
}
for (ConsumerRecord<byte[], byte[]> rawRecord : rawRecords) {
+ // If the Kafka consumer returns a record with an offset that is already processed
+ // the record can be safely skipped.
+ if (rawRecord.offset() < startOffset) {
+ continue;
Review Comment:
I'm concerned that if there is some case where seek stops working in general, this will silently scan through a large number of records and be hard to diagnose. Should we count and log how many records were skipped due to the seek failing? Alternatively, we could handle it differently, e.g. by attempting the seek again, or by returning ProcessContinuation.resume() to retry the seek later.
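As a rough illustration of the counting suggestion above, here is a minimal, self-contained sketch (not the PR's actual code; `Rec`, `countSkipped`, and the plain counter stand in for Kafka's `ConsumerRecord` and Beam's metrics API):

```java
import java.util.List;

public class SkipStaleOffsets {
    // Hypothetical stand-in for Kafka's ConsumerRecord<byte[], byte[]>.
    record Rec(long offset) {}

    // Counts records whose offset precedes startOffset, i.e. records the
    // earlier seek should have skipped past. The caller can log this count
    // (or bump a metric) and, if it is suspiciously large, choose to retry
    // the seek instead of continuing to scan.
    static long countSkipped(List<Rec> rawRecords, long startOffset) {
        long skipped = 0;
        for (Rec r : rawRecords) {
            if (r.offset() < startOffset) {
                skipped++; // already processed; safe to skip
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        List<Rec> batch =
            List.of(new Rec(5), new Rec(6), new Rec(10), new Rec(11));
        // Offsets 5 and 6 precede startOffset 10, so two records are skipped.
        System.out.println(countSkipped(batch, 10));
    }
}
```

In the real DoFn the count would more likely go to a Beam `Counter` metric plus a rate-limited log line, so that a seek that has stopped working shows up in monitoring rather than as silent extra scanning.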
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]