Dippatel98 commented on code in PR #32456:
URL: https://github.com/apache/beam/pull/32456#discussion_r1761444351
##########
sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ReadFromKafkaDoFn.java:
##########
@@ -461,6 +461,11 @@ public ProcessContinuation processElement(
return ProcessContinuation.resume();
}
for (ConsumerRecord<byte[], byte[]> rawRecord : rawRecords) {
+ // If the Kafka consumer returns a record with an offset that is already processed
+ // the record can be safely skipped.
+ if (rawRecord.offset() < startOffset) {
+ continue;
Review Comment:
Yeah, that makes sense. But I am not very familiar with
`ProcessContinuation.resume()`. From the docs, it seems that the DoFn will
begin processing the same element again. Does that not cause an infinite loop,
since the element has already been processed?
Maybe if processing restarts from the beginning of the DoFn and we seek
again, we can get a different result.
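To make the question concrete, here is a minimal pure-Java sketch of how I understand resume semantics. It assumes the runner re-invokes the DoFn with the same element but only the residual restriction, i.e. the part of the offset range that has not been claimed yet. `ResumeSketch`, `Range`, `Result`, `processElement`, and `run` are hypothetical stand-ins for illustration, not Beam classes.

```java
import java.util.ArrayList;
import java.util.List;

class ResumeSketch {
    enum Continuation { STOP, RESUME }

    // Hypothetical stand-in for an offset-range restriction [from, to).
    static final class Range {
        long from;
        final long to;
        Range(long from, long to) { this.from = from; this.to = to; }
    }

    // Collects what the simulated DoFn emitted and how often it was invoked.
    static final class Result {
        final List<Long> emitted = new ArrayList<>();
        int invocations = 0;
    }

    // Emits at most batchSize offsets per call, then asks to be resumed
    // if part of the range is still unclaimed.
    static Continuation processElement(Range r, int batchSize, Result out) {
        out.invocations++;
        for (int i = 0; i < batchSize && r.from < r.to; i++) {
            out.emitted.add(r.from);
            r.from++; // claiming an offset shrinks the residual restriction
        }
        return r.from < r.to ? Continuation.RESUME : Continuation.STOP;
    }

    // Simulated runner: re-invokes with the same element and the residual range.
    static Result run(long from, long to, int batchSize) {
        Range residual = new Range(from, to);
        Result out = new Result();
        while (processElement(residual, batchSize, out) == Continuation.RESUME) {
            // A real runner would schedule the residual for a later invocation.
        }
        return out;
    }

    public static void main(String[] args) {
        Result r = run(0, 5, 2);
        System.out.println(r.emitted + " in " + r.invocations + " invocations");
    }
}
```

Under this assumption the loop terminates as long as each invocation claims at least one offset. If an invocation claims nothing, the residual is unchanged and the same work is retried, which is the case where the infinite-loop concern would apply.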
##########
sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ReadFromKafkaDoFn.java:
##########
@@ -461,6 +461,11 @@ public ProcessContinuation processElement(
return ProcessContinuation.resume();
}
for (ConsumerRecord<byte[], byte[]> rawRecord : rawRecords) {
+ // If the Kafka consumer returns a record with an offset that is already processed
+ // the record can be safely skipped.
+ if (rawRecord.offset() < startOffset) {
+ continue;
Review Comment:
Yeah, that makes sense. I am not very familiar with how
`ProcessContinuation` works; the docs say that it resumes processing the DoFn at
the same element. Would that not result in an infinite loop, since we have
already processed that element?
Or does processing restart from the beginning of the DoFn, which would mean
we seek again and maybe get the records that we need this time? If that is the
case, does it make sense to resume with a delay? Also, instead of doing this
here, we could do it immediately after the seek by checking the position.
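A rough sketch of the two options, using plain Java rather than the Kafka client so it can run on its own. `StaleOffsetSketch`, `skipAlreadyProcessed`, and `seekTookEffect` are hypothetical names. The second helper assumes that the position reported right after a seek should equal the requested offset.

```java
import java.util.ArrayList;
import java.util.List;

class StaleOffsetSketch {
    // Option in this PR: drop polled offsets below startOffset inside the loop.
    static List<Long> skipAlreadyProcessed(List<Long> polledOffsets, long startOffset) {
        List<Long> kept = new ArrayList<>();
        for (long offset : polledOffsets) {
            if (offset < startOffset) {
                continue; // already processed by an earlier invocation
            }
            kept.add(offset);
        }
        return kept;
    }

    // Alternative: check the consumer position once, immediately after the seek.
    // Assumption: a successful seek leaves the position at the requested offset.
    static boolean seekTookEffect(long positionAfterSeek, long startOffset) {
        return positionAfterSeek == startOffset;
    }

    public static void main(String[] args) {
        List<Long> polled = new ArrayList<>();
        for (long o = 3; o <= 6; o++) {
            polled.add(o);
        }
        System.out.println(skipAlreadyProcessed(polled, 5L));
        System.out.println(seekTookEffect(3L, 5L));
    }
}
```

The per-record filter tolerates stale records on every poll, while the position check would catch the problem once, before any records are read, and could be paired with a delayed resume.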