deepix opened a new issue, #21742:
URL: https://github.com/apache/beam/issues/21742
### What needs to happen?
Under the following conditions, Beam can lose data read from Kafka, i.e. it will mark offsets as processed even though the pipeline has not actually processed them:
* Checkpointing is enabled: `checkpointing_interval` is set for the Flink runner.
* Kafka consumer config: `enable.auto.commit` set to `true` (default) and `auto.offset.reset` set to `latest` (default).
* `ReadFromKafka()`: `commit_offset_in_finalize` set to `false` (default).
* No checkpoint ever succeeds (i.e. every checkpoint times out).
This can be fixed with the following configuration (see the sketch below):
* `commit_offset_in_finalize` set to `true` in `ReadFromKafka()`,
* `enable.auto.commit` set to `false` in the Kafka consumer config,
* `auto.offset.reset` set to `none` in the Kafka consumer config.
Therefore, when `checkpointing_interval` is set and the user calls `ReadFromKafka()`, we should check these settings and warn about possible data loss if any of the three config pieces above is incorrect. All three must be set correctly.
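For reference, a minimal sketch of the safe configuration in the Python SDK, assuming the Flink runner with checkpointing enabled; the broker address, topic, and group id are placeholders:

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=FlinkRunner',
    '--streaming',
    '--checkpointing_interval=60000',  # checkpointing enabled, in ms
])

with beam.Pipeline(options=options) as pipeline:
    _ = (
        pipeline
        | 'ReadFromKafka' >> ReadFromKafka(
            consumer_config={
                'bootstrap.servers': 'localhost:9092',
                'group.id': 'my-consumer-group',
                # Let Beam, not the Kafka client, commit offsets:
                'enable.auto.commit': 'false',
                # Fail fast when no committed offset exists, instead of
                # silently skipping to the latest offset:
                'auto.offset.reset': 'none',
            },
            topics=['my-topic'],
            # Commit offsets only when a checkpoint is finalized:
            commit_offset_in_finalize=True,
        )
        | 'DebugPrint' >> beam.Map(print)
    )
```

With this combination, offsets advance only on successful checkpoint finalization, so a run with no successful checkpoints resumes from the last committed offsets rather than skipping ahead.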
More context in this mailing list thread:
https://lists.apache.org/thread/r12sqodn9xojr3fhlkgjyq6sx0phyh7n
### Issue Priority
Priority: 2
### Issue Component
Component: io-java-kafka