AHeise opened a new pull request, #192:
URL: https://github.com/apache/flink-connector-kafka/pull/192

   KafkaEnumerator's state contains only the TopicPartitions but not their 
offsets, so, contrary to the design intent, it does not contain the full 
split state.
   
   This approach has several issues. It implicitly assumes that all splits 
are assigned to readers before the first checkpoint. Otherwise, on recovery 
from such a checkpoint, the enumerator invokes the offset initializer again, 
which leads to inconsistencies (LATEST may be resolved during the first 
attempt for some partitions and during the second attempt for others).
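
   To make the inconsistency concrete, here is a minimal, self-contained 
sketch. It is a toy model, not the connector's code: the class, the method 
names, the partition names "t-0"/"t-1", and the offsets are all assumptions 
chosen for illustration, and a plain map stands in for the broker's log end 
offsets.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model: these are NOT the connector's classes, just plain maps.
class PartitionOnlyStateSketch {

    // Stand-in for a LATEST offset initializer: the log end offset at call time.
    static long resolveLatest(Map<String, Long> logEndOffsets, String partition) {
        return logEndOffsets.get(partition);
    }

    // Returns {start offset of t-0, start offset of t-1} across two attempts.
    static long[] startOffsetsAcrossAttempts() {
        Map<String, Long> logEndOffsets = new HashMap<>();
        logEndOffsets.put("t-0", 100L);
        logEndOffsets.put("t-1", 100L);

        // Attempt 1: t-0 is initialized and assigned; t-1 is still pending.
        long startT0 = resolveLatest(logEndOffsets, "t-0");
        // A checkpoint taken here stores only the partition name "t-1".

        // Producers append 50 records to each partition, then the job fails.
        logEndOffsets.put("t-0", 150L);
        logEndOffsets.put("t-1", 150L);

        // Attempt 2: the restored state has no offset for t-1, so the
        // initializer runs again and sees the newer log end offset.
        long startT1 = resolveLatest(logEndOffsets, "t-1");
        return new long[] {startT0, startT1};
    }

    public static void main(String[] args) {
        long[] offsets = startOffsetsAcrossAttempts();
        System.out.println("t-0 starts at " + offsets[0] + ", t-1 starts at " + offsets[1]);
    }
}
```

   In this model the 50 records appended to t-1 between the two attempts are 
skipped, while the records appended to t-0 in the same window are read.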
   
   Through the addSplitsBack callback, the same scenarios can also occur 
later for BATCH, which leads to duplicate rows (for EARLIEST or 
SPECIFIC-OFFSETS) or data loss (for LATEST). Finally, KafkaSource cannot be 
used safely as part of a HybridSource because the offset initializer cannot 
even be recreated on recovery.
   
   All cases are solved by also retaining the offset in the enumerator state. 
To that end, this commit merges the async discovery phases to immediately 
initialize the splits from the partitions. Any subsequent checkpoint will 
contain the proper start offset.
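
   The idea of retaining the offset can be sketched as follows. This is 
only an illustration under stated assumptions, not the implementation in 
this commit: OffsetRetainingStateSketch, its methods, and the map-based 
state are hypothetical, and a ToLongFunction stands in for the offset 
initializer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

// Hypothetical enumerator whose state maps each unassigned partition to
// its already-resolved start offset (toy model, not the connector's code).
class OffsetRetainingStateSketch {

    private final Map<String, Long> unassignedSplits = new LinkedHashMap<>();

    // Discovery resolves the start offset immediately, exactly once per partition.
    void discover(String partition, ToLongFunction<String> offsetInitializer) {
        if (!unassignedSplits.containsKey(partition)) {
            unassignedSplits.put(partition, offsetInitializer.applyAsLong(partition));
        }
    }

    // The checkpoint contains the offsets, not just the partition names.
    Map<String, Long> snapshotState() {
        return new LinkedHashMap<>(unassignedSplits);
    }

    // Recovery needs only the checkpointed state, no offset initializer.
    static OffsetRetainingStateSketch restore(Map<String, Long> checkpoint) {
        OffsetRetainingStateSketch enumerator = new OffsetRetainingStateSketch();
        enumerator.unassignedSplits.putAll(checkpoint);
        return enumerator;
    }

    // Assumes the partition was discovered before; throws otherwise.
    long startOffset(String partition) {
        return unassignedSplits.get(partition);
    }

    // Simulates discovery, checkpoint, new records, failure, and recovery.
    static long recoveredStartOffset() {
        long[] logEndOffset = {100L};
        ToLongFunction<String> latest = partition -> logEndOffset[0];

        OffsetRetainingStateSketch firstAttempt = new OffsetRetainingStateSketch();
        firstAttempt.discover("t-1", latest); // resolves LATEST to 100
        Map<String, Long> checkpoint = firstAttempt.snapshotState();

        logEndOffset[0] = 150L; // producers append records, then the job fails

        OffsetRetainingStateSketch recovered = restore(checkpoint);
        recovered.discover("t-1", latest); // no-op: offset is already in state
        return recovered.startOffset("t-1");
    }

    public static void main(String[] args) {
        System.out.println("t-1 starts at " + recoveredStartOffset());
    }
}
```

   Because the offset travels with the checkpoint, the recovered start 
offset in this model stays at 100 even though the log end offset has moved 
to 150, and recovery never has to call the initializer.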


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

