gguptp opened a new pull request, #222: URL: https://github.com/apache/flink-connector-aws/pull/222
## Purpose of the change

[[FLINK-37627](https://issues.apache.org/jira/browse/FLINK-37627)][BugFix][Connectors/Kinesis] Restarting from a checkpoint/savepoint which coincides with a shard split causes data loss

This PR updates the following PR: https://github.com/apache/flink-connector-aws/pull/198

Today Flink does not support distributed consistency of events from a subtask (Task Manager) to the coordinator (Job Manager) - https://issues.apache.org/jira/browse/FLINK-28639. As a result, we have a race condition that can lead to a shard and its child shards no longer being processed after a job restart:
- A checkpoint started
- The enumerator took its checkpoint (the shard was assigned here)
- The enumerator sent the checkpoint event to the reader
- Before the reader took its checkpoint, a SplitFinishedEvent arrived in the reader
- The reader took its checkpoint
- Just after the checkpoint completed, the job restarted

This can lead to a shard lineage getting lost, because the shard is in ASSIGNED state in the enumerator yet is not part of any task manager state. This PR changes the behaviour by also checkpointing the finished-split events received between two checkpoints; on restore, those events are replayed.

## Verifying this change

Please make sure both new and modified tests in this PR follow the conventions defined in our code quality guide: https://flink.apache.org/contributing/code-style-and-quality-common.html#testing

- Added UTs
- Manually verified by running the connector in a local Flink cluster which was restarted every 10 minutes. No checkpoint inconsistency was observed.

## Significant changes

*(Please check any boxes [x] if the answer is "yes". You can first publish the PR and check them afterwards, for convenience.)*

- [ ] Dependencies have been added or upgraded
- [ ] Public API has been changed (Public API is any class annotated with `@Public(Evolving)`)
- [x] Serializers have been changed
- [ ] New feature has been introduced
  - If yes, how is this documented? (not applicable / docs / JavaDocs / not documented)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
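The buffering-and-replay approach described in the purpose section can be sketched as follows. This is a minimal illustration, not the actual connector code: the class and method names (`FinishedSplitBuffer`, `onSplitFinished`, `snapshotState`, `restoreAndReplay`) are hypothetical. The idea is that finished-split events arriving between two checkpoints are kept in reader state, so a restore can replay them to the enumerator instead of silently dropping the shard lineage.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of the fix: finished-split events received between two
 * checkpoints are buffered into reader state, so that after a restore they can
 * be replayed rather than lost when the enumerator still holds the shard as
 * ASSIGNED but no task manager state references it.
 */
public class FinishedSplitBuffer {
    private final List<String> finishedSplitIds = new ArrayList<>();

    /** Called when a split finishes between two checkpoints. */
    public void onSplitFinished(String splitId) {
        finishedSplitIds.add(splitId);
    }

    /** The buffered events are included in the reader's checkpoint state. */
    public List<String> snapshotState() {
        return new ArrayList<>(finishedSplitIds);
    }

    /** On restore, re-populate the buffer and return the events to replay. */
    public List<String> restoreAndReplay(List<String> checkpointedEvents) {
        finishedSplitIds.clear();
        finishedSplitIds.addAll(checkpointedEvents);
        return new ArrayList<>(finishedSplitIds);
    }

    /** Once a checkpoint completes, the acknowledged events can be dropped. */
    public void onCheckpointComplete() {
        finishedSplitIds.clear();
    }
}
```

Under this scheme the window between "reader received a SplitFinishedEvent" and "next checkpoint completed" is covered: a restart within that window restores the buffered events and replays them, so the enumerator can finish the shard's lifecycle and assign its children.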
