Keith Lee created FLINK-37627:
---------------------------------
Summary: Restarting from a checkpoint/savepoint which coincides
with shard split causes data loss
Key: FLINK-37627
URL: https://issues.apache.org/jira/browse/FLINK-37627
Project: Flink
Issue Type: Bug
Components: Connectors / Kinesis
Affects Versions: aws-connector-5.0.0
Reporter: Keith Lee
This is similar to the DDB Streams connector issue:
https://issues.apache.org/jira/browse/FLINK-37416
This is less likely to happen with the Kinesis connector because re-sharding
and the assignment of new splits occur much less frequently, but it is
technically possible, so we'd like to fix it to avoid data loss.
The scenario is as follows:
- A checkpoint starts.
- KinesisStreamsSourceEnumerator takes its checkpoint (the shard is recorded as
  assigned at this point).
- KinesisStreamsSourceEnumerator sends the checkpoint event to the reader.
- Before the reader takes its checkpoint, a SplitFinishedEvent occurs in the
  reader.
- The reader takes its checkpoint.
- Just after the checkpoint completes, the job restarts.
This can cause a shard lineage to be lost: after the restart, the shard is in
ASSIGNED state in the enumerator but is not part of any task manager's state,
so no reader holds it.
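
The ordering described above can be illustrated with a small toy model. This
is only a sketch and not the connector's code: EnumeratorModel, ReaderModel,
orphaned_shards and the shard id "shard-0" are hypothetical names invented for
illustration. It assumes that the enumerator snapshot records the shard as
ASSIGNED, and that the reader snapshot no longer contains the split once it
has finished.

```python
from dataclasses import dataclass, field


@dataclass
class EnumeratorModel:
    """Hypothetical stand-in for the enumerator's view: shard id -> state."""
    shard_state: dict = field(default_factory=dict)

    def snapshot(self) -> dict:
        # Copy, so later changes do not alter the checkpointed view.
        return dict(self.shard_state)


@dataclass
class ReaderModel:
    """Hypothetical stand-in for a reader: the set of splits it still holds."""
    active_splits: set = field(default_factory=set)

    def finish_split(self, shard_id: str) -> None:
        # Models the split finishing (the SplitFinishedEvent in the scenario).
        self.active_splits.discard(shard_id)

    def snapshot(self) -> set:
        return set(self.active_splits)


def orphaned_shards(enumerator_snapshot: dict, reader_snapshot: set) -> set:
    """Shards checkpointed as ASSIGNED that no reader checkpoint contains."""
    return {
        shard_id
        for shard_id, state in enumerator_snapshot.items()
        if state == "ASSIGNED" and shard_id not in reader_snapshot
    }


def simulate(finish_before_reader_checkpoint: bool) -> set:
    enumerator = EnumeratorModel({"shard-0": "ASSIGNED"})
    reader = ReaderModel({"shard-0"})

    enumerator_snapshot = enumerator.snapshot()   # enumerator checkpoint
    if finish_before_reader_checkpoint:
        reader.finish_split("shard-0")            # split finishes in between
    reader_snapshot = reader.snapshot()           # reader checkpoint

    # On restart, only the two snapshots survive.
    return orphaned_shards(enumerator_snapshot, reader_snapshot)


racy = simulate(finish_before_reader_checkpoint=True)
safe = simulate(finish_before_reader_checkpoint=False)
```

In this model, racy contains "shard-0" (ASSIGNED in the enumerator snapshot
but held by no reader), while safe is empty because the reader checkpoint
still contains the split.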
See the PR for the DDB connector issue for a reference fix:
https://issues.apache.org/jira/browse/FLINK-37416