[jira] [Created] (FLINK-37627) Restarting from a checkpoint/savepoint which coincides with shard split causes data loss

Keith Lee (Jira) Mon, 07 Apr 2025 05:40:22 -0700

Keith Lee created FLINK-37627:
---------------------------------

             Summary: Restarting from a checkpoint/savepoint which coincides 
with shard split causes data loss
                 Key: FLINK-37627
                 URL: https://issues.apache.org/jira/browse/FLINK-37627
             Project: Flink
          Issue Type: Bug
          Components: Connectors / Kinesis
    Affects Versions: aws-connector-5.0.0
            Reporter: Keith Lee



Similar to DDB stream connector's issue 
https://issues.apache.org/jira/browse/FLINK-37416

This is less likely to happen on Kinesis connector due to much lower frequency 
of re-sharding / assigning new split but technically possible so we'd like to 
fix this to avoid data
loss.

The scenario is as follow:

- A checkpoint started
- KinesisStreamsSourceEnumerator takes a checkpoint (shard was assigned here)
- KinesisStreamsSourceEnumerator sends checkpoint event to reader
- Before taking reader checkpoint, a SplitFinishedEvent came up in reader
- Reader takes checkpoint
- Now, just after checkpoint complete, job restarted

This can lead to a shard lineage getting lost because of a shard being in 
ASSIGNED state in enumerator and not being part of any task manager state.

See DDB Connector issue's PR for reference fix: 
https://issues.apache.org/jira/browse/FLINK-37416



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (FLINK-37627) Restarting from a checkpoint/savepoint which coincides with shard split causes data loss

Reply via email to