Keith Lee created FLINK-37808:
---------------------------------
Summary: Checkpoint completed after job failure
Key: FLINK-37808
URL: https://issues.apache.org/jira/browse/FLINK-37808
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing
Affects Versions: 1.18.0
Reporter: Keith Lee
We found a case where a checkpoint was marked as completed after the job had
failed (due to loss of leadership). The checkpoint was subsequently used for
automatic recovery. Is this by design? Could it have caused issues in jobs with
two-phase commit sinks?
1. Checkpoint was triggered.
```
2025-04-09T10:26:31.077Z Triggering checkpoint 3270594
(type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD})
@ 1744194390986 for job REDACTED.
```
2. JobManager lost leadership.
```
2025-04-09T10:26:32.868Z Closing TaskExecutor connection
10.99.68.36:6122-db3d6b because: ResourceManager leader changed to new address
null
...
2025-04-09T10:26:33.940Z Disconnect TaskExecutor 10.99.68.36:6122-db3d6b
because: Job leader for job id REDACTED lost leadership.
```
3. Job failed and started restarting.
```
2025-04-09T10:26:33.982Z Job Flink Streaming Job (REDACTED) switched from
state RUNNING to RESTARTING.
```
4. Checkpoint 3270594 was unexpectedly marked as completed instead of failed.
```
2025-04-09T10:26:34.719Z Completed checkpoint 3270594 for job REDACTED
(346358222 bytes, checkpointDuration=2605 ms, finalizationTime=1127 ms).
```
5. Job was then restored from the checkpoint, which should have failed.
```
2025-04-09T10:26:44.880Z Restoring job REDACTED from Checkpoint 3270594
```
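The timeline above can be checked with a little arithmetic on the logged values. The sketch below is only an illustration, not part of Flink: it assumes that `checkpointDuration` is measured from the trigger timestamp (the epoch value after `@` in step 1) and that `finalizationTime` follows directly after it. Both are assumptions about how those log fields are computed. The names `EVENTS`, `acks_done` and `finalized` are made up for this example.

```python
from datetime import datetime, timedelta, timezone


def parse(ts: str) -> datetime:
    # Log timestamps are ISO-8601 in UTC with a trailing 'Z'.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


def ms_between(a: datetime, b: datetime) -> int:
    # Whole milliseconds from a to b.
    return (b - a) // timedelta(milliseconds=1)


# Timestamps copied from the log excerpts in steps 1, 3, 4 and 5.
EVENTS = {
    "triggered": parse("2025-04-09T10:26:31.077Z"),
    "restarting": parse("2025-04-09T10:26:33.982Z"),
    "completed": parse("2025-04-09T10:26:34.719Z"),
    "restored": parse("2025-04-09T10:26:44.880Z"),
}

# Values copied from the trigger and completion log lines.
TRIGGER_EPOCH_MS = 1744194390986
CHECKPOINT_DURATION_MS = 2605
FINALIZATION_TIME_MS = 1127

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
trigger_ts = EPOCH + timedelta(milliseconds=TRIGGER_EPOCH_MS)

# ASSUMPTION: checkpointDuration runs from the trigger timestamp to the point
# where finalization starts, and finalizationTime runs from there to the
# "Completed checkpoint" log line.
acks_done = trigger_ts + timedelta(milliseconds=CHECKPOINT_DURATION_MS)
finalized = acks_done + timedelta(milliseconds=FINALIZATION_TIME_MS)

print("finalization would start at:", acks_done.isoformat())
print("finalization would end at:  ", finalized.isoformat())
print("RESTARTING -> completed log (ms):",
      ms_between(EVENTS["restarting"], EVENTS["completed"]))
print("finalization start -> RESTARTING (ms):",
      ms_between(acks_done, EVENTS["restarting"]))
```

If those assumptions hold, the reconstructed end of finalization lands within 1 ms of the logged completion time, and the switch to RESTARTING falls inside the finalization window: roughly 391 ms after finalization would have started and 737 ms before the checkpoint was logged as completed.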