[ https://issues.apache.org/jira/browse/FLINK-32520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739872#comment-17739872 ]

Ruibin Xing commented on FLINK-32520:
-------------------------------------

[~gyfora]  I haven't been able to reproduce this yet. However, I can provide 
the full logs from the restart (see logs-06151328-06151332.csv). The name of 
the deployment is octopus-flink-octopus-data-proces-8936e.

I submitted the second and third upgrades at 2023/06/15 05:30:02 +0000 and 
2023/06/15 05:30:25 +0000 respectively. These timestamps might not match the 
wall clock of the logs exactly, though.

> FlinkDeployment recovered states from an obsolete savepoint when performing 
> an upgrade
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-32520
>                 URL: https://issues.apache.org/jira/browse/FLINK-32520
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: 1.13.1
>            Reporter: Ruibin Xing
>            Priority: Major
>         Attachments: flink_kubernetes_operator_0615.csv, 
> logs-06151328-06151332.csv
>
>
> Kubernetes Operator version: 1.5.0
>  
> When upgrading one of our Flink jobs, it recovered from a savepoint created 
> by the previous version of the job. The timeline of the job is as follows:
>  # I upgraded the job for the first time. The job created a savepoint and 
> successfully restored from it.
>  # The job was running fine and created several checkpoints.
>  # Later, I performed the second upgrade. Soon after submitting it, and 
> before the JobManager stopped, I realized I had made a mistake in the spec, 
> so I quickly submitted the third upgrade.
>  # After the job started, I found that it had recovered from the savepoint 
> created during the first upgrade.
>  
> It appears that an error occurred when the third upgrade was submitted. 
> However, even after investigating the code, I'm still not sure why this 
> would cause Flink to restore from the obsolete savepoint. The related 
> operator logs are attached below.
>  
> Although I haven't found the root cause, I came up with some possible fixes:
>  # Remove the {{lastSavepoint}} after a job has successfully restored from it.
>  # Add a savepoint age option, similar to 
> {{kubernetes.operator.job.upgrade.last-state.max.allowed.checkpoint.age}}, 
> so that the operator refuses to recover from a savepoint whose maximum age 
> is exceeded (see the sketch after this list).
>  # Record a flag in the status that tracks savepoint completion: set it to 
> false when the savepoint starts and to true when it completes successfully. 
> The operator should report an error if the flag for the last savepoint is 
> false.
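>
> A minimal sketch of the age check from fix #2, assuming a hypothetical 
> helper class and that the option value has already been parsed into a 
> {{Duration}}; this is an illustration, not the operator's actual API:
> {code:java}
> import java.time.Duration;
> import java.time.Instant;
>
> // Hypothetical guard: refuse a last-state upgrade when the savepoint is too old.
> final class SavepointAgeCheck {
>
>     /**
>      * @param savepointCompletedAt completion time recorded for the last savepoint
>      * @param maxAllowedAge        hypothetical value of
>      *                             kubernetes.operator.job.upgrade.last-state.max.allowed.checkpoint.age
>      */
>     static void validate(Instant savepointCompletedAt, Duration maxAllowedAge) {
>         Duration age = Duration.between(savepointCompletedAt, Instant.now());
>         if (age.compareTo(maxAllowedAge) > 0) {
>             throw new IllegalStateException(
>                     "Refusing to restore: savepoint is " + age
>                             + " old, maximum allowed age is " + maxAllowedAge);
>         }
>     }
> }
> {code}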



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
