[ https://issues.apache.org/jira/browse/FLINK-27274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523625#comment-17523625 ]
Zhu Zhu commented on FLINK-27274:
---------------------------------

Sorry, but from what I see, Flink does not guarantee that all jobs will be SUSPENDED in the stop-cluster.sh case. The stop-cluster.sh script, which has not changed for years, stops all task managers before stopping the job manager. This means a job can fail before the JM is shut down. One workaround I can think of for your case is to use a script similar to stop-cluster.sh, but with a few differences: it first stops the job manager, then waits for some time, and finally stops all the task managers.

> Job cannot be recovered after restarting cluster
> -------------------------------------------------
>
> Key: FLINK-27274
> URL: https://issues.apache.org/jira/browse/FLINK-27274
> Project: Flink
> Issue Type: Bug
> Components: Table SQL / API
> Affects Versions: 1.15.0
> Environment: Flink 1.15.0-rc3
> https://github.com/apache/flink/archive/refs/tags/release-1.15.0-rc3.tar.gz
> Reporter: macdoor615
> Priority: Blocker
> Fix For: 1.15.1
>
> Attachments: flink-conf.yaml,
> flink-gum-standalonesession-0-hb3-dev-flink-000.log.3.zip,
> flink-gum-standalonesession-0-hb3-dev-flink-000.log.zip,
> flink-gum-taskexecutor-2-hb3-dev-flink-000.log, log.recover.debug.zip,
> new_cf_alarm_no_recover.yaml.sql
>
> 1. Execute new_cf_alarm_no_recover.yaml.sql with sql-client.sh (config file: flink-conf.yaml). The job runs properly.
> 2. Restart the cluster with stop-cluster.sh followed by start-cluster.sh.
> 3. The job cannot be recovered. Log files: flink-gum-standalonesession-0-hb3-dev-flink-000.log and flink-gum-taskexecutor-2-hb3-dev-flink-000.log.
> 4. Not all jobs fail to recover: at the same time, some can be recovered and some cannot.
> 5. All jobs can be recovered on Flink 1.14.4.

-- This message was sent by Atlassian Jira (v8.20.1#820001)
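The workaround Zhu Zhu describes could be sketched as a small shell script. This is only an illustration, not an official Flink script: the function name `graceful_stop`, the default FLINK_HOME, the 30-second grace period, and the DRY_RUN switch are all assumptions; `bin/jobmanager.sh stop` and `bin/taskmanager.sh stop` are the standard per-process control scripts shipped with a standalone Flink distribution.

```shell
#!/usr/bin/env bash
# Hypothetical "graceful stop" for a standalone Flink cluster: stop the
# JobManager FIRST so jobs are suspended rather than failed, wait, and only
# then stop the TaskManagers (the reverse of what stop-cluster.sh does).

# With DRY_RUN=1, print each command instead of executing it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

graceful_stop() {
  local flink_home="${FLINK_HOME:-/opt/flink}"   # assumed install path
  local grace="${GRACE_SECONDS:-30}"             # assumed wait time

  # 1. Stop the JobManager first so it can suspend all running jobs cleanly.
  run "$flink_home/bin/jobmanager.sh" stop

  # 2. Give the JobManager time to finish shutting down.
  run sleep "$grace"

  # 3. Only now stop the local TaskManager. On a multi-node cluster you
  #    would loop over conf/workers and ssh to each host, as
  #    stop-cluster.sh itself does.
  run "$flink_home/bin/taskmanager.sh" stop
}
```

Note that this only reduces the window in which a TaskManager can disappear while the JobManager is still running; it is a mitigation, not a guarantee.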