[jira] [Commented] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager
[ https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856230#comment-17856230 ] Matthias Pohl commented on FLINK-35639: --- [~chesnay] pointed me to the actual issue because I was wondering why the change in FLINK-32570 was actually "overlooked" by our {{japicmp}} checks. The problem is that you're actually not following the supported process (as documented in the [Flink docs|https://nightlies.apache.org/flink/flink-docs-master/docs/ops/upgrading/#restarting-streaming-applications]). That results in incompatibilities of internal APIs (the constructor in question is package-private). Please use savepoints to migrate jobs. There are other internal APIs (the JobGraph itself isn't a stable API, either) that might cause problems in your upgrade process. # Create a savepoint of the job in the old version. # Start the Flink cluster with the upgraded Flink version. # Submit the job using the created savepoint to restart the job. > upgrade to 1.19 with job in HA state with restart strategy crashes job manager > -- > > Key: FLINK-35639 > URL: https://issues.apache.org/jira/browse/FLINK-35639 > Project: Flink > Issue Type: Bug > Components: API / Core >Affects Versions: 1.20.0, 1.19.1 > Environment: Download 1.18 and 1.19 binary releases. Add the > following to flink-1.19.0/conf/config.yaml and > flink-1.18.1/conf/flink-conf.yaml ```yaml high-availability: zookeeper > high-availability.zookeeper.quorum: localhost high-availability.storageDir: > file:///tmp/flink/recovery ``` Launch zookeeper: docker run --network host > zookeeper:latest launch 1.18 task manager: ./flink-1.18.1/bin/taskmanager.sh > start-foreground launch 1.18 job manager: ./flink-1.18.1/bin/jobmanager.sh > start-foreground launch the following job: ```java import > org.apache.flink.api.java.ExecutionEnvironment; import > org.apache.flink.api.java.tuple.Tuple2; import > org.apache.flink.api.common.functions.FlatMapFunction; import > org.apache.flink.util.Collector; import > org.apache.flink.api.common.restartstrategy.RestartStrategies; import > org.apache.flink.api.common.time.Time; import java.util.concurrent.TimeUnit; > public class FlinkJob \{ public static void main(String[] args) throws > Exception { final ExecutionEnvironment env = > ExecutionEnvironment.getExecutionEnvironment(); env.setRestartStrategy( > RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, Time.of(20, > TimeUnit.SECONDS)) ); env.fromElements("Hello World", "Hello Flink") > .flatMap(new LineSplitter()) .groupBy(0) .sum(1) .print(); } public static > final class LineSplitter implements FlatMapFunction> \{ @Override public void > flatMap(String value, Collector> out) { for (String word : value.split(" ")) > { try { Thread.sleep(12); } catch (InterruptedException e) \{ > e.printStackTrace(); } out.collect(new Tuple2<>(word, 1)); } } } } ``` ```xml > 4.0.0 org.apache.flink myflinkjob 1.0-SNAPSHOT 1.18.1 1.8 org.apache.flink > flink-java ${flink.version} org.apache.flink flink-streaming-java > ${flink.version} org.apache.maven.plugins maven-compiler-plugin 3.8.1 > ${java.version} ${java.version} org.apache.maven.plugins maven-jar-plugin > 3.1.0 true lib/ FlinkJob ``` Launch job: ./flink-1.18.1/bin/flink run > ../flink-job/target/myflinkjob-1.0-SNAPSHOT.jar Job has been submitted with > JobID 5f0898c964a93a47aa480427f3e2c6c0 Kill job manager and task manager. > Then launch job manager 1.19.0 ./flink-1.19.0/bin/jobmanager.sh > start-foreground Root cause == It looks like the type of > delayBetweenAttemptsInterval was changed in 1.19 > https://github.com/apache/flink/pull/22984/files#diff-d174f32ffdea69de610c4f37c545bd22a253b9846434f83397f1bbc2aaa399faR239 > , introducing an incompatibility which is not handled by flink 1.19. In my > opinion, job-maanger should not crash when starting in that case. >Reporter: yazgoo >Assignee: Matthias Pohl >Priority: Blocker > Labels: pull-request-available > > When trying to upgrade a flink cluster from 1.18 to 1.19, with a 1.18 job in > zookeeper HA state, I have a jobmanager crash with a ClassCastException, see > log below > > {code:java} > 2024-06-18 16:58:14,401 ERROR > org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error > occurred in the cluster entrypoint. org.apache.flink.util.FlinkException: > JobMaster for job 5f0898c964a93a47aa480427f3e2c6c0 failed. at > org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:1484) > ~[flink-dist-1.19.0.jar:1.19.0] at > org.apache.flink.runtime.dispatcher.Dispatcher.jobManagerRunnerFailed(Dispatcher.java:775) > ~[flink-dist-1.19.0.jar:1.19.0] at >
[jira] [Commented] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager
[ https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855979#comment-17855979 ] Matthias Pohl commented on FLINK-35639: --- I guess, you're right. Looks like there was an error being made when deprecating the {{@PublicEvolving}} API of {{{}RestartStrategies#FixedDelayRestartStrategyConfiguration{}}}. I will raise the priority for this one to blocker because it also affects 1.20. > upgrade to 1.19 with job in HA state with restart strategy crashes job manager > -- > > Key: FLINK-35639 > URL: https://issues.apache.org/jira/browse/FLINK-35639 > Project: Flink > Issue Type: Bug > Components: API / Core >Affects Versions: 1.19.1 > Environment: Download 1.18 and 1.19 binary releases. Add the > following to flink-1.19.0/conf/config.yaml and > flink-1.18.1/conf/flink-conf.yaml ```yaml high-availability: zookeeper > high-availability.zookeeper.quorum: localhost high-availability.storageDir: > file:///tmp/flink/recovery ``` Launch zookeeper: docker run --network host > zookeeper:latest launch 1.18 task manager: ./flink-1.18.1/bin/taskmanager.sh > start-foreground launch 1.18 job manager: ./flink-1.18.1/bin/jobmanager.sh > start-foreground launch the following job: ```java import > org.apache.flink.api.java.ExecutionEnvironment; import > org.apache.flink.api.java.tuple.Tuple2; import > org.apache.flink.api.common.functions.FlatMapFunction; import > org.apache.flink.util.Collector; import > org.apache.flink.api.common.restartstrategy.RestartStrategies; import > org.apache.flink.api.common.time.Time; import java.util.concurrent.TimeUnit; > public class FlinkJob \{ public static void main(String[] args) throws > Exception { final ExecutionEnvironment env = > ExecutionEnvironment.getExecutionEnvironment(); env.setRestartStrategy( > RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, Time.of(20, > TimeUnit.SECONDS)) ); env.fromElements("Hello World", "Hello Flink") > .flatMap(new LineSplitter()) .groupBy(0) .sum(1) .print(); } public static > final class LineSplitter implements FlatMapFunction> \{ @Override public void > flatMap(String value, Collector> out) { for (String word : value.split(" ")) > { try { Thread.sleep(12); } catch (InterruptedException e) \{ > e.printStackTrace(); } out.collect(new Tuple2<>(word, 1)); } } } } ``` ```xml > 4.0.0 org.apache.flink myflinkjob 1.0-SNAPSHOT 1.18.1 1.8 org.apache.flink > flink-java ${flink.version} org.apache.flink flink-streaming-java > ${flink.version} org.apache.maven.plugins maven-compiler-plugin 3.8.1 > ${java.version} ${java.version} org.apache.maven.plugins maven-jar-plugin > 3.1.0 true lib/ FlinkJob ``` Launch job: ./flink-1.18.1/bin/flink run > ../flink-job/target/myflinkjob-1.0-SNAPSHOT.jar Job has been submitted with > JobID 5f0898c964a93a47aa480427f3e2c6c0 Kill job manager and task manager. > Then launch job manager 1.19.0 ./flink-1.19.0/bin/jobmanager.sh > start-foreground Root cause == It looks like the type of > delayBetweenAttemptsInterval was changed in 1.19 > https://github.com/apache/flink/pull/22984/files#diff-d174f32ffdea69de610c4f37c545bd22a253b9846434f83397f1bbc2aaa399faR239 > , introducing an incompatibility which is not handled by flink 1.19. In my > opinion, job-maanger should not crash when starting in that case. >Reporter: yazgoo >Priority: Major > > When trying to upgrade a flink cluster from 1.18 to 1.19, with a 1.18 job in > zookeeper HA state, I have a jobmanager crash with a ClassCastException, see > log below > > {code:java} > 2024-06-18 16:58:14,401 ERROR > org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error > occurred in the cluster entrypoint. org.apache.flink.util.FlinkException: > JobMaster for job 5f0898c964a93a47aa480427f3e2c6c0 failed. at > org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:1484) > ~[flink-dist-1.19.0.jar:1.19.0] at > org.apache.flink.runtime.dispatcher.Dispatcher.jobManagerRunnerFailed(Dispatcher.java:775) > ~[flink-dist-1.19.0.jar:1.19.0] at > org.apache.flink.runtime.dispatcher.Dispatcher.handleJobManagerRunnerResult(Dispatcher.java:738) > ~[flink-dist-1.19.0.jar:1.19.0] at > org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$7(Dispatcher.java:693) > ~[flink-dist-1.19.0.jar:1.19.0] at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934) > ~[?:?] at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911) > ~[?:?] at > java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482) > ~[?:?] at >