[jira] [Commented] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager

2024-06-19 Thread Matthias Pohl (Jira)


[ https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856230#comment-17856230 ]
Matthias Pohl commented on FLINK-35639:
---

[~chesnay] pointed me to the actual issue because I was wondering why the 
change in FLINK-32570 was "overlooked" by our {{japicmp}} checks. The 
problem is that you're not following the supported upgrade process (as 
documented in the [Flink 
docs|https://nightlies.apache.org/flink/flink-docs-master/docs/ops/upgrading/#restarting-streaming-applications]).
 That results in incompatibilities of internal APIs (the constructor in 
question is package-private). Please use savepoints to migrate jobs. There are 
other internal APIs (the JobGraph itself isn't a stable API, either) that might 
cause problems in your upgrade process.
 # Create a savepoint of the job in the old version.
 # Start the Flink cluster with the upgraded Flink version.
 # Submit the job using the created savepoint to restart the job.
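For a streaming job, the three steps roughly look like this on the command line (a sketch reusing the paths and JobID from this report; the savepoint target directory and the resulting savepoint path are placeholders you'd adjust):

```shell
# 1. Stop the running job on the old (1.18) cluster, taking a savepoint
#    (the target directory is a placeholder):
./flink-1.18.1/bin/flink stop --savepointPath /tmp/flink/savepoints 5f0898c964a93a47aa480427f3e2c6c0

# 2. Shut down the old cluster and start the upgraded (1.19) one:
./flink-1.19.0/bin/jobmanager.sh start-foreground
./flink-1.19.0/bin/taskmanager.sh start-foreground

# 3. Resubmit the job from the savepoint path printed by step 1
#    (placeholder path shown here):
./flink-1.19.0/bin/flink run -s /tmp/flink/savepoints/savepoint-5f0898-xxxxxxxx \
    ../flink-job/target/myflinkjob-1.0-SNAPSHOT.jar
```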

> upgrade to 1.19 with job in HA state with restart strategy crashes job manager
> --
>
> Key: FLINK-35639
> URL: https://issues.apache.org/jira/browse/FLINK-35639
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Affects Versions: 1.20.0, 1.19.1
> Environment: Download the 1.18 and 1.19 binary releases. Add the
> following to flink-1.19.0/conf/config.yaml and
> flink-1.18.1/conf/flink-conf.yaml:
> ```yaml
> high-availability: zookeeper
> high-availability.zookeeper.quorum: localhost
> high-availability.storageDir: file:///tmp/flink/recovery
> ```
> Launch ZooKeeper:
> docker run --network host zookeeper:latest
> Launch the 1.18 task manager:
> ./flink-1.18.1/bin/taskmanager.sh start-foreground
> Launch the 1.18 job manager:
> ./flink-1.18.1/bin/jobmanager.sh start-foreground
> Launch the following job:
> ```java
> import org.apache.flink.api.java.ExecutionEnvironment;
> import org.apache.flink.api.java.tuple.Tuple2;
> import org.apache.flink.api.common.functions.FlatMapFunction;
> import org.apache.flink.util.Collector;
> import org.apache.flink.api.common.restartstrategy.RestartStrategies;
> import org.apache.flink.api.common.time.Time;
> import java.util.concurrent.TimeUnit;
>
> public class FlinkJob {
>     public static void main(String[] args) throws Exception {
>         final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>         env.setRestartStrategy(
>             RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, Time.of(20, TimeUnit.SECONDS)));
>         env.fromElements("Hello World", "Hello Flink")
>             .flatMap(new LineSplitter())
>             .groupBy(0)
>             .sum(1)
>             .print();
>     }
>
>     public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
>         @Override
>         public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
>             for (String word : value.split(" ")) {
>                 try {
>                     Thread.sleep(12);
>                 } catch (InterruptedException e) {
>                     e.printStackTrace();
>                 }
>                 out.collect(new Tuple2<>(word, 1));
>             }
>         }
>     }
> }
> ```
> with the following pom.xml:
> ```xml
> <project xmlns="http://maven.apache.org/POM/4.0.0">
>     <modelVersion>4.0.0</modelVersion>
>     <groupId>org.apache.flink</groupId>
>     <artifactId>myflinkjob</artifactId>
>     <version>1.0-SNAPSHOT</version>
>     <properties>
>         <flink.version>1.18.1</flink.version>
>         <java.version>1.8</java.version>
>     </properties>
>     <dependencies>
>         <dependency>
>             <groupId>org.apache.flink</groupId>
>             <artifactId>flink-java</artifactId>
>             <version>${flink.version}</version>
>         </dependency>
>         <dependency>
>             <groupId>org.apache.flink</groupId>
>             <artifactId>flink-streaming-java</artifactId>
>             <version>${flink.version}</version>
>         </dependency>
>     </dependencies>
>     <build>
>         <plugins>
>             <plugin>
>                 <groupId>org.apache.maven.plugins</groupId>
>                 <artifactId>maven-compiler-plugin</artifactId>
>                 <version>3.8.1</version>
>                 <configuration>
>                     <source>${java.version}</source>
>                     <target>${java.version}</target>
>                 </configuration>
>             </plugin>
>             <plugin>
>                 <groupId>org.apache.maven.plugins</groupId>
>                 <artifactId>maven-jar-plugin</artifactId>
>                 <version>3.1.0</version>
>                 <configuration>
>                     <archive>
>                         <manifest>
>                             <addClasspath>true</addClasspath>
>                             <classpathPrefix>lib/</classpathPrefix>
>                             <mainClass>FlinkJob</mainClass>
>                         </manifest>
>                     </archive>
>                 </configuration>
>             </plugin>
>         </plugins>
>     </build>
> </project>
> ```
> Launch the job:
> ./flink-1.18.1/bin/flink run ../flink-job/target/myflinkjob-1.0-SNAPSHOT.jar
> Job has been submitted with JobID 5f0898c964a93a47aa480427f3e2c6c0
> Kill the job manager and task manager, then launch the 1.19.0 job manager:
> ./flink-1.19.0/bin/jobmanager.sh start-foreground
> Root cause
> ==========
> It looks like the type of delayBetweenAttemptsInterval was changed in 1.19
> (https://github.com/apache/flink/pull/22984/files#diff-d174f32ffdea69de610c4f37c545bd22a253b9846434f83397f1bbc2aaa399faR239),
> introducing an incompatibility which is not handled by Flink 1.19. In my
> opinion, the job manager should not crash when starting in that case.
>Reporter: yazgoo
>Assignee: Matthias Pohl
>Priority: Blocker
>  Labels: pull-request-available
>
> When trying to upgrade a flink cluster from 1.18 to 1.19, with a 1.18 job in 
> zookeeper HA state, I have a jobmanager crash with a ClassCastException, see 
> log below  
>  
> {code:java}
> 2024-06-18 16:58:14,401 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal error occurred in the cluster entrypoint.
> org.apache.flink.util.FlinkException: JobMaster for job 5f0898c964a93a47aa480427f3e2c6c0 failed.
>     at org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:1484) ~[flink-dist-1.19.0.jar:1.19.0]
>     at org.apache.flink.runtime.dispatcher.Dispatcher.jobManagerRunnerFailed(Dispatcher.java:775) ~[flink-dist-1.19.0.jar:1.19.0]
>     at org.apache.flink.runtime.dispatcher.Dispatcher.handleJobManagerRunnerResult(Dispatcher.java:738) ~[flink-dist-1.19.0.jar:1.19.0]
>     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$7(Dispatcher.java:693) ~[flink-dist-1.19.0.jar:1.19.0]
>     at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934) ~[?:?]
>     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911) ~[?:?]
>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482) ~[?:?]
>     at 

[jira] [Commented] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager

2024-06-18 Thread Matthias Pohl (Jira)


[ https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855979#comment-17855979 ]

Matthias Pohl commented on FLINK-35639:
---

I guess you're right. It looks like an error was made when 
deprecating the {{@PublicEvolving}} API of 
{{RestartStrategies#FixedDelayRestartStrategyConfiguration}}. I will raise 
the priority of this issue to Blocker because it also affects 1.20.
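The shape of the failure can be reproduced with plain Java serialization, independent of Flink. The sketch below uses hypothetical stand-in classes (not Flink's actual ones): an object is written with a field declared as {{Long}} and read back by a class version that declares the same-named field as {{String}}, similar to a JobManager restoring Java-serialized HA state after a field's type changed between releases:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

public class FieldTypeChangeDemo {

    // Stand-in for the old class: the delay is declared as a Long.
    static class RestartConfigOld implements Serializable {
        private static final long serialVersionUID = 1L;
        Long delay = 20_000L;
    }

    // Stand-in for the new class: same field name, same serialVersionUID,
    // but the declared field type changed.
    static class RestartConfigNew implements Serializable {
        private static final long serialVersionUID = 1L;
        String delay;
    }

    /**
     * Writes a RestartConfigOld, patches the class name inside the byte
     * stream so deserialization resolves it to RestartConfigNew (both names
     * have equal length, so the stream stays well-formed), and reads it back.
     * Returns the simple name of the thrown exception, or "no error".
     */
    static String demo() {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(new RestartConfigOld());
            }
            // Make the stream claim it contains the "new" class, like state
            // written by the old release being read by the upgraded one.
            String patched = new String(bos.toByteArray(), StandardCharsets.ISO_8859_1)
                    .replace("RestartConfigOld", "RestartConfigNew");
            byte[] bytes = patched.getBytes(StandardCharsets.ISO_8859_1);
            try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                ois.readObject();
                return "no error";
            }
        } catch (Exception e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // The Long value in the stream cannot be assigned to the String field.
        System.out.println(demo());
    }
}
```

The savepoint workflow avoids this entirely: the job is resubmitted as a fresh JobGraph against the new version, so no Java-serialized internal objects written by the old version ever need to be read back.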
