[ https://issues.apache.org/jira/browse/FLINK-24113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17459414#comment-17459414 ]
Robert Metzger commented on FLINK-24113:
----------------------------------------

Thanks a lot for addressing this feature request [~chesnay] and [~Nicolaus Weidner].

While using it, I observed that the cluster shutdown sometimes gets stuck if it is triggered by the REST API. When the cluster shutdown is initiated by a job cancellation (in Application Mode), it works; I haven't observed the issue in that case yet.

Here's where I believe the shutdown gets stuck:
{code}
"AkkaRpcService-Supervisor-Termination-Future-Executor-thread-1" #89 daemon prio=5 os_prio=0 tid=0x0000004017d70000 nid=0x2ec in Object.wait() [0x000000402a9b5000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x00000000d6c48368> (a org.apache.flink.runtime.blob.BlobServer)
	at java.lang.Thread.join(Thread.java:1252)
	- locked <0x00000000d6c48368> (a org.apache.flink.runtime.blob.BlobServer)
	at java.lang.Thread.join(Thread.java:1326)
	at org.apache.flink.runtime.blob.BlobServer.close(BlobServer.java:319)
	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.stopClusterServices(ClusterEntrypoint.java:406)
	- locked <0x00000000d5d27350> (a java.lang.Object)
	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$shutDownAsync$4(ClusterEntrypoint.java:505)
	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint$$Lambda$1113/1220951830.get(Unknown Source)
	at org.apache.flink.util.concurrent.FutureUtils.lambda$composeAfterwards$20(FutureUtils.java:728)
	at org.apache.flink.util.concurrent.FutureUtils$$Lambda$1083/1178655216.accept(Unknown Source)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
	at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
	at org.apache.flink.util.concurrent.FutureUtils.lambda$null$19(FutureUtils.java:739)
	at org.apache.flink.util.concurrent.FutureUtils$$Lambda$1088/1499303232.accept(Unknown Source)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
	at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
	at org.apache.flink.runtime.entrypoint.component.DispatcherResourceManagerComponent.lambda$closeAsyncInternal$2(DispatcherResourceManagerComponent.java:198)
	at org.apache.flink.runtime.entrypoint.component.DispatcherResourceManagerComponent$$Lambda$1133/525033897.accept(Unknown Source)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
	at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
	at org.apache.flink.util.concurrent.FutureUtils$CompletionConjunctFuture.completeFuture(FutureUtils.java:1000)
	- locked <0x00000000c14d6000> (a java.lang.Object)
	at org.apache.flink.util.concurrent.FutureUtils$CompletionConjunctFuture$$Lambda$544/1791014677.accept(Unknown Source)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
	at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
	at org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1389)
	at org.apache.flink.util.concurrent.FutureUtils.lambda$forwardTo$24(FutureUtils.java:1372)
	at org.apache.flink.util.concurrent.FutureUtils$$Lambda$599/1004862656.accept(Unknown Source)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
	at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
	at org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1389)
	at org.apache.flink.util.concurrent.FutureUtils.lambda$forwardTo$24(FutureUtils.java:1372)
	at org.apache.flink.util.concurrent.FutureUtils$$Lambda$599/1004862656.accept(Unknown Source)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
	at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils$$Lambda$589/953925250.run(Unknown Source)
	at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
	at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$withContextClassLoader$0(ClassLoadingUtils.java:41)
	at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils$$Lambda$585/1952194564.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}

I'll attach the full log and continue investigating. Once we've understood the issue, I'm happy to create a separate ticket.

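For readers less familiar with this thread state: the dump shows the shutdown thread inside an untimed Thread.join() on the BlobServer object, which means it is waiting for that thread to terminate. The following is only a minimal, generic sketch of what that state looks like from outside. It is not Flink code and says nothing about why the BlobServer thread is still alive; the class name and the "hypothetical-..." thread names are invented for illustration.

```java
import java.util.concurrent.CountDownLatch;

public class JoinWaitDemo {

    /** Returns the state observed for a thread that is blocked in an untimed join(). */
    static Thread.State observeJoinState() throws InterruptedException {
        CountDownLatch release = new CountDownLatch(1);

        // Hypothetical stand-in for a server thread that has not terminated yet.
        Thread server = new Thread(() -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "hypothetical-server");
        server.setDaemon(true);
        server.start();

        // Hypothetical stand-in for the thread that runs the close() logic.
        Thread closer = new Thread(() -> {
            try {
                server.join(); // untimed join: blocks until 'server' ends
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "hypothetical-closer");
        closer.setDaemon(true);
        closer.start();

        // Poll for a bounded time until the closer has blocked inside join().
        long deadline = System.nanoTime() + 5_000_000_000L;
        while (closer.getState() != Thread.State.WAITING && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        Thread.State observed = closer.getState();

        // Let the server thread finish so that join() returns and both threads end.
        release.countDown();
        closer.join(5_000);
        return observed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("closer state while server is alive: " + observeJoinState());
    }
}
```

As long as the joined thread stays alive, the joining thread never leaves this state, which is consistent with a shutdown that appears stuck. A timed join(millis) would bound the wait, but whether that is appropriate here depends on the outcome of the investigation.
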
> Introduce option in Application Mode to disable shutdown
> --------------------------------------------------------
>
>                 Key: FLINK-24113
>                 URL: https://issues.apache.org/jira/browse/FLINK-24113
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Robert Metzger
>            Assignee: Nicolaus Weidner
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.0
>
>         Attachments: shutdown_issue.log
>
>
> Currently a Flink JobManager started in Application Mode will shut down once
> the job has completed.
> When doing a "stop with savepoint" operation, we want to keep the JobManager
> alive after the job has stopped to retrieve and persist the final savepoint
> location.
> Currently, Flink waits up to 5 minutes and then shuts down.
> I'm proposing to introduce a new configuration flag "application mode
> shutdown behavior": "keepalive" (naming things is hard ;) ) which will keep
> the JobManager in ApplicationMode running until a REST call confirms that it
> can shutdown.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)