[ 
https://issues.apache.org/jira/browse/FLINK-24113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17459414#comment-17459414
 ] 

Robert Metzger commented on FLINK-24113:
----------------------------------------

Thanks a lot for addressing this feature request [~chesnay] and [~Nicolaus 
Weidner].

While using it, I observed that the cluster shutdown sometimes gets stuck when 
it is triggered via the REST API. When the shutdown is initiated by a job 
cancellation (in Application Mode), it works; I haven't observed the issue 
there yet.

Here's where I believe the shutdown gets stuck:
{code}
"AkkaRpcService-Supervisor-Termination-Future-Executor-thread-1" #89 daemon prio=5 os_prio=0 tid=0x0000004017d70000 nid=0x2ec in Object.wait() [0x000000402a9b5000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000d6c48368> (a org.apache.flink.runtime.blob.BlobServer)
        at java.lang.Thread.join(Thread.java:1252)
        - locked <0x00000000d6c48368> (a org.apache.flink.runtime.blob.BlobServer)
        at java.lang.Thread.join(Thread.java:1326)
        at org.apache.flink.runtime.blob.BlobServer.close(BlobServer.java:319)
        at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.stopClusterServices(ClusterEntrypoint.java:406)
        - locked <0x00000000d5d27350> (a java.lang.Object)
        at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$shutDownAsync$4(ClusterEntrypoint.java:505)
        at org.apache.flink.runtime.entrypoint.ClusterEntrypoint$$Lambda$1113/1220951830.get(Unknown Source)
        at org.apache.flink.util.concurrent.FutureUtils.lambda$composeAfterwards$20(FutureUtils.java:728)
        at org.apache.flink.util.concurrent.FutureUtils$$Lambda$1083/1178655216.accept(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.flink.util.concurrent.FutureUtils.lambda$null$19(FutureUtils.java:739)
        at org.apache.flink.util.concurrent.FutureUtils$$Lambda$1088/1499303232.accept(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.flink.runtime.entrypoint.component.DispatcherResourceManagerComponent.lambda$closeAsyncInternal$2(DispatcherResourceManagerComponent.java:198)
        at org.apache.flink.runtime.entrypoint.component.DispatcherResourceManagerComponent$$Lambda$1133/525033897.accept(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.flink.util.concurrent.FutureUtils$CompletionConjunctFuture.completeFuture(FutureUtils.java:1000)
        - locked <0x00000000c14d6000> (a java.lang.Object)
        at org.apache.flink.util.concurrent.FutureUtils$CompletionConjunctFuture$$Lambda$544/1791014677.accept(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1389)
        at org.apache.flink.util.concurrent.FutureUtils.lambda$forwardTo$24(FutureUtils.java:1372)
        at org.apache.flink.util.concurrent.FutureUtils$$Lambda$599/1004862656.accept(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1389)
        at org.apache.flink.util.concurrent.FutureUtils.lambda$forwardTo$24(FutureUtils.java:1372)
        at org.apache.flink.util.concurrent.FutureUtils$$Lambda$599/1004862656.accept(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
        at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
        at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils$$Lambda$589/953925250.run(Unknown Source)
        at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
        at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$withContextClassLoader$0(ClassLoadingUtils.java:41)
        at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils$$Lambda$585/1952194564.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}

I'll attach the full log and continue investigating. Once we've understood the 
issue, I'm happy to create a separate ticket.
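
For readers unfamiliar with this pattern: in the trace, BlobServer.close calls 
Thread.join while holding the monitor of the BlobServer object itself, which 
suggests that close is waiting, without a timeout, for the BlobServer thread 
to terminate. The sketch below is not Flink code. It is a minimal, hypothetical 
reproduction of that shape (the names JoinOnCloseSketch, StubbornServer, and 
closesInTime are invented for illustration), and it uses a timed join so that 
the demo itself cannot hang. Why the real thread does not terminate is not 
known at this point; swallowing the interrupt is only one assumed way to get 
there.

```java
import java.util.concurrent.CountDownLatch;

public class JoinOnCloseSketch {

    /** Hypothetical stand-in for a server that runs on its own thread. */
    static final class StubbornServer extends Thread {
        // Never counted down: the run loop can only exit via interruption.
        private final CountDownLatch release = new CountDownLatch(1);
        private final boolean honorsInterrupt;

        StubbornServer(boolean honorsInterrupt) {
            super("stubborn-server");
            this.honorsInterrupt = honorsInterrupt;
            setDaemon(true); // lets the JVM exit even if the thread never stops
        }

        @Override
        public void run() {
            while (true) {
                try {
                    release.await();
                    return;
                } catch (InterruptedException e) {
                    if (honorsInterrupt) {
                        return; // cooperative shutdown
                    }
                    // Assumed failure mode: the interrupt is swallowed and
                    // the thread goes back to waiting.
                }
            }
        }

        /**
         * Interrupts the server thread and waits for it to terminate.
         * An untimed join() here would block forever for a stubborn server;
         * the timeout is what keeps this sketch from hanging.
         *
         * @return true if the thread terminated within the timeout
         */
        boolean close(long timeoutMillis) throws InterruptedException {
            interrupt();
            join(timeoutMillis);
            return !isAlive();
        }
    }

    /** Starts a server and reports whether close() completed in time. */
    static boolean closesInTime(boolean honorsInterrupt) throws InterruptedException {
        StubbornServer server = new StubbornServer(honorsInterrupt);
        server.start();
        return server.close(500);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("cooperative server closed in time: " + closesInTime(true));
        System.out.println("stubborn server closed in time: " + closesInTime(false));
    }
}
```

With an untimed join(), the second call would leave the closing thread in 
WAITING (on object monitor) inside Thread.join, which is the state shown in 
the dump above.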

> Introduce option in Application Mode to disable shutdown
> --------------------------------------------------------
>
>                 Key: FLINK-24113
>                 URL: https://issues.apache.org/jira/browse/FLINK-24113
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Robert Metzger
>            Assignee: Nicolaus Weidner
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.0
>
>         Attachments: shutdown_issue.log
>
>
> Currently a Flink JobManager started in Application Mode will shut down once 
> the job has completed.
> When doing a "stop with savepoint" operation, we want to keep the JobManager 
> alive after the job has stopped to retrieve and persist the final savepoint 
> location.
> Currently, Flink waits up to 5 minutes and then shuts down.
> I'm proposing to introduce a new configuration flag "application mode 
> shutdown behavior": "keepalive" (naming things is hard ;) ) which will keep 
> the JobManager in Application Mode running until a REST call confirms that it 
> can shut down.
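
As an illustration of the REST call mentioned above, here is a minimal sketch 
of a client that asks the JobManager to shut down. It assumes the cluster 
shutdown endpoint of Flink's REST API (DELETE on /cluster) and a REST address 
of http://localhost:8081; both the address and the class and method names 
(ClusterShutdownRequestSketch, prepareShutdown) are assumptions for this 
example and are not taken from the ticket. Timeouts are set so that the client 
does not block indefinitely if the shutdown gets stuck.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class ClusterShutdownRequestSketch {

    /**
     * Prepares, but does not send, a DELETE request against restBaseUrl + "/cluster".
     * The connection is only opened once the caller reads the response.
     */
    static HttpURLConnection prepareShutdown(String restBaseUrl) throws IOException {
        URL url = new URL(restBaseUrl + "/cluster");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("DELETE");
        conn.setConnectTimeout(5_000);  // milliseconds
        conn.setReadTimeout(10_000);    // milliseconds
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default address; pass the real REST address as the first argument.
        String base = args.length > 0 ? args[0] : "http://localhost:8081";
        HttpURLConnection conn = prepareShutdown(base);
        try {
            // Reading the response code is what actually sends the request.
            System.out.println("HTTP " + conn.getResponseCode());
        } finally {
            conn.disconnect();
        }
    }
}
```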


