[ https://issues.apache.org/jira/browse/FLINK-16510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167055#comment-17167055 ]
Till Rohrmann commented on FLINK-16510:
---------------------------------------

Thanks [~mxm]. It does indeed look as if the JVM is doing some garbage collection:

{code}
0x00007ff70aacdf93 _ZN9CodeCache8blobs_doEP15CodeBlobClosure + 0x93
0x00007ff70ac2c82d _ZN15G1RootProcessor17process_all_rootsEP10OopClosureP10CLDClosureP15CodeBlobClosure + 0xdd
0x00007ff70ac2305c _ZN11G1MarkSweep17mark_sweep_phase3Ev + 0xbc
0x00007ff70ac231f4 _ZN11G1MarkSweep19invoke_at_safepointEP18ReferenceProcessorb + 0xf4
0x00007ff70ac06d08 _ZN15G1CollectedHeap13do_collectionEbbm.part.286 + 0x558
0x00007ff70ac07735 _ZN15G1CollectedHeap25satisfy_failed_allocationEmhPb + 0x75
0x00007ff70b13b46f _ZN25VM_G1CollectForAllocation4doitEv + 0x7f
0x00007ff70b13a446 _ZN12VM_Operation8evaluateEv + 0x46
0x00007ff70b1386f5 _ZN8VMThread18evaluate_operationEP12VM_Operation + 0xe5
0x00007ff70b138fce _ZN8VMThread4loopEv + 0x3be
0x00007ff70b139428 _ZN8VMThread3runEv + 0xb8
0x00007ff70af6ae62 _ZL10java_startP6Thread + 0x102
{code}

Judging from the stack trace, it does not seem to be blocked, though. If the problem is reproducible, does it also occur if you switch from G1 to the parallel GC by replacing {{-XX:+UseG1GC}} with {{-XX:+UseParallelGC}}?

> Task manager safeguard shutdown may not be reliable
> ---------------------------------------------------
>
>                 Key: FLINK-16510
>                 URL: https://issues.apache.org/jira/browse/FLINK-16510
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>            Reporter: Maximilian Michels
>            Assignee: Maximilian Michels
>            Priority: Major
>         Attachments: command.txt, stack2-1.txt, stack3-mixed.txt, stack3.txt
>
>
> The {{JvmShutdownSafeguard}} does not always succeed but can hang when
> multiple threads attempt to shut down the JVM. Apparently mixing
> {{System.exit()}} with ShutdownHooks and forcefully terminating the JVM via
> {{Runtime.halt()}} does not play together well:
> {noformat}
> "Jvm Terminator" #22 daemon prio=5 os_prio=0 tid=0x00007fb8e82f2800 nid=0x5a96 runnable [0x00007fb35cffb000]
>    java.lang.Thread.State: RUNNABLE
>     at java.lang.Shutdown.$$YJP$$halt0(Native Method)
>     at java.lang.Shutdown.halt0(Shutdown.java)
>     at java.lang.Shutdown.halt(Shutdown.java:139)
>     - locked <0x000000047ed67638> (a java.lang.Shutdown$Lock)
>     at java.lang.Runtime.halt(Runtime.java:276)
>     at org.apache.flink.runtime.util.JvmShutdownSafeguard$DelayedTerminator.run(JvmShutdownSafeguard.java:86)
>     at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers:
>     - None
> "FlinkCompletableFutureDelayScheduler-thread-1" #18154 daemon prio=5 os_prio=0 tid=0x00007fb708a7d000 nid=0x5a8a waiting for monitor entry [0x00007fb289d49000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>     at java.lang.Shutdown.halt(Shutdown.java:139)
>     - waiting to lock <0x000000047ed67638> (a java.lang.Shutdown$Lock)
>     at java.lang.Shutdown.exit(Shutdown.java:213)
>     - locked <0x000000047edb7348> (a java.lang.Class for java.lang.Shutdown)
>     at java.lang.Runtime.exit(Runtime.java:110)
>     at java.lang.System.exit(System.java:973)
>     at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.terminateJVM(TaskManagerRunner.java:266)
>     at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$onFatalError$1(TaskManagerRunner.java:260)
>     at org.apache.flink.runtime.taskexecutor.TaskManagerRunner$$Lambda$27464/1464672548.accept(Unknown Source)
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>     at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:943)
>     at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211)
>     at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$11(FutureUtils.java:361)
>     at org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$27435/159015392.run(Unknown Source)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers:
>     - <0x00000006d5e56bd0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> {noformat}
> Note that under this condition the JVM should terminate but it still hangs.
> Sometimes it quits after several minutes.
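
For readers unfamiliar with the pattern visible in the "Jvm Terminator" trace (a daemon thread that calls {{Runtime.halt()}} after a delay), here is a minimal sketch of the general idea. It is not Flink's {{JvmShutdownSafeguard}}: the class name, the {{install}} method, and the injectable {{terminator}} callback are all hypothetical, introduced only so the example can run without actually killing the JVM. It neither reproduces nor fixes the hang described in this issue.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.IntConsumer;

/** Hypothetical sketch of a delayed-terminator thread; NOT Flink's JvmShutdownSafeguard. */
public class DelayedTerminatorSketch {

    /**
     * Starts a daemon thread that sleeps for delayMillis and then passes exitCode
     * to the given terminator. In a real safeguard the terminator would be
     * Runtime.getRuntime()::halt; it is injectable here (an assumption of this
     * sketch) so that the behavior can be observed safely.
     */
    public static Thread install(long delayMillis, int exitCode, IntConsumer terminator) {
        Thread thread = new Thread(() -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                // Sketch-level choice: treat interruption as cancellation.
                Thread.currentThread().interrupt();
                return;
            }
            terminator.accept(exitCode);
        }, "Delayed Terminator Sketch");
        // Daemon, so the safeguard thread itself never keeps the JVM alive.
        thread.setDaemon(true);
        thread.start();
        return thread;
    }

    public static void main(String[] args) {
        // Record the exit code instead of halting, so this demo terminates normally.
        CompletableFuture<Integer> observed = new CompletableFuture<>();
        install(100L, -17, observed::complete);
        System.out.println("terminator invoked with exit code " + observed.join());

        // Forceful variant (commented out because it would kill this JVM):
        // install(5000L, -17, Runtime.getRuntime()::halt);
    }
}
```

Note that in the trace above the terminator thread has already entered {{Shutdown.halt0}}, so a sketch like this only illustrates how such a thread is set up, not why the native halt takes so long to complete.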