Re: Error during shutdown of StandaloneApplicationClusterEntryPoint via JVM shutdown hook
I'd try to increase the value, so that the timeout doesn't happen during the shutdown. Regards, Roman On Fri, Apr 8, 2022 at 7:50 PM Alexey Trenikhun wrote: > > Hi Roman, > Currently rest.async.store-duration is not set. Are you suggesting to try to > decrease value from default or vice-versa? > > Thanks, > Alexey > > From: Roman Khachatryan > Sent: Friday, April 8, 2022 5:32:45 AM > To: Alexey Trenikhun > Cc: Flink User Mail List > Subject: Re: Error during shutdown of StandaloneApplicationClusterEntryPoint > via JVM shutdown hook > > Hello, > > Unfortunately, it's difficult to name the exact reason why the timeout > happens because there's no message logged. > I've opened a ticket to improve the logging [1]. > There, I also listed some code paths that might lead to this situation. > > From the described scenario, I'd suppose that it's > CompletedOperationCache.closeAsync() that times out. It can be > verified or maybe mitigated by changing rest.async.store-duration [2] > (the default is 5 minutes). > Could you check that? > > [1] > https://issues.apache.org/jira/browse/FLINK-27144 > [2] > https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rest-async-store-duration > > Regards, > Roman > > On Wed, Apr 6, 2022 at 5:21 PM Alexey Trenikhun wrote: > > > > Hello, > > > > We are using Flink 1.13.6, Application Mode, k8s HA. To upgrade job, we use > > POST, > > url=http://gsp-jm:8081/jobs//savepoints, > > then we wait for up to 5 minutes for completion, periodically pulling > > status (GET, > > url=http://gsp-jm:8081/jobs//savepoints/{triggerId}). > > If savepoint is not complete in 5 minute, we cancel job (PATCH, > > url=http://gsp-jm:8081/jobs/000). Usually it > > works well, job stopped one way or another and we proceed with upgrade, but > > currently JM exits with code -2, and as result k8s restarts pod. We tried > > multiple times, but every time getting -2. JM log is below (newest messages > > on top): > > > > 2022-04-06T14:21:17.465Z Error during shutdown of > > StandaloneApplicationClusterEntryPoint via JVM shutdown hook. > > java.util.concurrent.CompletionException: > > java.util.concurrent.TimeoutException > > at > > java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) > > at > > java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) > > at > > java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:661) > > at > > java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:683) > > at > > java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2010) > > at > > org.apache.flink.runtime.entrypoint.ClusterEntrypoint.closeAsync(ClusterEntrypoint.java:379) > > at > > org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$new$0(ClusterEntrypoint.java:168) > > at > > org.apache.flink.util.ShutdownHookUtil.lambda$addShutdownHook$0(ShutdownHookUtil.java:39) > > at java.lang.Thread.run(Thread.java:750) Caused by: > > java.util.concurrent.TimeoutException: null > > at > > org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1255) > > at > > org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217) > > at > > org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:582) > > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > > at > > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > > at > > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > > at > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > ... 1 common frames omitted > > 2022-04-06T14:21:17.464Z Terminating cluster entrypoint process > > StandaloneApplicationClusterEntryPoint with exit code 2. > > java.util.concurrent.TimeoutException: null > > at > > org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1255) > > at > > org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecuto
Re: Error during shutdown of StandaloneApplicationClusterEntryPoint via JVM shutdown hook
Hi Roman, Currently rest.async.store-duration is not set. Are you suggesting to try to decrease value from default or vice-versa? Thanks, Alexey From: Roman Khachatryan Sent: Friday, April 8, 2022 5:32:45 AM To: Alexey Trenikhun Cc: Flink User Mail List Subject: Re: Error during shutdown of StandaloneApplicationClusterEntryPoint via JVM shutdown hook Hello, Unfortunately, it's difficult to name the exact reason why the timeout happens because there's no message logged. I've opened a ticket to improve the logging [1]. There, I also listed some code paths that might lead to this situation. >From the described scenario, I'd suppose that it's CompletedOperationCache.closeAsync() that times out. It can be verified or maybe mitigated by changing rest.async.store-duration [2] (the default is 5 minutes). Could you check that? [1] https://issues.apache.org/jira/browse/FLINK-27144 [2] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rest-async-store-duration Regards, Roman On Wed, Apr 6, 2022 at 5:21 PM Alexey Trenikhun wrote: > > Hello, > > We are using Flink 1.13.6, Application Mode, k8s HA. To upgrade job, we use > POST, > url=http://gsp-jm:8081/jobs//savepoints, then > we wait for up to 5 minutes for completion, periodically pulling status (GET, > url=http://gsp-jm:8081/jobs//savepoints/{triggerId}). > If savepoint is not complete in 5 minute, we cancel job (PATCH, > url=http://gsp-jm:8081/jobs/000). Usually it > works well, job stopped one way or another and we proceed with upgrade, but > currently JM exits with code -2, and as result k8s restarts pod. We tried > multiple times, but every time getting -2. JM log is below (newest messages > on top): > > 2022-04-06T14:21:17.465Z Error during shutdown of > StandaloneApplicationClusterEntryPoint via JVM shutdown hook. > java.util.concurrent.CompletionException: > java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) > at > java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) > at > java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:661) > at > java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:683) > at > java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2010) > at > org.apache.flink.runtime.entrypoint.ClusterEntrypoint.closeAsync(ClusterEntrypoint.java:379) > at > org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$new$0(ClusterEntrypoint.java:168) > at > org.apache.flink.util.ShutdownHookUtil.lambda$addShutdownHook$0(ShutdownHookUtil.java:39) > at java.lang.Thread.run(Thread.java:750) Caused by: > java.util.concurrent.TimeoutException: null > at > org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1255) > at > org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217) > at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:582) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > ... 1 common frames omitted > 2022-04-06T14:21:17.464Z Terminating cluster entrypoint process > StandaloneApplicationClusterEntryPoint with exit code 2. > java.util.concurrent.TimeoutException: null > at > org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1255) > at > org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217) > at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:582) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.
Re: Error during shutdown of StandaloneApplicationClusterEntryPoint via JVM shutdown hook
Hello, Unfortunately, it's difficult to name the exact reason why the timeout happens because there's no message logged. I've opened a ticket to improve the logging [1]. There, I also listed some code paths that might lead to this situation. >From the described scenario, I'd suppose that it's CompletedOperationCache.closeAsync() that times out. It can be verified or maybe mitigated by changing rest.async.store-duration [2] (the default is 5 minutes). Could you check that? [1] https://issues.apache.org/jira/browse/FLINK-27144 [2] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rest-async-store-duration Regards, Roman On Wed, Apr 6, 2022 at 5:21 PM Alexey Trenikhun wrote: > > Hello, > > We are using Flink 1.13.6, Application Mode, k8s HA. To upgrade job, we use > POST, > url=http://gsp-jm:8081/jobs//savepoints, then > we wait for up to 5 minutes for completion, periodically pulling status (GET, > url=http://gsp-jm:8081/jobs//savepoints/{triggerId}). > If savepoint is not complete in 5 minute, we cancel job (PATCH, > url=http://gsp-jm:8081/jobs/000). Usually it > works well, job stopped one way or another and we proceed with upgrade, but > currently JM exits with code -2, and as result k8s restarts pod. We tried > multiple times, but every time getting -2. JM log is below (newest messages > on top): > > 2022-04-06T14:21:17.465Z Error during shutdown of > StandaloneApplicationClusterEntryPoint via JVM shutdown hook. > java.util.concurrent.CompletionException: > java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) > at > java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) > at > java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:661) > at > java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:683) > at > java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2010) > at > org.apache.flink.runtime.entrypoint.ClusterEntrypoint.closeAsync(ClusterEntrypoint.java:379) > at > org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$new$0(ClusterEntrypoint.java:168) > at > org.apache.flink.util.ShutdownHookUtil.lambda$addShutdownHook$0(ShutdownHookUtil.java:39) > at java.lang.Thread.run(Thread.java:750) Caused by: > java.util.concurrent.TimeoutException: null > at > org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1255) > at > org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217) > at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:582) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > ... 1 common frames omitted > 2022-04-06T14:21:17.464Z Terminating cluster entrypoint process > StandaloneApplicationClusterEntryPoint with exit code 2. > java.util.concurrent.TimeoutException: null > at > org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1255) > at > org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217) > at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:582) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:750) > 2022-04-06T14:21:17.463Z Stopped Akka RPC service. > 2022-04-06T14:21:17.46Z Stopped Akka RPC service. > 2022-04-06T14:21:17.373Z Remoting shut down. > 2022-04-06T14:21:17.373Z Remoting shut down. > 2022-04-06T14:21:17.297Z Remote daemon shut down; proceedin
Error during shutdown of StandaloneApplicationClusterEntryPoint via JVM shutdown hook
Hello, We are using Flink 1.13.6, Application Mode, k8s HA. To upgrade job, we use POST, url=http://gsp-jm:8081/jobs//savepoints, then we wait for up to 5 minutes for completion, periodically pulling status (GET, url=http://gsp-jm:8081/jobs//savepoints/{triggerId}). If savepoint is not complete in 5 minute, we cancel job (PATCH, url=http://gsp-jm:8081/jobs/000). Usually it works well, job stopped one way or another and we proceed with upgrade, but currently JM exits with code -2, and as result k8s restarts pod. We tried multiple times, but every time getting -2. JM log is below (newest messages on top): 2022-04-06T14:21:17.465Z Error during shutdown of StandaloneApplicationClusterEntryPoint via JVM shutdown hook. java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:661) at java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:683) at java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2010) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.closeAsync(ClusterEntrypoint.java:379) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$new$0(ClusterEntrypoint.java:168) at org.apache.flink.util.ShutdownHookUtil.lambda$addShutdownHook$0(ShutdownHookUtil.java:39) at java.lang.Thread.run(Thread.java:750) Caused by: java.util.concurrent.TimeoutException: null at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1255) at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217) at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:582) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ... 1 common frames omitted 2022-04-06T14:21:17.464Z Terminating cluster entrypoint process StandaloneApplicationClusterEntryPoint with exit code 2. java.util.concurrent.TimeoutException: null at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1255) at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217) at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:582) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) 2022-04-06T14:21:17.463Z Stopped Akka RPC service. 2022-04-06T14:21:17.46Z Stopped Akka RPC service. 2022-04-06T14:21:17.373Z Remoting shut down. 2022-04-06T14:21:17.373Z Remoting shut down. 2022-04-06T14:21:17.297Z Remote daemon shut down; proceeding with flushing remote transports. 2022-04-06T14:21:17.297Z Remote daemon shut down; proceeding with flushing remote transports. 2022-04-06T14:21:17.296Z Shutting down remote daemon. 2022-04-06T14:21:17.296Z Shutting down remote daemon. 2022-04-06T14:21:17.286Z Stopping Akka RPC service. 2022-04-06T14:21:17.281Z Stopping Akka RPC service. 2022-04-06T14:21:17.277Z Finished cleaning up the high availability data. 2022-04-06T14:21:14.04Z Close and clean up all data for KubernetesHaServices. 2022-04-06T14:21:14.04Z Stopped BLOB server at 0.0.0.0:6124 2022-04-06T14:21:14.039Z Stopping DefaultJobGraphStore. 2022-04-06T14:21:14.037Z The watcher is closing. 2022-04-06T14:21:14.037Z Closing KubernetesLeaderElectionDriver{configMapName='gsp-resourcemanager-leader'}. 2022-04-06T14:21:14.037Z Stopping DefaultLeaderElectionService. 2022-04-06T14:21:14.036Z Stopped dispatcher akka.tcp://flink@10.204.19.239:6123/user/rpc/dispatcher_1. 2022-04-06T14:21:14.035Z Suspending the slot manager. 2022-04-06T14:21:14.035Z Closing the slot manager. 2022-04-06T14:21:14.035Z Stopping all currently running jobs of dispatcher