[ https://issues.apache.org/jira/browse/FLINK-11143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156363#comment-17156363 ]
Steven Zhen Wu edited comment on FLINK-11143 at 7/14/20, 5:38 PM:
------------------------------------------------------------------

[~trohrmann] I am seeing a similar problem *when trying unaligned checkpoint with 1.11.0*. The Flink job actually started fine. We didn't see this AskTimeoutException thrown during job submission without unaligned checkpoint (1.10 or 1.11).

Some more context about the app
* a large-state stream join app (a few TBs)
* parallelism 1,440
* number of containers: 180
* Cores per container: 12
* TM_TASK_SLOTS: 8
* akka.ask.timeout: 120 s
* heartbeat.timeout: 120000
* web.timeout: 60000 (also tried larger values like 300,000 or 600,000 without any difference)

I will send you the log files (with DEBUG level) in an email offline. Thanks a lot for your help in advance!

{code:java}
\"errors\":[\"Internal server error.\",\"<Exception on server side:\\norg.apache.flink.util.FlinkRuntimeException: Could not execute application.\\n\\tat org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:81)\\n\\tat org.apache.flink.client.deployment.application.DetachedApplicationRunner.run(DetachedApplicationRunner.java:67)\\n\\tat org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$handleRequest$0(JarRunHandler.java:99)\\n\\tat java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)\\n\\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\\n\\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\\n\\tat java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)\\n\\tat java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)\\n\\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\\n\\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\\n\\tat 
java.lang.Thread.run(Thread.java:748)\\nCaused by: org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Failed to execute job 'my-job-alt'.\\n\\tat org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:302)\\n\\tat org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:198)\\n\\tat org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:149)\\n\\tat org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:78)\\n\\t... 10 more\\nCaused by: org.apache.flink.util.FlinkException: Failed to execute job 'my-job-alt'.\\n\\tat org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1823)\\n\\tat org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:128)\\n\\tat org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:76)\\n\\tat org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1699)\\n\\tat com.foo.bar.application.SpaasBaseApplication.execute(SpaasBaseApplication.java:54)\\n\\tat com.foo.bar.paa.streaming.impressions.ImpressionsJobMain$.main(ImpressionsJobMain.scala:12)\\n\\tat com.foo.bar.paa.streaming.impressions.ImpressionsJobMain.main(ImpressionsJobMain.scala)\\n\\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\\n\\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\\n\\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\\n\\tat java.lang.reflect.Method.invoke(Method.java:498)\\n\\tat org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:288)\\n\\t... 
13 more\\nCaused by: java.util.concurrent.TimeoutException: Invocation of public abstract java.util.concurrent.CompletableFuture org.apache.flink.runtime.dispatcher.DispatcherGateway.submitJob(org.apache.flink.runtime.jobgraph.JobGraph,org.apache.flink.api.common.time.Time) timed out.\\n\\tat com.sun.proxy.$Proxy113.submitJob(Unknown Source)\\n\\tat org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.lambda$submitJob$4(EmbeddedExecutor.java:158)\\n\\tat java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:995)\\n\\tat java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2137)\\n\\tat org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.submitJob(EmbeddedExecutor.java:158)\\n\\tat org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.submitAndGetJobClientFuture(EmbeddedExecutor.java:119)\\n\\tat org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.execute(EmbeddedExecutor.java:98)\\n\\tat org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1812)\\n\\t... 24 more\\nCaused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/rpc/dispatcher_1#-283770831]] after [60000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. 
A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.\\n\\tat akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)\\n\\tat akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)\\n\\tat akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)\\n\\tat akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)\\n\\tat scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)\\n\\tat scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)\\n\\tat scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)\\n\\tat akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)\\n\\tat akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)\\n\\tat akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)\\n\\tat akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)\\n\\t... 1 more\\n\
{code}

was (Author: stevenz3wu):

[~trohrmann] I am seeing a similar problem *when trying unaligned checkpoint with 1.11.0*. The Flink job actually started fine. We didn't see this AskTimeoutException thrown during job submission without unaligned checkpoint (1.10 or 1.11).

Some more context about the app
* a large-state stream join app (a few TBs)
* parallelism 1,440
* number of containers: 180
* Cores per container: 12
* TM_TASK_SLOTS: 8
* akka.ask.timeout: 120 s
* heartbeat.timeout: 120000
* web.timeout: 60000 (also tried larger values like 300,000 or 600,000 without any difference)

I will send you the log files (with DEBUG level) in an email offline. Thanks a lot for your help in advance!

{code:java} \"errors\":[\"Internal server error.\",\"<Exception on server side:\\norg.apache.flink.util.FlinkRuntimeException: Could not execute application.\\n\\tat org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:81)\\n\\tat org.apache.flink.client.deployment.application.DetachedApplicationRunner.run(DetachedApplicationRunner.java:67)\\n\\tat org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$handleRequest$0(JarRunHandler.java:99)\\n\\tat java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)\\n\\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\\n\\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\\n\\tat java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)\\n\\tat java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)\\n\\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\\n\\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\\n\\tat java.lang.Thread.run(Thread.java:748)\\nCaused by: org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Failed to execute job 'personalization-streaming-impressions-alt'.\\n\\tat org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:302)\\n\\tat org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:198)\\n\\tat org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:149)\\n\\tat org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:78)\\n\\t... 
10 more\\nCaused by: org.apache.flink.util.FlinkException: Failed to execute job 'personalization-streaming-impressions-alt'.\\n\\tat org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1823)\\n\\tat org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:128)\\n\\tat org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:76)\\n\\tat org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1699)\\n\\tat com.foo.bar.application.SpaasBaseApplication.execute(SpaasBaseApplication.java:54)\\n\\tat com.foo.bar.paa.streaming.impressions.ImpressionsJobMain$.main(ImpressionsJobMain.scala:12)\\n\\tat com.foo.bar.paa.streaming.impressions.ImpressionsJobMain.main(ImpressionsJobMain.scala)\\n\\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\\n\\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\\n\\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\\n\\tat java.lang.reflect.Method.invoke(Method.java:498)\\n\\tat org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:288)\\n\\t... 
13 more\\nCaused by: java.util.concurrent.TimeoutException: Invocation of public abstract java.util.concurrent.CompletableFuture org.apache.flink.runtime.dispatcher.DispatcherGateway.submitJob(org.apache.flink.runtime.jobgraph.JobGraph,org.apache.flink.api.common.time.Time) timed out.\\n\\tat com.sun.proxy.$Proxy113.submitJob(Unknown Source)\\n\\tat org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.lambda$submitJob$4(EmbeddedExecutor.java:158)\\n\\tat java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:995)\\n\\tat java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2137)\\n\\tat org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.submitJob(EmbeddedExecutor.java:158)\\n\\tat org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.submitAndGetJobClientFuture(EmbeddedExecutor.java:119)\\n\\tat org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.execute(EmbeddedExecutor.java:98)\\n\\tat org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1812)\\n\\t... 24 more\\nCaused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/rpc/dispatcher_1#-283770831]] after [60000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. 
A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.\\n\\tat akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)\\n\\tat akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)\\n\\tat akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)\\n\\tat akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)\\n\\tat scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)\\n\\tat scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)\\n\\tat scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)\\n\\tat akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)\\n\\tat akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)\\n\\tat akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)\\n\\tat akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)\\n\\t... 1 more\\n\
{code}

> AskTimeoutException is thrown during job submission and completion
> ------------------------------------------------------------------
>
>                 Key: FLINK-11143
>                 URL: https://issues.apache.org/jira/browse/FLINK-11143
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.6.2, 1.10.0
>            Reporter: Alex Vinnik
>            Priority: Critical
>         Attachments: flink-job-timeline.PNG
>
>
> For more details please see the thread
> [http://mail-archives.apache.org/mod_mbox/flink-user/201812.mbox/%3cc2fb26f9-1410-4333-80f4-34807481b...@gmail.com%3E]
> On submission
> 2018-12-12 02:28:31 ERROR JobsOverviewHandler:92 - Implementation error:
> Unhandled exception.
> akka.pattern.AskTimeoutException: Ask timed out on
> [Actor[akka://flink/user/dispatcher#225683351]] after [10000 ms].
> Sender[null] sent message of type
> "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
> at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
> at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
> at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
> at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
> at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
> at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
> at java.lang.Thread.run(Thread.java:748)
>
> On completion
>
> {"errors":["Internal server error.","<Exception on server
> side:\njava.util.concurrent.CompletionException:
> akka.pattern.AskTimeoutException: Ask timed out on
> [Actor[akka://flink/user/dispatcher#105638574]] after [10000 ms].
> Sender[null] sent message of type
> \"org.apache.flink.runtime.rpc.messages.LocalFencedMessage\".
> at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
> at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
> at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:772)
> at akka.dispatch.OnComplete.internal(Future.scala:258)
> at akka.dispatch.OnComplete.internal(Future.scala:256)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
> at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
> at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
> at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
> at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
> at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
> at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
> at java.lang.Thread.run(Thread.java:748)\nCaused by:
> akka.pattern.AskTimeoutException: Ask timed out on
> [Actor[akka://flink/user/dispatcher#105638574]] after [10000 ms].
> Sender[null] sent message of type
> \"org.apache.flink.runtime.rpc.messages.LocalFencedMessage\".
> at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)\n\t...
> 9 more\n\nEnd of exception on server side>"]}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)