Re: Job Cancellation Failing
I noticed a test instability that sounds quite similar to what you're experiencing. I created FLINK-31168 [1] to follow-up on this one. [1] https://issues.apache.org/jira/browse/FLINK-31168 On Mon, Feb 20, 2023 at 4:50 PM Matthias Pohl wrote: > What do you mean by "earlier it used to fail due to ExecutionGraphStore > not existing in /tmp" folder? Did you get the error message "Could not > create executionGraphStorage directory in /tmp." and creating this folder > fixed the issue? > > It also looks like the stacktrace doesn't match any of the 1.15 versions > in terms of line numbers. Or I might miss something here. Could you provide > the exact Flink version you're using? > > I might also help to share the JobManager logs to understand the context > in which the cancel operation was triggered. > > Matthias > > On Mon, Feb 20, 2023 at 1:53 AM Puneet Duggal > wrote: > >> Flink Cluster Context: >> >> >>- Flink Version - 1.15 >>- Deployment Mode - Session >>- Number of Job Managers - 3 (HA) >>- Number of Task Managers - 1 >> >> >> Cancellation of Job fails due to following >> >> org.apache.flink.runtime.rest.NotFoundException: Job >> 1cb2185d4d72c8c6f0a3a549d7de4ef0 not found >> at >> org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.lambda$handleRequest$1(AbstractExecutionGraphHandler.java:99) >> at >> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) >> at >> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) >> at >> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) >> at >> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) >> at >> org.apache.flink.runtime.rest.handler.legacy.DefaultExecutionGraphCache.lambda$getExecutionGraphInternal$0(DefaultExecutionGraphCache.java:109) >> at >> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) >> at >> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) >> at >> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) >> at >> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) >> at >> org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$1(AkkaInvocationHandler.java:252) >> at >> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) >> at >> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) >> at >> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) >> at >> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) >> at >> org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1387) >> at >> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$null$1(ClassLoadingUtils.java:93) >> at >> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) >> at >> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$guardCompletionWithContextClassLoader$2(ClassLoadingUtils.java:92) >> at >> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) >> at >> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) >> at >> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) >> at >> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) >> at >> org.apache.flink.runtime.concurrent.akka.AkkaFutureUtils$1.onComplete(AkkaFutureUtils.java:45) >> at akka.dispatch.OnComplete.internal(Future.scala:299) >> at akka.dispatch.OnComplete.internal(Future.scala:297) >> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:224) >> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:221) >> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) >> at >> org.apache.flink.runtime.concurrent.akka.AkkaFutureUtils$DirectExecutionContext.execute(AkkaFutureUtils.java:65) >> at >> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68) >> at >> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284) >> at >> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284) >> at >> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284) >> at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:621) >> at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:118) >> at >> akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:1144) >> at akka.actor.Actor.aroundReceive(Actor.scala:537) >> at akka.actor.Actor.aroundReceive$(Actor.scala:535) >> at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:540) >> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580) >> at
Re: Job Cancellation Failing
What do you mean by "earlier it used to fail due to ExecutionGraphStore not existing in /tmp" folder? Did you get the error message "Could not create executionGraphStorage directory in /tmp." and creating this folder fixed the issue? It also looks like the stacktrace doesn't match any of the 1.15 versions in terms of line numbers. Or I might miss something here. Could you provide the exact Flink version you're using? I might also help to share the JobManager logs to understand the context in which the cancel operation was triggered. Matthias On Mon, Feb 20, 2023 at 1:53 AM Puneet Duggal wrote: > Flink Cluster Context: > > >- Flink Version - 1.15 >- Deployment Mode - Session >- Number of Job Managers - 3 (HA) >- Number of Task Managers - 1 > > > Cancellation of Job fails due to following > > org.apache.flink.runtime.rest.NotFoundException: Job > 1cb2185d4d72c8c6f0a3a549d7de4ef0 not found > at > org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.lambda$handleRequest$1(AbstractExecutionGraphHandler.java:99) > at > java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) > at > java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > at > org.apache.flink.runtime.rest.handler.legacy.DefaultExecutionGraphCache.lambda$getExecutionGraphInternal$0(DefaultExecutionGraphCache.java:109) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > at > org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$1(AkkaInvocationHandler.java:252) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > at > org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1387) > at > org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$null$1(ClassLoadingUtils.java:93) > at > org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) > at > org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$guardCompletionWithContextClassLoader$2(ClassLoadingUtils.java:92) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > at > org.apache.flink.runtime.concurrent.akka.AkkaFutureUtils$1.onComplete(AkkaFutureUtils.java:45) > at akka.dispatch.OnComplete.internal(Future.scala:299) > at akka.dispatch.OnComplete.internal(Future.scala:297) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:224) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:221) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) > at > org.apache.flink.runtime.concurrent.akka.AkkaFutureUtils$DirectExecutionContext.execute(AkkaFutureUtils.java:65) > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68) > at > scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284) > at > scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284) > at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:621) > at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:118) > at > akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:1144) > at akka.actor.Actor.aroundReceive(Actor.scala:537) > at akka.actor.Actor.aroundReceive$(Actor.scala:535) > at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:540) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580) > at akka.actor.ActorCell.invoke(ActorCell.scala:548) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270) > at akka.dispatch.Mailbox.run(Mailbox.scala:231) > at akka.dispatch.Mailbox.exec(Mailbox.scala:243) > at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) > at > java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) > at
Job Cancellation Failing
Flink Cluster Context: Flink Version - 1.15 Deployment Mode - Session Number of Job Managers - 3 (HA) Number of Task Managers - 1 Cancellation of Job fails due to following org.apache.flink.runtime.rest.NotFoundException: Job 1cb2185d4d72c8c6f0a3a549d7de4ef0 not found at org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.lambda$handleRequest$1(AbstractExecutionGraphHandler.java:99) at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) at org.apache.flink.runtime.rest.handler.legacy.DefaultExecutionGraphCache.lambda$getExecutionGraphInternal$0(DefaultExecutionGraphCache.java:109) at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$1(AkkaInvocationHandler.java:252) at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) at org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1387) at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$null$1(ClassLoadingUtils.java:93) at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$guardCompletionWithContextClassLoader$2(ClassLoadingUtils.java:92) at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) at org.apache.flink.runtime.concurrent.akka.AkkaFutureUtils$1.onComplete(AkkaFutureUtils.java:45) at akka.dispatch.OnComplete.internal(Future.scala:299) at akka.dispatch.OnComplete.internal(Future.scala:297) at akka.dispatch.japi$CallbackBridge.apply(Future.scala:224) at akka.dispatch.japi$CallbackBridge.apply(Future.scala:221) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) at org.apache.flink.runtime.concurrent.akka.AkkaFutureUtils$DirectExecutionContext.execute(AkkaFutureUtils.java:65) at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68) at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284) at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284) at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284) at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:621) at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:118) at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:1144) at akka.actor.Actor.aroundReceive(Actor.scala:537) at akka.actor.Actor.aroundReceive$(Actor.scala:535) at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:540) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580) at akka.actor.ActorCell.invoke(ActorCell.scala:548) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270) at akka.dispatch.Mailbox.run(Mailbox.scala:231) at akka.dispatch.Mailbox.exec(Mailbox.scala:243) at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) Caused by: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find Flink job (1cb2185d4d72c8c6f0a3a549d7de4ef0) at org.apache.flink.runtime.dispatcher.Dispatcher.requestExecutionGraphInfo(Dispatcher.java:812) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:304) at