[https://issues.apache.org/jira/browse/FLINK-31168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17748254#comment-17748254]

Matthias Pohl edited comment on FLINK-31168 at 7/28/23 6:46 AM:
----------------------------------------------------------------

The error appears in 
[FileExecutionGraphInfoStore#getAvailableJobDetails(JobID)|https://github.com/apache/flink/blob/d78d52b27af2550f50b44349d3ec6dc84b966a8a/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/FileExecutionGraphInfoStore.java#L237], 
where the requested {{JobDetails}} are no longer in the cache. The 
{{FileExecutionGraphInfoStore}} uses two caches:
* {{jobDetailsCache}} holds the {{JobDetails}} of each job. Entries expire 
after 3600s by default (see [jobstore.expiration-time 
default|https://github.com/apache/flink/blob/f3598c50c0d3dcdf8058b01f13b7eb9fc5954f7c/flink-core/src/main/java/org/apache/flink/configuration/JobManagerOptions.java#L340]), 
and the cache is limited to {{Integer.MAX_VALUE}} entries (see 
[jobstore.max-capacity 
default|https://github.com/apache/flink/blob/f3598c50c0d3dcdf8058b01f13b7eb9fc5954f7c/flink-core/src/main/java/org/apache/flink/configuration/JobManagerOptions.java#L350]). 
Removing an entry from this cache triggers the deletion of the corresponding 
file and invalidates the matching entries in both caches. Eviction can happen 
even earlier than these limits suggest, though.
* {{executionGraphInfoCache}} holds the {{ExecutionGraphInfo}} of each job. It 
is limited to 50MB (see [jobstore.cache-size 
default|https://github.com/apache/flink/blob/f3598c50c0d3dcdf8058b01f13b7eb9fc5954f7c/flink-core/src/main/java/org/apache/flink/configuration/JobManagerOptions.java#L331]). 
An {{ExecutionGraphInfo}} that was evicted from this cache is reloaded from 
disk on the next access.
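The two-tier layout described above can be sketched as follows. This is a 
minimal stdlib approximation, not Flink's actual Guava-based implementation: 
all class, method, and field names here are hypothetical, {{String}} values 
stand in for {{JobDetails}}/{{ExecutionGraphInfo}}, and a map stands in for 
the files on disk. The key point it illustrates is that evicting an entry from 
the size-bounded details cache deletes the backing file, after which the job 
can no longer be served at all:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the two caches in FileExecutionGraphInfoStore. */
class TwoTierJobStoreSketch {
    private final Map<String, String> diskFiles = new HashMap<>(); // simulates the files on disk
    private final Map<String, String> executionGraphInfoCache = new HashMap<>();

    // Size-bounded "jobDetailsCache": evicting an entry deletes the backing
    // "file" and invalidates the corresponding executionGraphInfoCache entry.
    private final LinkedHashMap<String, String> jobDetailsCache;

    TwoTierJobStoreSketch(int maxEntries) {
        jobDetailsCache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                if (size() > maxEntries) {
                    diskFiles.remove(eldest.getKey());               // delete the file
                    executionGraphInfoCache.remove(eldest.getKey()); // invalidate
                    return true;
                }
                return false;
            }
        };
    }

    void put(String jobId, String details, String graphInfo) {
        diskFiles.put(jobId, graphInfo);
        jobDetailsCache.put(jobId, details);
        executionGraphInfoCache.put(jobId, graphInfo);
    }

    /** Mirrors getAvailableJobDetails: returns null once the entry was evicted. */
    String getJobDetails(String jobId) {
        return jobDetailsCache.get(jobId);
    }

    /** Mirrors the ExecutionGraphInfo lookup: reloads from "disk" on a cache miss. */
    String getExecutionGraphInfo(String jobId) {
        return executionGraphInfoCache.computeIfAbsent(jobId, diskFiles::get);
    }
}
```

With {{maxEntries = 1}}, storing a second job evicts the first, and any later 
lookup of the first job comes back empty, which corresponds to the "Job ... 
not found" response in the failing test.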

This error can be simulated by reducing the expiration time of the 
{{jobDetailsCache}} or by decreasing its entry limit.
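For such a simulation, the two options cited above could be lowered in 
{{flink-conf.yaml}}; the values below are purely illustrative, chosen to force 
eviction quickly, not recommended settings:

```yaml
# Illustrative values to reproduce early removal from the jobDetailsCache:
jobstore.expiration-time: 1   # seconds; default 3600
jobstore.max-capacity: 1      # entries; default Integer.MAX_VALUE
```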

One suspicion is that the entry gets evicted before the configured limit is 
reached. The [CacheBuilder#maximumSize 
JavaDoc|https://guava.dev/releases/19.0/api/docs/com/google/common/cache/CacheBuilder.html#maximumSize(long)]
 states that this can happen:
{quote}
Note that the cache may evict an entry before this limit is exceeded.
{quote}

We should investigate whether the error started to appear with the 
{{flink-shaded}} update (FLINK-30772, FLINK-32032), which might have included 
a change in how the cache operates. I will continue investigating this 
tomorrow.



> JobManagerHAProcessFailureRecoveryITCase failed due to job not being found
> --------------------------------------------------------------------------
>
>                 Key: FLINK-31168
>                 URL: https://issues.apache.org/jira/browse/FLINK-31168
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.3, 1.16.1, 1.18.0
>            Reporter: Matthias Pohl
>            Assignee: Matthias Pohl
>            Priority: Major
>              Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46342&view=logs&j=b0a398c0-685b-599c-eb57-c8c2a771138e&t=747432ad-a576-5911-1e2a-68c6bedc248a&l=12706
> We see this build failure because a job couldn't be found:
> {code}
> java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
> java.lang.RuntimeException: Error while waiting for job to be initialized
>       at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:319)
>       at 
> org.apache.flink.api.java.ExecutionEnvironment.executeAsync(ExecutionEnvironment.java:1061)
>       at 
> org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:958)
>       at 
> org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:942)
>       at 
> org.apache.flink.test.recovery.JobManagerHAProcessFailureRecoveryITCase.testJobManagerFailure(JobManagerHAProcessFailureRecoveryITCase.java:235)
>       at 
> org.apache.flink.test.recovery.JobManagerHAProcessFailureRecoveryITCase$4.run(JobManagerHAProcessFailureRecoveryITCase.java:336)
> Caused by: java.util.concurrent.ExecutionException: 
> java.lang.RuntimeException: Error while waiting for job to be initialized
>       at 
> java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
>       at 
> java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999)
>       at 
> org.apache.flink.api.java.ExecutionEnvironment.executeAsync(ExecutionEnvironment.java:1056)
>       ... 4 more
> Caused by: java.lang.RuntimeException: Error while waiting for job to be 
> initialized
>       at 
> org.apache.flink.client.ClientUtils.waitUntilJobInitializationFinished(ClientUtils.java:160)
>       at 
> org.apache.flink.client.deployment.executors.AbstractSessionClusterExecutor.lambda$execute$2(AbstractSessionClusterExecutor.java:82)
>       at 
> org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73)
>       at 
> java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:642)
>       at 
> java.base/java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:479)
>       at 
> java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
>       at 
> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
>       at 
> java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
>       at 
> java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
>       at 
> java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.rest.util.RestClientException: 
> [org.apache.flink.runtime.rest.NotFoundException: Job 
> 865dcd87f4828dbeb3d93eb52e2636b1 not found
>       at 
> org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.lambda$handleRequest$1(AbstractExecutionGraphHandler.java:99)
>       at 
> java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
>       at 
> java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
>       at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>       at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>       at 
> org.apache.flink.runtime.rest.handler.legacy.DefaultExecutionGraphCache.lambda$getExecutionGraphInternal$0(DefaultExecutionGraphCache.java:109)
>       at 
> java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
>       at 
> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
>       at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>       at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$1(AkkaInvocationHandler.java:252)
>       at 
> java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
>       at 
> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
>       at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>       at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>       at 
> org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1387)
>       at 
> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$guardCompletionWithContextClassLoader$1(ClassLoadingUtils.java:93)
>       at 
> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
>       at 
> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$guardCompletionWithContextClassLoader$2(ClassLoadingUtils.java:92)
>       at 
> java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
>       at 
> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
>       at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>       at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>       at 
> org.apache.flink.runtime.concurrent.akka.AkkaFutureUtils$1.onComplete(AkkaFutureUtils.java:45)
>       at akka.dispatch.OnComplete.internal(Future.scala:299)
>       at akka.dispatch.OnComplete.internal(Future.scala:297)
>       at akka.dispatch.japi$CallbackBridge.apply(Future.scala:224)
>       at akka.dispatch.japi$CallbackBridge.apply(Future.scala:221)
>       at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>       at 
> org.apache.flink.runtime.concurrent.akka.AkkaFutureUtils$DirectExecutionContext.execute(AkkaFutureUtils.java:65)
>       at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
>       at 
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284)
>       at 
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284)
>       at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284)
>       at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:621)
>       at 
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:25)
>       at 
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:23)
>       at scala.concurrent.Future.$anonfun$andThen$1(Future.scala:532)
>       at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:29)
>       at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
>       at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>       at 
> akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:63)
>       at 
> akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:100)
>       at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
>       at 
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)
>       at 
> akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:100)
>       at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:49)
>       at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:48)
>       at 
> java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
>       at 
> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
>       at 
> java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
>       at 
> java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
>       at 
> java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
> Caused by: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could 
> not find Flink job (865dcd87f4828dbeb3d93eb52e2636b1)
>       at 
> org.apache.flink.runtime.dispatcher.Dispatcher.requestExecutionGraphInfo(Dispatcher.java:840)
>       at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:304)
>       at 
> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:302)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:217)
>       at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:78)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163)
>       at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
>       at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
>       at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
>       at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
>       at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
>       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
>       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
>       at akka.actor.Actor.aroundReceive(Actor.scala:537)
>       at akka.actor.Actor.aroundReceive$(Actor.scala:535)
>       at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
>       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
>       at akka.actor.ActorCell.invoke(ActorCell.scala:548)
>       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
>       at akka.dispatch.Mailbox.run(Mailbox.scala:231)
>       at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
>       ... 5 more
> {code}


