[ 
https://issues.apache.org/jira/browse/FLINK-38845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18059427#comment-18059427
 ] 

Martijn Visser commented on FLINK-38845:
----------------------------------------

[~zhuzh][~yizh] The GitHub Actions pipeline fails quite often on 
PackagedProgramApplicationITCase with:

{code:java}
2026-02-18T13:48:45.7265881Z Feb 18 13:48:45 13:48:45.725 [ERROR] org.apache.flink.client.deployment.application.PackagedProgramApplicationITCase.testDispatcherRecoversAfterLosingAndRegainingLeadership -- Time elapsed: 1.312 s <<< ERROR!
2026-02-18T13:48:45.7269646Z Feb 18 13:48:45 java.lang.IllegalStateException: MiniCluster is not yet running or has already been shut down.
2026-02-18T13:48:45.7270733Z Feb 18 13:48:45     at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193)
2026-02-18T13:48:45.7271885Z Feb 18 13:48:45     at org.apache.flink.runtime.minicluster.MiniCluster.getDispatcherGatewayFuture(MiniCluster.java:1137)
2026-02-18T13:48:45.7273354Z Feb 18 13:48:45     at org.apache.flink.runtime.minicluster.TestingMiniCluster.getDispatcherGatewayFuture(TestingMiniCluster.java:212)
2026-02-18T13:48:45.7274844Z Feb 18 13:48:45     at org.apache.flink.runtime.minicluster.MiniCluster.runDispatcherCommand(MiniCluster.java:993)
2026-02-18T13:48:45.7276007Z Feb 18 13:48:45     at org.apache.flink.runtime.minicluster.MiniCluster.getJobStatus(MiniCluster.java:870)
2026-02-18T13:48:45.7279056Z Feb 18 13:48:45     at org.apache.flink.client.deployment.application.PackagedProgramApplicationITCase.lambda$awaitJobStatus$7(PackagedProgramApplicationITCase.java:265)
2026-02-18T13:48:45.7280774Z Feb 18 13:48:45     at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:152)
2026-02-18T13:48:45.7282454Z Feb 18 13:48:45     at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:146)
2026-02-18T13:48:45.7284301Z Feb 18 13:48:45     at org.apache.flink.client.deployment.application.PackagedProgramApplicationITCase.awaitJobStatus(PackagedProgramApplicationITCase.java:262)
2026-02-18T13:48:45.7288895Z Feb 18 13:48:45     at org.apache.flink.client.deployment.application.PackagedProgramApplicationITCase.testDispatcherRecoversAfterLosingAndRegainingLeadership(PackagedProgramApplicationITCase.java:137)
2026-02-18T13:48:45.7291834Z Feb 18 13:48:45     at java.base/java.lang.reflect.Method.invoke(Method.java:568)
2026-02-18T13:48:45.7292741Z Feb 18 13:48:45     at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
2026-02-18T13:48:45.7293836Z Feb 18 13:48:45     at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
2026-02-18T13:48:45.7294899Z Feb 18 13:48:45     at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
2026-02-18T13:48:45.7295922Z Feb 18 13:48:45     at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
2026-02-18T13:48:45.7297219Z Feb 18 13:48:45     at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
2026-02-18T13:48:45.7297995Z Feb 18 13:48:45
2026-02-18T13:48:46.0684980Z Feb 18 13:48:46 13:48:46.067 [INFO]
2026-02-18T13:48:46.0685925Z Feb 18 13:48:46 13:48:46.067 [INFO] Results:
2026-02-18T13:48:46.0686363Z Feb 18 13:48:46 13:48:46.067 [INFO]
2026-02-18T13:48:46.0687044Z Feb 18 13:48:46 13:48:46.068 [ERROR] Errors:
2026-02-18T13:48:46.0690296Z Feb 18 13:48:46 13:48:46.068 [ERROR]   PackagedProgramApplicationITCase.testDispatcherRecoversAfterLosingAndRegainingLeadership:137->awaitJobStatus:262->lambda$awaitJobStatus$7:265 » IllegalState MiniCluster is not yet running or has already been shut down.
2026-02-18T13:48:46.0692413Z Feb 18 13:48:46 13:48:46.068 [INFO]
{code}

Since this FLIP refactored how dispatcher leadership loss is handled, I 
wonder whether this test is missing something. 
testSubmitFailedJobOnApplicationError already sets 
SHUTDOWN_ON_APPLICATION_FINISH to false, but 
testDispatcherRecoversAfterLosingAndRegainingLeadership does not. Could you 
take a look? 
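
To make the suspicion above concrete, here is a toy model in plain Java. It is 
not Flink code: ToyCluster, finishApplication and pollStatus are hypothetical 
names, and whether this is the real cause of the failure is unconfirmed. It 
only shows how a status poll could hit the same IllegalStateException if the 
cluster shuts itself down once the application finishes.

```java
// Toy model only: none of these classes are Flink APIs.
class ShutdownRaceSketch {

    /** Hypothetical stand-in for a cluster that may stop when its application finishes. */
    static class ToyCluster {
        private final boolean shutdownOnApplicationFinish;
        private boolean running = true;
        private String jobStatus = "RUNNING";

        ToyCluster(boolean shutdownOnApplicationFinish) {
            this.shutdownOnApplicationFinish = shutdownOnApplicationFinish;
        }

        /** Marks the job as finished and, if the flag is set, stops the cluster. */
        void finishApplication() {
            jobStatus = "FINISHED";
            if (shutdownOnApplicationFinish) {
                running = false;
            }
        }

        /** Mirrors the precondition seen in the stack trace: fails once the cluster is stopped. */
        String getJobStatus() {
            if (!running) {
                throw new IllegalStateException(
                        "MiniCluster is not yet running or has already been shut down.");
            }
            return jobStatus;
        }
    }

    /** Finishes the application, then polls the status once, as a test helper might. */
    static String pollStatus(boolean shutdownOnApplicationFinish) {
        ToyCluster cluster = new ToyCluster(shutdownOnApplicationFinish);
        cluster.finishApplication();
        try {
            return cluster.getJobStatus();
        } catch (IllegalStateException e) {
            return "IllegalStateException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("flag=false -> " + pollStatus(false));
        System.out.println("flag=true  -> " + pollStatus(true));
    }
}
```

In this model the poll only succeeds when the flag is false, which is why the 
missing SHUTDOWN_ON_APPLICATION_FINISH setting looks worth checking in the 
failing test.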

> Add ArchivedApplicationStore to manage terminated applications
> --------------------------------------------------------------
>
>                 Key: FLINK-38845
>                 URL: https://issues.apache.org/jira/browse/FLINK-38845
>             Project: Flink
>          Issue Type: Sub-task
>            Reporter: Yi Zhang
>            Assignee: Yi Zhang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.3.0
>
>
> Replace ExecutionGraphInfoStore with ArchivedApplicationStore to manage 
> terminated applications (rather than individual jobs) and handle their 
> expiration.
> With the introduction of applications, every job is now explicitly associated 
> with an application. Previously, the {{ExecutionGraphInfoStore}} was used to 
> manage and expire completed jobs individually. However, this approach no 
> longer works well in the application-centric model.
> If we continue using {{ExecutionGraphInfoStore}} to expire individual 
> completed jobs, it’s possible that only some jobs within an application get 
> expired and removed, while others remain. This leads to an incomplete view of 
> the application’s state, because parts of its job history become unavailable.
> To preserve application-level consistency and completeness, we introduce the 
> {{ArchivedApplicationStore}}. Instead of expiring jobs independently, 
> this new store manages entire applications (including all their jobs) as a 
> whole, ensuring complete, consistent, and queryable application state until 
> explicitly discarded.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
