[ https://issues.apache.org/jira/browse/FLINK-38845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18059427#comment-18059427 ]
Martijn Visser commented on FLINK-38845:
----------------------------------------
[~zhuzh] [~yizh] The GitHub Actions pipeline quite often fails on
PackagedProgramApplicationITCase with:
{code:java}
2026-02-18T13:48:45.7265881Z Feb 18 13:48:45 13:48:45.725 [ERROR] org.apache.flink.client.deployment.application.PackagedProgramApplicationITCase.testDispatcherRecoversAfterLosingAndRegainingLeadership -- Time elapsed: 1.312 s <<< ERROR!
2026-02-18T13:48:45.7269646Z Feb 18 13:48:45 java.lang.IllegalStateException: MiniCluster is not yet running or has already been shut down.
2026-02-18T13:48:45.7270733Z Feb 18 13:48:45 at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193)
2026-02-18T13:48:45.7271885Z Feb 18 13:48:45 at org.apache.flink.runtime.minicluster.MiniCluster.getDispatcherGatewayFuture(MiniCluster.java:1137)
2026-02-18T13:48:45.7273354Z Feb 18 13:48:45 at org.apache.flink.runtime.minicluster.TestingMiniCluster.getDispatcherGatewayFuture(TestingMiniCluster.java:212)
2026-02-18T13:48:45.7274844Z Feb 18 13:48:45 at org.apache.flink.runtime.minicluster.MiniCluster.runDispatcherCommand(MiniCluster.java:993)
2026-02-18T13:48:45.7276007Z Feb 18 13:48:45 at org.apache.flink.runtime.minicluster.MiniCluster.getJobStatus(MiniCluster.java:870)
2026-02-18T13:48:45.7279056Z Feb 18 13:48:45 at org.apache.flink.client.deployment.application.PackagedProgramApplicationITCase.lambda$awaitJobStatus$7(PackagedProgramApplicationITCase.java:265)
2026-02-18T13:48:45.7280774Z Feb 18 13:48:45 at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:152)
2026-02-18T13:48:45.7282454Z Feb 18 13:48:45 at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:146)
2026-02-18T13:48:45.7284301Z Feb 18 13:48:45 at org.apache.flink.client.deployment.application.PackagedProgramApplicationITCase.awaitJobStatus(PackagedProgramApplicationITCase.java:262)
2026-02-18T13:48:45.7288895Z Feb 18 13:48:45 at org.apache.flink.client.deployment.application.PackagedProgramApplicationITCase.testDispatcherRecoversAfterLosingAndRegainingLeadership(PackagedProgramApplicationITCase.java:137)
2026-02-18T13:48:45.7291834Z Feb 18 13:48:45 at java.base/java.lang.reflect.Method.invoke(Method.java:568)
2026-02-18T13:48:45.7292741Z Feb 18 13:48:45 at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
2026-02-18T13:48:45.7293836Z Feb 18 13:48:45 at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
2026-02-18T13:48:45.7294899Z Feb 18 13:48:45 at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
2026-02-18T13:48:45.7295922Z Feb 18 13:48:45 at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
2026-02-18T13:48:45.7297219Z Feb 18 13:48:45 at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
2026-02-18T13:48:45.7297995Z Feb 18 13:48:45
2026-02-18T13:48:46.0684980Z Feb 18 13:48:46 13:48:46.067 [INFO]
2026-02-18T13:48:46.0685925Z Feb 18 13:48:46 13:48:46.067 [INFO] Results:
2026-02-18T13:48:46.0686363Z Feb 18 13:48:46 13:48:46.067 [INFO]
2026-02-18T13:48:46.0687044Z Feb 18 13:48:46 13:48:46.068 [ERROR] Errors:
2026-02-18T13:48:46.0690296Z Feb 18 13:48:46 13:48:46.068 [ERROR] PackagedProgramApplicationITCase.testDispatcherRecoversAfterLosingAndRegainingLeadership:137->awaitJobStatus:262->lambda$awaitJobStatus$7:265 » IllegalState MiniCluster is not yet running or has already been shut down.
2026-02-18T13:48:46.0692413Z Feb 18 13:48:46 13:48:46.068 [INFO]
{code}
Since a refactoring of how dispatcher leadership loss is handled was done as
part of this FLIP, I wonder whether this test is missing a setting.
testSubmitFailedJobOnApplicationError already sets
SHUTDOWN_ON_APPLICATION_FINISH to false, but
testDispatcherRecoversAfterLosingAndRegainingLeadership does not. Could you
take a look?
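
To illustrate the suspicion above, here is a minimal, self-contained sketch of
how a status poll could hit this IllegalStateException if the cluster shuts
itself down once the application finishes. Everything in it is hypothetical:
SketchCluster, ShutdownRaceSketch and their methods are stand-ins written for
this example, not Flink's MiniCluster API, and whether this is the actual
cause of the failure is unconfirmed.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a mini cluster; NOT Flink's MiniCluster.
final class SketchCluster {
    private final boolean shutdownOnApplicationFinish;
    private final AtomicBoolean running = new AtomicBoolean(true);
    private volatile String jobStatus = "RUNNING";

    SketchCluster(boolean shutdownOnApplicationFinish) {
        this.shutdownOnApplicationFinish = shutdownOnApplicationFinish;
    }

    // Invoked when the application reaches a terminal state.
    void onApplicationFinished() {
        jobStatus = "FINISHED";
        if (shutdownOnApplicationFinish) {
            // The cluster tears itself down; later calls will be rejected.
            running.set(false);
        }
    }

    // Mirrors the shape of the failing call: a state check, then the lookup.
    String getJobStatus() {
        if (!running.get()) {
            throw new IllegalStateException(
                    "MiniCluster is not yet running or has already been shut down.");
        }
        return jobStatus;
    }
}

final class ShutdownRaceSketch {
    // Polls the status once after the application has finished and returns
    // either the observed status or the error message.
    static String pollAfterFinish(boolean shutdownOnApplicationFinish) {
        SketchCluster cluster = new SketchCluster(shutdownOnApplicationFinish);
        cluster.onApplicationFinished();
        try {
            return cluster.getJobStatus();
        } catch (IllegalStateException e) {
            return "ERROR: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(pollAfterFinish(true));
        System.out.println(pollAfterFinish(false));
    }
}
```

In this model, the poll only succeeds when the shutdown-on-finish flag is
false, which is the setting testSubmitFailedJobOnApplicationError already
uses according to the comment above.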
> Add ArchivedApplicationStore to manage terminated applications
> --------------------------------------------------------------
>
> Key: FLINK-38845
> URL: https://issues.apache.org/jira/browse/FLINK-38845
> Project: Flink
> Issue Type: Sub-task
> Reporter: Yi Zhang
> Assignee: Yi Zhang
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.3.0
>
>
> Replace ExecutionGraphInfoStore with ArchivedApplicationStore to manage
> terminated applications (rather than individual jobs) and handle their
> expiration.
> With the introduction of applications, every job is now explicitly associated
> with an application. Previously, the {{ExecutionGraphInfoStore}} was used to
> manage and expire completed jobs individually. However, this approach no
> longer works well in the application-centric model.
> If we continue using {{ExecutionGraphInfoStore}} to expire individual
> completed jobs, it’s possible that only some jobs within an application get
> expired and removed, while others remain. This leads to an incomplete view of
> the application’s state, because parts of its job history become unavailable.
> To preserve application-level consistency and completeness, we introduce the
> {{ArchivedApplicationStore}}. Instead of expiring jobs independently,
> this new store manages entire applications (including all their jobs) as a
> whole, ensuring complete, consistent, and queryable application state until
> explicitly discarded.
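
The application-level expiration described in the issue can be sketched as
follows. This is a simplified illustration only: the class name
ArchivedApplicationStoreSketch, its methods, the string IDs and the
time-to-live policy are all assumptions made for this example and do not
reflect the actual ArchivedApplicationStore interface in Flink.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the real ArchivedApplicationStore.
final class ArchivedApplicationStoreSketch {

    // One archived application together with all of its jobs.
    private static final class Entry {
        final List<String> jobIds;
        final long archivedAtMillis;

        Entry(List<String> jobIds, long archivedAtMillis) {
            this.jobIds = new ArrayList<>(jobIds);
            this.archivedAtMillis = archivedAtMillis;
        }
    }

    private final long ttlMillis;
    private final Map<String, Entry> entries = new LinkedHashMap<>();

    ArchivedApplicationStoreSketch(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Archives a terminated application with every job that belongs to it.
    void put(String applicationId, List<String> jobIds, long nowMillis) {
        entries.put(applicationId, new Entry(jobIds, nowMillis));
    }

    // Returns all jobs of the application, or an empty list if it is unknown
    // or has already expired. Jobs are never returned partially.
    List<String> getJobs(String applicationId) {
        Entry entry = entries.get(applicationId);
        return entry == null
                ? Collections.emptyList()
                : Collections.unmodifiableList(entry.jobIds);
    }

    // Drops whole applications whose age has reached the TTL and returns how
    // many applications were removed. Individual jobs are never expired.
    int expire(long nowMillis) {
        int before = entries.size();
        entries.values().removeIf(e -> nowMillis - e.archivedAtMillis >= ttlMillis);
        return before - entries.size();
    }
}
```

Because the application is the unit of expiration here, a lookup either
returns the complete job list or nothing, which is the consistency property
the description asks for.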