[jira] [Commented] (FLINK-38403) UnalignedCheckpointITCase failed in test_cron_hadoop313 tests

Arvid Heise (Jira) Mon, 13 Oct 2025 02:48:09 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-38403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18029450#comment-18029450
 ]


Arvid Heise commented on FLINK-38403:
-------------------------------------

Let me take a look. The error is indeed intended, but I currently don't know 
whether the test or the exception handling is flaky. I'll take a quick tally of 
the last 5 failed tests in reverse order:

Failed variants:
[union with mixed channels, p = 10, timeout = 0]

[multi_input with mixed channels, p = 5, timeout = 0]

[union with mixed channels, p = 5, timeout = 0]

[multi_input with mixed channels, p = 5, timeout = 0]

[union with mixed channels, p = 5, timeout = 0] 

They all share the same issue:

 
{noformat}
Oct 09 04:57:46 org.apache.flink.runtime.client.JobExecutionException: Job 
execution failed.
Oct 09 04:57:46         at 
org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
Oct 09 04:57:46         at 
org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:192)
Oct 09 04:57:46         at 
org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:280)
Oct 09 04:57:46         at 
java.base/java.lang.reflect.Method.invoke(Method.java:568)
Oct 09 04:57:46         at 
org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
Oct 09 04:57:46 Caused by: org.apache.flink.runtime.JobException: Recovery is 
suppressed by FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=5, 
backoffTimeMS=100)
Oct 09 04:57:46         at 
org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:213)
Oct 09 04:57:46         at 
org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.handleFailureAndReport(ExecutionFailureHandler.java:163)
Oct 09 04:57:46         at 
org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:118)
Oct 09 04:57:46         at 
org.apache.flink.runtime.scheduler.DefaultScheduler.recordTaskFailure(DefaultScheduler.java:294)
Oct 09 04:57:46         at 
org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:285)
Oct 09 04:57:46         at 
org.apache.flink.runtime.scheduler.DefaultScheduler.onTaskFailed(DefaultScheduler.java:278)
Oct 09 04:57:46         at 
org.apache.flink.runtime.scheduler.SchedulerBase.onTaskExecutionStateUpdate(SchedulerBase.java:836)
Oct 09 04:57:46         at 
org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:813)
Oct 09 04:57:46         at 
org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:51)
Oct 09 04:57:46         at 
org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(DefaultExecutionGraph.java:1725)
Oct 09 04:57:46         at 
org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1361)
Oct 09 04:57:46         at 
org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1301)
Oct 09 04:57:46         at 
org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:1140)
Oct 09 04:57:46         at 
org.apache.flink.runtime.scheduler.DefaultExecutionOperations.markFailed(DefaultExecutionOperations.java:43)
Oct 09 04:57:46         at 
org.apache.flink.runtime.scheduler.DefaultExecutionDeployer.handleTaskDeploymentFailure(DefaultExecutionDeployer.java:329)
Oct 09 04:57:46         at 
org.apache.flink.runtime.scheduler.DefaultExecutionDeployer.lambda$assignAllResourcesAndRegisterProducedPartitions$2(DefaultExecutionDeployer.java:169)
Oct 09 04:57:46         at 
java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934)
Oct 09 04:57:46         at 
java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
Oct 09 04:57:46         at 
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
Oct 09 04:57:46         at 
java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
Oct 09 04:57:46         at 
org.apache.flink.runtime.jobmaster.slotpool.PendingRequest.failRequest(PendingRequest.java:88)
Oct 09 04:57:46         at 
org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge.cancelPendingRequests(DeclarativeSlotPoolBridge.java:191)
Oct 09 04:57:46         at 
org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge.failPendingRequests(DeclarativeSlotPoolBridge.java:475)
Oct 09 04:57:46         at 
org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge.notifyNotEnoughResourcesAvailable(DeclarativeSlotPoolBridge.java:463)
Oct 09 04:57:46         at 
org.apache.flink.runtime.jobmaster.JobMaster.notifyNotEnoughResourcesAvailable(JobMaster.java:984)
Oct 09 04:57:46         at 
java.base/java.lang.reflect.Method.invoke(Method.java:568)
Oct 09 04:57:46         at 
org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRpcInvocation$0(PekkoRpcActor.java:310)
Oct 09 04:57:46         at 
org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
Oct 09 04:57:46         at 
org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcInvocation(PekkoRpcActor.java:309)
Oct 09 04:57:46         at 
org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcMessage(PekkoRpcActor.java:229)
Oct 09 04:57:46         at 
org.apache.flink.runtime.rpc.pekko.FencedPekkoRpcActor.handleRpcMessage(FencedPekkoRpcActor.java:88)
Oct 09 04:57:46         at 
org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleMessage(PekkoRpcActor.java:174)
Oct 09 04:57:46         at 
org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33)
Oct 09 04:57:46         at 
org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29)
Oct 09 04:57:46         at 
java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:636)
Oct 09 04:57:46         ... 35 more
Oct 09 04:57:46 Caused by: 
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Could not acquire the minimum required resources.
Oct 09 04:57:46 
{noformat}
I suspect that some tasks take longer to clean up and thus not enough slots 
are available.

I'll double-check the test design. I suspect that I used the number of restarts 
to ensure certain behavior, but that limits the robustness of the test, as we 
may require more restarts in edge cases.
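
To make the concern concrete, here is a minimal sketch of the restart budget. 
It is not Flink code: the class RestartBudgetSketch and the method survives are 
hypothetical, and the assumption that the test injects exactly as many failures 
as the budget allows is mine and still needs to be checked against the test. 
Only maxNumberRestartAttempts=5 is taken from the trace above.

```java
/** Hypothetical model of a fixed restart budget; not Flink's implementation. */
public class RestartBudgetSketch {

    /**
     * Returns true if the job can recover from every failure, i.e. the total
     * number of failures does not exceed the allowed restart attempts.
     */
    public static boolean survives(
            int maxRestartAttempts, int plannedFailures, int incidentalFailures) {
        int restartsUsed = 0;
        int totalFailures = plannedFailures + incidentalFailures;
        for (int failure = 0; failure < totalFailures; failure++) {
            restartsUsed++;
            if (restartsUsed > maxRestartAttempts) {
                // Budget exhausted: recovery is suppressed and the job fails.
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int maxRestartAttempts = 5; // value reported in the trace above
        int plannedFailures = 5;    // assumption: the test uses the full budget

        // No incidental failure: every planned failure is recovered.
        System.out.println(survives(maxRestartAttempts, plannedFailures, 0));

        // One incidental failure (e.g. slots not yet released after a restart)
        // pushes the job over the budget.
        System.out.println(survives(maxRestartAttempts, plannedFailures, 1));
    }
}
```

If the budget really is sized to the planned failures, a single extra 
NoResourceAvailableException is enough to fail the job, which would match the 
failures listed above.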

 

> UnalignedCheckpointITCase failed in test_cron_hadoop313 tests
> -------------------------------------------------------------
>
>                 Key: FLINK-38403
>                 URL: https://issues.apache.org/jira/browse/FLINK-38403
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 2.2.0
>            Reporter: Ruan Hang
>            Priority: Major
>
> Details:
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=69810&view=logs&j=baf26b34-3c6a-54e8-f93f-cf269b32f802&t=b380e762-00fc-5c06-e76c-b8e53634ca34



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
