[
https://issues.apache.org/jira/browse/FLINK-40067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated FLINK-40067:
-----------------------------------
Labels: pull-request-available test-stability (was: test-stability)
> RescaleTimelineITCase.testRescaleTerminatedByJobFinished fails due to race
> between cooldown-driven rescale termination and job finish
> -------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-40067
> URL: https://issues.apache.org/jira/browse/FLINK-40067
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 2.4.0
> Reporter: Martijn Visser
> Assignee: Martijn Visser
> Priority: Major
> Labels: pull-request-available, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76571&view=results
> (leg: test_cron_hadoop313_core)
> {noformat}
> 04:56:58.365 [ERROR] org.apache.flink.runtime.scheduler.adaptive.timeline.RescaleTimelineITCase.testRescaleTerminatedByJobFinished -- Time elapsed: 10.85 s <<< ERROR!
> java.util.concurrent.TimeoutException: Condition was not met within 10000 ms.
>     at org.apache.flink.core.testutils.CommonTestUtils.waitUtil(CommonTestUtils.java:218)
>     at org.apache.flink.runtime.scheduler.adaptive.timeline.RescaleTimelineITCase.waitUntilConditionWithTimeout(RescaleTimelineITCase.java:684)
>     at org.apache.flink.runtime.scheduler.adaptive.timeline.RescaleTimelineITCase.testRescaleTerminatedByJobFinished(RescaleTimelineITCase.java:292)
> {noformat}
> The test asserts that the in-progress rescale opened by
> {{updateJobResourceRequirements}} is terminated with {{JOB_FINISHED}}. That
> reason is stamped by {{AdaptiveScheduler#goToFinished}} via
> {{RescaleTimeline#updateRescale}}, which is a no-op once the rescale is
> already terminated ({{DefaultRescaleTimeline#isIdling}}).
> The update's upper bound (PARALLELISM * 2) exceeds the available slots, so
> the rescale cannot change parallelism. With the short cooldown (100 ms)
> shared by the parameterized configuration, {{DefaultStateTransitionManager}}
> re-enters {{Idling}} on a wall-clock timer and the {{Idling}} constructor
> terminates that rescale with {{NO_RESOURCES_OR_PARALLELISMS_CHANGE}}. On a
> loaded machine this happens before the unblocked job finishes, so
> {{goToFinished}} finds the rescale already terminated and the awaited
> condition can never be met. The timeout only became observable after
> FLINK-40009 made the wait helper actually enforce its time budget.
> This is a test-side timing assumption, not a product bug: both terminal
> reasons are legitimate, and once the cooldown path has terminated the
> rescale, no amount of waiting lets the job-finish path win the race. This is
> the same class of race as the sibling fixes in FLINK-39903 and FLINK-40010.
> Proposed fix: rebuild the fixture cluster with a widened cooldown via the
> existing {{rebuildClusterWithExecutingTimeouts}} helper so the in-progress
> rescale outlives the unblock-to-finish window, mirroring FLINK-39903.
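
The race described above can be illustrated with a small toy model. This is a sketch only, not Flink code: the class {{RescaleRaceSketch}}, the nested {{Rescale}} type, the {{TerminatedReason}} enum name, the {{run}} method and the 250 ms unblock-to-finish window are all hypothetical and chosen for illustration. Only the two reason names and the 100 ms cooldown come from the description. The model assumes that the first terminal reason recorded for a rescale sticks and that later updates are ignored.

```java
/** Toy model, not Flink code: the first terminal reason recorded for a rescale sticks. */
public class RescaleRaceSketch {

    /** Hypothetical enum name; the two values are the reasons named in the description. */
    enum TerminatedReason { NO_RESOURCES_OR_PARALLELISMS_CHANGE, JOB_FINISHED }

    /** Hypothetical stand-in for the in-progress rescale tracked by the timeline. */
    static final class Rescale {
        private TerminatedReason reason; // null while the rescale is still in progress

        /** Models the described no-op: updates are ignored once a reason is set. */
        void terminate(TerminatedReason candidate) {
            if (reason == null) {
                reason = candidate;
            }
        }
    }

    /**
     * Replays both terminating events in the order implied by the two durations
     * and returns the reason that ends up recorded.
     */
    static TerminatedReason run(long cooldownMs, long unblockToFinishMs) {
        Rescale rescale = new Rescale();
        if (cooldownMs < unblockToFinishMs) {
            // The cooldown timer fires first, then the job finishes.
            rescale.terminate(TerminatedReason.NO_RESOURCES_OR_PARALLELISMS_CHANGE);
            rescale.terminate(TerminatedReason.JOB_FINISHED); // ignored
        } else {
            // The job finishes first, then the cooldown timer fires.
            rescale.terminate(TerminatedReason.JOB_FINISHED);
            rescale.terminate(TerminatedReason.NO_RESOURCES_OR_PARALLELISMS_CHANGE); // ignored
        }
        return rescale.reason;
    }

    public static void main(String[] args) {
        // Short cooldown (100 ms) against an assumed 250 ms unblock-to-finish window.
        System.out.println(run(100, 250));
        // Widened cooldown, as in the proposed fix (value is illustrative).
        System.out.println(run(60_000, 250));
    }
}
```

Under these assumptions, the short cooldown leaves the rescale terminated with {{NO_RESOURCES_OR_PARALLELISMS_CHANGE}}, so a wait for {{JOB_FINISHED}} cannot succeed however long it runs. Widening the cooldown past the unblock-to-finish window lets {{JOB_FINISHED}} be recorded first.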