Martijn Visser created FLINK-40067:
--------------------------------------
Summary: RescaleTimelineITCase.testRescaleTerminatedByJobFinished
fails due to race between cooldown-driven rescale termination and job finish
Key: FLINK-40067
URL: https://issues.apache.org/jira/browse/FLINK-40067
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 2.4.0
Reporter: Martijn Visser
Assignee: Martijn Visser
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76571&view=results
(leg: test_cron_hadoop313_core)
{noformat}
04:56:58.365 [ERROR] org.apache.flink.runtime.scheduler.adaptive.timeline.RescaleTimelineITCase.testRescaleTerminatedByJobFinished -- Time elapsed: 10.85 s <<< ERROR!
java.util.concurrent.TimeoutException: Condition was not met within 10000 ms.
	at org.apache.flink.core.testutils.CommonTestUtils.waitUtil(CommonTestUtils.java:218)
	at org.apache.flink.runtime.scheduler.adaptive.timeline.RescaleTimelineITCase.waitUntilConditionWithTimeout(RescaleTimelineITCase.java:684)
	at org.apache.flink.runtime.scheduler.adaptive.timeline.RescaleTimelineITCase.testRescaleTerminatedByJobFinished(RescaleTimelineITCase.java:292)
{noformat}
The test asserts that the in-progress rescale opened by
{{updateJobResourceRequirements}} is terminated with {{JOB_FINISHED}}. That
reason is stamped by {{AdaptiveScheduler#goToFinished}} via
{{RescaleTimeline#updateRescale}}, which is a no-op once the rescale is already
terminated ({{DefaultRescaleTimeline#isIdling}}).
The update's upper bound (PARALLELISM * 2) exceeds the available slots, so the
rescale cannot change parallelism. With the short cooldown (100 ms) shared by
the parameterized configuration, {{DefaultStateTransitionManager}} re-enters
{{Idling}} on a wall-clock timer and the {{Idling}} constructor terminates that
rescale with {{NO_RESOURCES_OR_PARALLELISMS_CHANGE}}. On a loaded machine this
happens before the unblocked job finishes, so {{goToFinished}} finds the
rescale already terminated and the awaited condition can never be met. The
timeout only became observable after FLINK-40009 made the wait helper's budget
real.
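The "first terminal reason wins" behaviour described above can be illustrated with a minimal sketch. This is a toy model, not Flink code: {{ToyRescale}} and its {{terminate}} method are hypothetical stand-ins, and only the two reason names are taken from this report.

```java
import java.util.concurrent.atomic.AtomicReference;

public class ToyRescaleRace {
    // Reason names mirror the ones quoted in this issue; the enum itself is hypothetical.
    enum TerminalReason { NO_RESOURCES_OR_PARALLELISMS_CHANGE, JOB_FINISHED }

    /** Hypothetical stand-in for an in-progress rescale; not a Flink class. */
    static final class ToyRescale {
        private final AtomicReference<TerminalReason> reason = new AtomicReference<>();

        /** Records the reason only while the rescale is still open; later calls are no-ops. */
        boolean terminate(TerminalReason r) {
            return reason.compareAndSet(null, r);
        }

        TerminalReason reason() {
            return reason.get();
        }
    }

    /** Ordering seen on the loaded CI machine: the cooldown path terminates first. */
    static TerminalReason cooldownFiresFirst() {
        ToyRescale rescale = new ToyRescale();
        rescale.terminate(TerminalReason.NO_RESOURCES_OR_PARALLELISMS_CHANGE);
        rescale.terminate(TerminalReason.JOB_FINISHED); // ignored: already terminated
        return rescale.reason();
    }

    /** Ordering the test assumes: the job finishes while the rescale is still open. */
    static TerminalReason jobFinishesFirst() {
        ToyRescale rescale = new ToyRescale();
        rescale.terminate(TerminalReason.JOB_FINISHED);
        rescale.terminate(TerminalReason.NO_RESOURCES_OR_PARALLELISMS_CHANGE); // ignored
        return rescale.reason();
    }

    public static void main(String[] args) {
        System.out.println(cooldownFiresFirst());
        System.out.println(jobFinishesFirst());
    }
}
```

In this model, once the cooldown path has stamped its reason, a condition that waits for {{JOB_FINISHED}} can never become true, however long it polls.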
This is a test-side timing assumption, not a product bug: both terminal reasons
are legitimate, and once the cooldown has terminated the rescale, no amount of
waiting lets the job finish win the race. It is the same class of race as the
one addressed by the sibling fixes in FLINK-39903 and FLINK-40010.
Proposed fix: rebuild the fixture cluster with a widened cooldown via the
existing {{rebuildClusterWithExecutingTimeouts}} helper so the in-progress
rescale outlives the unblock-to-finish window, mirroring FLINK-39903.
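Why a wider cooldown removes the flakiness can be sketched in virtual time. This is a simplification under stated assumptions: both delays are measured from the moment the rescale is opened, the 250 ms finish delay and the 60 s widened cooldown are hypothetical illustration values (only the 100 ms cooldown comes from this report), and a tie is arbitrarily given to the cooldown. It does not call the real {{rebuildClusterWithExecutingTimeouts}} helper.

```java
public class ToyCooldownWindow {
    /**
     * Returns the reason that would be stamped on the rescale in this toy model:
     * whichever of the two events happens first terminates the rescale.
     */
    static String terminalReason(long cooldownMs, long unblockToFinishMs) {
        // Assumption: a tie goes to the cooldown path.
        return unblockToFinishMs < cooldownMs
                ? "JOB_FINISHED"
                : "NO_RESOURCES_OR_PARALLELISMS_CHANGE";
    }

    public static void main(String[] args) {
        long slowFinishMs = 250L; // hypothetical unblock-to-finish delay on a loaded machine

        // Short cooldown from the shared parameterized configuration (100 ms).
        System.out.println(terminalReason(100L, slowFinishMs));

        // Hypothetical widened cooldown that outlives the unblock-to-finish window.
        System.out.println(terminalReason(60_000L, slowFinishMs));
    }
}
```

Under these assumptions the short cooldown loses the race as soon as the finish is slower than 100 ms, while the widened cooldown keeps the rescale open long enough for the job finish to stamp {{JOB_FINISHED}}.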