Martijn Visser created FLINK-40067:
--------------------------------------

             Summary: RescaleTimelineITCase.testRescaleTerminatedByJobFinished 
fails due to race between cooldown-driven rescale termination and job finish
                 Key: FLINK-40067
                 URL: https://issues.apache.org/jira/browse/FLINK-40067
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 2.4.0
            Reporter: Martijn Visser
            Assignee: Martijn Visser


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76571&view=results
 (leg: test_cron_hadoop313_core)

{noformat}
04:56:58.365 [ERROR] org.apache.flink.runtime.scheduler.adaptive.timeline.RescaleTimelineITCase.testRescaleTerminatedByJobFinished -- Time elapsed: 10.85 s <<< ERROR!
java.util.concurrent.TimeoutException: Condition was not met within 10000 ms.
      at org.apache.flink.core.testutils.CommonTestUtils.waitUtil(CommonTestUtils.java:218)
      at org.apache.flink.runtime.scheduler.adaptive.timeline.RescaleTimelineITCase.waitUntilConditionWithTimeout(RescaleTimelineITCase.java:684)
      at org.apache.flink.runtime.scheduler.adaptive.timeline.RescaleTimelineITCase.testRescaleTerminatedByJobFinished(RescaleTimelineITCase.java:292)
{noformat}

The test asserts that the in-progress rescale opened by 
{{updateJobResourceRequirements}} is terminated with {{JOB_FINISHED}}. That 
reason is stamped by {{AdaptiveScheduler#goToFinished}} via 
{{RescaleTimeline#updateRescale}}, which is a no-op once the rescale is already 
terminated ({{DefaultRescaleTimeline#isIdling}}).

The update's upper bound (PARALLELISM * 2) exceeds the available slots, so the 
rescale cannot change parallelism. With the short cooldown (100 ms) shared by 
the parameterized configuration, {{DefaultStateTransitionManager}} re-enters 
{{Idling}} on a wall-clock timer and the {{Idling}} constructor terminates that 
rescale with {{NO_RESOURCES_OR_PARALLELISMS_CHANGE}}. On a loaded machine this 
happens before the unblocked job finishes, so {{goToFinished}} finds the 
rescale already terminated and the awaited condition can never be met. The 
timeout only became observable after FLINK-40009 made the wait helper actually 
enforce its timeout budget.
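
The mechanism can be illustrated with a toy model. This is a minimal sketch, 
not Flink code: {{ToyRescale}}, {{Reason}} and {{waitUntil}} are hypothetical 
stand-ins for the timeline, the terminal reasons and the test's wait helper, 
and the 200 ms budget is shortened from the test's 10000 ms for illustration. 
It only assumes what is described above, namely that stamping a reason is a 
no-op once the rescale is terminated.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Toy model, not Flink code: every name in this file is hypothetical. */
final class TerminatedRescaleSketch {

    enum Reason { NO_RESOURCES_OR_PARALLELISMS_CHANGE, JOB_FINISHED }

    /** Stand-in for the timeline: stamping a reason is a no-op once one is set. */
    static final class ToyRescale {
        private volatile Reason reason; // null while the rescale is in progress

        // Single-threaded use only; the check-then-set is not atomic.
        void terminate(Reason candidate) {
            if (reason == null) {
                reason = candidate;
            }
        }

        Reason reason() {
            return reason;
        }
    }

    /** Polls until the condition holds or the timeout budget is spent. */
    static void waitUntil(BooleanSupplier condition, long timeoutMs)
            throws TimeoutException, InterruptedException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException(
                        "Condition was not met within " + timeoutMs + " ms.");
            }
            Thread.sleep(10);
        }
    }

    public static void main(String[] args) throws Exception {
        ToyRescale rescale = new ToyRescale();
        // The cooldown timer fires first and terminates the rescale.
        rescale.terminate(Reason.NO_RESOURCES_OR_PARALLELISMS_CHANGE);
        // The job finishes afterwards; this update is ignored.
        rescale.terminate(Reason.JOB_FINISHED);
        try {
            waitUntil(() -> rescale.reason() == Reason.JOB_FINISHED, 200);
        } catch (TimeoutException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Once the first reason is recorded, the polled condition is permanently false, 
so a larger wait budget in the sketch would only delay the same timeout.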

This is a test-side timing assumption, not a product bug: both terminal reasons 
are legitimate, and once the cooldown has terminated the rescale, no amount of 
waiting in the test lets the job finish win the race. It is the same class of 
race as the one addressed by the sibling fixes in FLINK-39903 and FLINK-40010.

Proposed fix: rebuild the fixture cluster with a widened cooldown via the 
existing {{rebuildClusterWithExecutingTimeouts}} helper so the in-progress 
rescale outlives the unblock-to-finish window, mirroring FLINK-39903.
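
Why a wider cooldown removes the race can be sketched as a simple ordering of 
two events. This is an illustrative model only: {{CooldownWindowSketch}} and 
{{firstTerminalReason}} are hypothetical names, the 250 ms unblock-to-finish 
window and the 60000 ms widened cooldown are made-up values, and only the 100 
ms cooldown comes from the test configuration described above. The sketch does 
not call {{rebuildClusterWithExecutingTimeouts}}.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy model, not Flink code: orders the two terminating events by time. */
final class CooldownWindowSketch {

    /** A terminating event and the time (ms after the unblock) at which it fires. */
    static final class Event {
        final long atMs;
        final String reason;

        Event(long atMs, String reason) {
            this.atMs = atMs;
            this.reason = reason;
        }
    }

    /** Returns the reason of whichever event reaches the in-progress rescale first. */
    static String firstTerminalReason(long cooldownMs, long unblockToFinishMs) {
        List<Event> events = new ArrayList<>();
        events.add(new Event(cooldownMs, "NO_RESOURCES_OR_PARALLELISMS_CHANGE"));
        events.add(new Event(unblockToFinishMs, "JOB_FINISHED"));
        // The earliest event terminates the rescale; the later one is a no-op.
        events.sort(Comparator.comparingLong((Event e) -> e.atMs));
        return events.get(0).reason;
    }

    public static void main(String[] args) {
        // Short cooldown, slow machine: the cooldown timer wins.
        System.out.println(firstTerminalReason(100, 250));
        // Widened cooldown (hypothetical value): the job finish wins.
        System.out.println(firstTerminalReason(60_000, 250));
    }
}
```

In this model the outcome depends only on which interval is shorter, which is 
why the fix widens the cooldown instead of lengthening the wait.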



