[ 
https://issues.apache.org/jira/browse/FLINK-40067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-40067:
-----------------------------------
    Labels: pull-request-available test-stability  (was: test-stability)

> RescaleTimelineITCase.testRescaleTerminatedByJobFinished fails due to race 
> between cooldown-driven rescale termination and job finish
> -------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-40067
>                 URL: https://issues.apache.org/jira/browse/FLINK-40067
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 2.4.0
>            Reporter: Martijn Visser
>            Assignee: Martijn Visser
>            Priority: Major
>              Labels: pull-request-available, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76571&view=results
>  (leg: test_cron_hadoop313_core)
> {noformat}
> 04:56:58.365 [ERROR] org.apache.flink.runtime.scheduler.adaptive.timeline.RescaleTimelineITCase.testRescaleTerminatedByJobFinished -- Time elapsed: 10.85 s <<< ERROR!
> java.util.concurrent.TimeoutException: Condition was not met within 10000 ms.
>       at org.apache.flink.core.testutils.CommonTestUtils.waitUtil(CommonTestUtils.java:218)
>       at org.apache.flink.runtime.scheduler.adaptive.timeline.RescaleTimelineITCase.waitUntilConditionWithTimeout(RescaleTimelineITCase.java:684)
>       at org.apache.flink.runtime.scheduler.adaptive.timeline.RescaleTimelineITCase.testRescaleTerminatedByJobFinished(RescaleTimelineITCase.java:292)
> {noformat}
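
For readers unfamiliar with this failure shape, here is a minimal sketch of a polling wait helper that enforces a deadline and fails with the same message as the trace above. It is not the code of CommonTestUtils.waitUtil; the class name WaitSketch, the method name waitUntil, and the 10 ms poll interval are assumptions made for illustration only.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper; NOT the Flink implementation. */
public class WaitSketch {

    /** Polls the condition until it holds or the timeout elapses. */
    public static void waitUntil(BooleanSupplier condition, Duration timeout)
            throws TimeoutException, InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            // Overflow-safe deadline check on the monotonic clock.
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException(
                        "Condition was not met within " + timeout.toMillis() + " ms.");
            }
            Thread.sleep(10); // assumed poll interval
        }
    }

    public static void main(String[] args) throws Exception {
        // A condition that is already true returns immediately.
        waitUntil(() -> true, Duration.ofMillis(50));
        try {
            // A condition that can never become true always hits the deadline.
            waitUntil(() -> false, Duration.ofMillis(50));
        } catch (TimeoutException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A condition that can never become true fails this way regardless of how large the timeout is, which is why raising the wait budget would not help here.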
> The test asserts that the in-progress rescale opened by 
> {{updateJobResourceRequirements}} is terminated with {{JOB_FINISHED}}. That 
> reason is stamped by {{AdaptiveScheduler#goToFinished}} via 
> {{RescaleTimeline#updateRescale}}, which is a no-op once the rescale is 
> already terminated ({{DefaultRescaleTimeline#isIdling}}).
> The update's upper bound (PARALLELISM * 2) exceeds the available slots, so 
> the rescale cannot change parallelism. With the short cooldown (100 ms) 
> shared by the parameterized configuration, {{DefaultStateTransitionManager}} 
> re-enters {{Idling}} on a wall-clock timer and the {{Idling}} constructor 
> terminates that rescale with {{NO_RESOURCES_OR_PARALLELISMS_CHANGE}}. On a 
> loaded machine this happens before the unblocked job finishes, so 
> {{goToFinished}} finds the rescale already terminated and the awaited 
> condition can never be met. The timeout only became observable after 
> FLINK-40009 made the wait helper actually enforce its time budget.
> This is a test-side timing assumption, not a product bug: both terminal 
> reasons are legitimate, and no amount of waiting lets the job finish win the 
> race. This is the same class of race addressed by the sibling fixes in 
> FLINK-39903 and FLINK-40010.
> Proposed fix: rebuild the fixture cluster with a widened cooldown via the 
> existing {{rebuildClusterWithExecutingTimeouts}} helper so the in-progress 
> rescale outlives the unblock-to-finish window, mirroring FLINK-39903.
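
The race and the proposed fix can be illustrated with a toy model in virtual time. This is a sketch under stated assumptions, not Flink code: the class RescaleRaceSketch and everything inside it are hypothetical, the cooldown timer is simplified to fire exactly once at cooldownMs after the rescale opens, and the 250 ms finish time and 60000 ms widened cooldown are made-up values. Only the 100 ms cooldown and the two reason names come from the issue.

```java
/** Toy model of the race described above; NOT Flink code, all names hypothetical. */
public class RescaleRaceSketch {

    public enum TerminalReason { NO_RESOURCES_OR_PARALLELISMS_CHANGE, JOB_FINISHED }

    /** Stand-in for a rescale record: the first terminal reason wins. */
    static final class Rescale {
        private TerminalReason reason; // null while the rescale is in progress

        /** Stamps the reason only if the rescale is still open; otherwise a no-op. */
        boolean terminate(TerminalReason r) {
            if (reason != null) {
                return false;
            }
            reason = r;
            return true;
        }
    }

    /**
     * Replays the two competing events in order of their virtual timestamps
     * (milliseconds after the rescale was opened) and returns the reason that stuck.
     * A tie is resolved in favor of the job finish, which is an arbitrary choice.
     */
    public static TerminalReason terminalReason(long cooldownMs, long jobFinishesAtMs) {
        Rescale rescale = new Rescale();
        if (cooldownMs < jobFinishesAtMs) {
            // Cooldown timer fires first; the later job finish cannot overwrite it.
            rescale.terminate(TerminalReason.NO_RESOURCES_OR_PARALLELISMS_CHANGE);
            rescale.terminate(TerminalReason.JOB_FINISHED);
        } else {
            // Job finishes while the rescale is still in progress.
            rescale.terminate(TerminalReason.JOB_FINISHED);
            rescale.terminate(TerminalReason.NO_RESOURCES_OR_PARALLELISMS_CHANGE);
        }
        return rescale.reason;
    }

    public static void main(String[] args) {
        // Short cooldown from the issue, assumed slow finish on a loaded machine.
        System.out.println(terminalReason(100, 250));
        // prints NO_RESOURCES_OR_PARALLELISMS_CHANGE

        // Widened cooldown (assumed value): the rescale outlives the finish window.
        System.out.println(terminalReason(60_000, 250));
        // prints JOB_FINISHED
    }
}
```

In this model the outcome depends only on the ordering of the two events, not on how long the test waits afterwards, which matches the argument for widening the cooldown rather than the timeout.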



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
