Martijn Visser created FLINK-39902:
--------------------------------------
Summary: RescaleTimelineITCase.testRescaleTerminatedByJobFinished
fails due to race between task unblock and recorded rescale
Key: FLINK-39902
URL: https://issues.apache.org/jira/browse/FLINK-39902
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Reporter: Martijn Visser
Assignee: Martijn Visser
testRescaleTerminatedByJobFinished is flaky on slow/loaded CI and has failed on
both the default and adaptive scheduler legs of the master mirror:
- 20260609.1 (buildId 75795), test_cron_adaptive_scheduler core
https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=75795
- 20260604.4 (buildId 75621), test_ci core
https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=75621
Failure:
{noformat}
RescaleTimelineITCase.testRescaleTerminatedByJobFinished:284->waitUntilConditionWithTimeout:660 » Timeout
{noformat}
Root cause:
The test submits a blocking job at parallelism 4 (full cluster capacity) and
requests an upscale to parallelism 8. Because 8 exceeds the available slots, the
rescale never changes the running parallelism and is only observable as a second
entry in the recorded rescale history (added by DefaultRescaleTimeline when the
rescale starts). The test calls OnceBlockingNoOpInvokable.unblock() immediately
after the requirement update, racing the scheduler's reaction to that update. On
a slow machine the no-op task finishes before the second rescale is started and
recorded, so the history stays at size 1 and the size-2 / JOB_FINISHED condition
times out after 10s. Sibling tests avoid this by waiting for the new parallelism
via waitForVertexParallelismReachedAndJobRunning before unblocking, but that
helper cannot be used here since parallelism 8 is unreachable.
Proposed fix (test-only, assertion-preserving):
Wait until the second rescale has been recorded (history size == 2) before
unblocking the task, so that the in-progress rescale resolves to JOB_FINISHED
once the job finishes. Move the assumeThat(enabledRescaleHistory(...)) check
ahead of the requirement update so the disabled-history variant skips cleanly.
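
Illustration of the race and the proposed ordering (a minimal, self-contained
sketch, not the actual test or Flink code). The Timeline class, the state
strings, the 200 ms scheduler delay and the waitUntil helper are all
hypothetical stand-ins; the sketch also assumes that a rescale is not recorded
once the job has already finished, which matches the behaviour described above
but is not copied from DefaultRescaleTimeline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

public class RescaleRaceSketch {

    /** Hypothetical stand-in for the recorded rescale history. */
    static final class Timeline {
        private final List<String> history = new ArrayList<>();
        private boolean jobFinished;

        Timeline() {
            history.add("COMPLETED"); // initial rescale to parallelism 4
        }

        /** Records a new rescale, but only while the job is still running (assumption). */
        synchronized void startRescale() {
            if (!jobFinished) {
                history.add("IN_PROGRESS");
            }
        }

        /** Marks the job as finished and resolves an in-progress rescale, if any. */
        synchronized void onJobFinished() {
            jobFinished = true;
            int last = history.size() - 1;
            if ("IN_PROGRESS".equals(history.get(last))) {
                history.set(last, "JOB_FINISHED");
            }
        }

        synchronized List<String> snapshot() {
            return new ArrayList<>(history);
        }
    }

    /** Polls the condition every 10 ms until it holds or the timeout elapses. */
    static void waitUntil(BooleanSupplier condition, long timeoutMillis)
            throws InterruptedException, TimeoutException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                throw new TimeoutException("condition not met in time");
            }
            Thread.sleep(10);
        }
    }

    /** Runs the scenario and returns the history after both threads have ended. */
    static List<String> run(boolean waitForSecondRescale) throws Exception {
        Timeline timeline = new Timeline();
        CountDownLatch unblock = new CountDownLatch(1);

        // Blocking no-op task: finishes the job as soon as it is unblocked.
        Thread task = new Thread(() -> {
            try {
                unblock.await();
                timeline.onJobFinished();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Slow scheduler: reacts to the requirement update only after a delay.
        Thread scheduler = new Thread(() -> {
            try {
                Thread.sleep(200);
                timeline.startRescale();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        task.start();
        scheduler.start();

        if (waitForSecondRescale) {
            // Proposed ordering: unblock only once the second entry is recorded.
            waitUntil(() -> timeline.snapshot().size() == 2, 10_000);
        }
        unblock.countDown();

        task.join();
        scheduler.join();
        return timeline.snapshot();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("unblock immediately:   " + run(false));
        System.out.println("wait for second entry: " + run(true));
    }
}
```

With run(true) the second entry is always recorded before the task is
unblocked, so the history ends with a JOB_FINISHED entry regardless of how slow
the scheduler thread is. With run(false) the outcome depends on timing; with
the delay used here the task normally finishes first and the history stays at
size 1, which is the state the flaky test times out on.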