Martijn Visser created FLINK-39903:
--------------------------------------
Summary:
RescaleTimelineITCase.testRescaleTerminatedByResourceRequirementsUpdated is
flaky: second resource-requirements update can miss the in-progress rescale
Key: FLINK-39903
URL: https://issues.apache.org/jira/browse/FLINK-39903
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Reporter: Martijn Visser
Assignee: Martijn Visser
Failing build:
https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=75621
testRescaleTerminatedByResourceRequirementsUpdated asserts that the second
updateJobResourceRequirements RPC terminates the in-progress rescale started by
the first update, with terminal reason RESOURCE_REQUIREMENTS_UPDATED. That
reason is set by AdaptiveScheduler#recordRescaleForNewResourceRequirements (via
RescaleTimeline#updateRescale), which is a no-op once the current rescale is
already terminated (DefaultRescaleTimeline#isIdling).
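
To make the no-op behaviour concrete, here is a minimal toy model in Java. It
is not Flink's DefaultRescaleTimeline: the class ToyRescaleTimeline and its
methods startRescale, terminate and terminalReason are hypothetical names
invented for this illustration. Only the two terminal-reason names and the
"ignore the update once idling" rule come from the description above.

```java
import java.util.Optional;

/** Toy model only; this is NOT Flink's DefaultRescaleTimeline. All names are hypothetical. */
final class ToyRescaleTimeline {
    enum TerminalReason {
        RESOURCE_REQUIREMENTS_UPDATED,
        NO_RESOURCES_OR_PARALLELISMS_CHANGE
    }

    private boolean inProgress;
    private TerminalReason terminalReason;

    /** Begins a new rescale and clears the previous terminal reason. */
    void startRescale() {
        inProgress = true;
        terminalReason = null;
    }

    /** True when no rescale is in progress, i.e. the current one has terminated. */
    boolean isIdling() {
        return !inProgress;
    }

    /** Records the reason and ends the rescale; returns false (no-op) when already idling. */
    boolean terminate(TerminalReason reason) {
        if (isIdling()) {
            return false;
        }
        terminalReason = reason;
        inProgress = false;
        return true;
    }

    Optional<TerminalReason> terminalReason() {
        return Optional.ofNullable(terminalReason);
    }

    public static void main(String[] args) {
        ToyRescaleTimeline timeline = new ToyRescaleTimeline();
        timeline.startRescale();
        // Something else terminates the rescale first.
        timeline.terminate(TerminalReason.NO_RESOURCES_OR_PARALLELISMS_CHANGE);
        // The later requirements update finds the timeline idling and is ignored.
        boolean recorded = timeline.terminate(TerminalReason.RESOURCE_REQUIREMENTS_UPDATED);
        System.out.println(recorded + " " + timeline.terminalReason().get());
    }
}
```

In this model the first terminal reason wins and any later one is dropped,
which is the shape of the failure the test observes.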
The requested upper bound exceeds the available slots, so the first rescale
cannot change parallelism. With the short cooldown (100 ms) and
resource-stabilization (50 ms) timeouts shared by the parameterized
configuration, the DefaultStateTransitionManager soon re-enters Idling and
terminates the in-progress rescale with NO_RESOURCES_OR_PARALLELISMS_CHANGE.
Both timeouts are wall-clock timers that start when the first rescale is
recorded, so on a slow machine the rescale can be terminated before the second
update RPC is processed. The second update then finds the rescale already
terminated and records nothing, and the assertion fails intermittently.
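
The race can be sketched on a virtual clock, with no real timers involved. The
class RescaleRaceSketch and its method terminalReason are hypothetical. The
single "idle deadline" parameter is a simplifying assumption: how Flink
actually combines the cooldown and stabilization timeouts is not modelled
here, and the 150 ms figure is only an illustrative stand-in for "a deadline on
the order of the configured timeouts".

```java
/** Virtual-clock sketch of the race; hypothetical names, not Flink code. */
final class RescaleRaceSketch {
    enum TerminalReason {
        RESOURCE_REQUIREMENTS_UPDATED,
        NO_RESOURCES_OR_PARALLELISMS_CHANGE
    }

    /**
     * Returns the reason the in-progress rescale ends with.
     *
     * @param idleDeadlineMs when the scheduler falls back to Idling, in ms after
     *     the first rescale was recorded (assumed single deadline)
     * @param secondUpdateMs when the second update RPC is processed, on the same clock
     */
    static TerminalReason terminalReason(long idleDeadlineMs, long secondUpdateMs) {
        // Whichever event happens first terminates the rescale; the later one is ignored.
        return secondUpdateMs < idleDeadlineMs
                ? TerminalReason.RESOURCE_REQUIREMENTS_UPDATED
                : TerminalReason.NO_RESOURCES_OR_PARALLELISMS_CHANGE;
    }

    public static void main(String[] args) {
        long shortDeadlineMs = 150; // illustrative assumption, see the note above
        System.out.println("fast machine: " + terminalReason(shortDeadlineMs, 20));
        System.out.println("slow machine: " + terminalReason(shortDeadlineMs, 400));
    }
}
```

The same second-update latency gives a different terminal reason depending
only on how it compares with the deadline, which is why the failure shows up on
loaded CI machines and not locally.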
This is a test-side timing assumption, not a product bug; re-entering Idling and
recording NO_RESOURCES_OR_PARALLELISMS_CHANGE is correct behaviour.
Proposed fix (test-only): for this case only, rebuild the fixture cluster in
place with widened cooldown/stabilization (60 s) so the in-progress rescale
stays alive across the single synchronous RPC round trip between the two
updates. The shared parameterized configuration used by the other cases is left
untouched; the disabled-history parameter is skipped up front.
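
One possible shape for the per-test override, sketched with plain Java maps
instead of Flink's Configuration class so that it runs standalone. The class
WidenedTimersSketch and the helper withWidenedTimers are hypothetical. The two
option keys are an assumption about the adaptive scheduler's option names and
should be checked against JobManagerOptions in the Flink version under test;
the actual patch may set them differently.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

/** Sketch of a per-test timer override; hypothetical helper, not the actual patch. */
final class WidenedTimersSketch {
    // ASSUMED option keys; verify against JobManagerOptions before relying on them.
    static final String COOLDOWN_KEY =
            "jobmanager.adaptive-scheduler.executing.cooldown-after-rescaling";
    static final String STABILIZATION_KEY =
            "jobmanager.adaptive-scheduler.executing.resource-stabilization-timeout";

    /** Returns a copy of the shared settings with both timers widened; the input is not modified. */
    static Map<String, String> withWidenedTimers(Map<String, String> shared, Duration widened) {
        Map<String, String> copy = new HashMap<>(shared);
        String value = widened.toMillis() + " ms";
        copy.put(COOLDOWN_KEY, value);
        copy.put(STABILIZATION_KEY, value);
        return copy;
    }

    public static void main(String[] args) {
        Map<String, String> shared = new HashMap<>();
        shared.put(COOLDOWN_KEY, "100 ms");
        shared.put(STABILIZATION_KEY, "50 ms");

        Map<String, String> forThisTest = withWidenedTimers(shared, Duration.ofSeconds(60));
        System.out.println("shared:   " + shared.get(COOLDOWN_KEY));
        System.out.println("widened:  " + forThisTest.get(COOLDOWN_KEY));
    }
}
```

Copying rather than mutating the shared settings mirrors the intent stated
above: the other parameterized cases keep their short timers.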