Matthias Pohl created FLINK-38997:
-------------------------------------
Summary: Job can get stuck in state transition because of an unexpected
error during ExecutionGraph creation
Key: FLINK-38997
URL: https://issues.apache.org/jira/browse/FLINK-38997
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 2.1.1, 2.2.0, 2.0.1
Reporter: Matthias Pohl
We observed a job that no longer responded to rescale change events. The job's
{{AdaptiveScheduler}} instance was in the {{WaitingForResources}} state while
its {{StateTransitionManager}} was in the {{Finished}} phase, which does not
process incoming change events (because the phase's methods are not
implemented).
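
To illustrate why a phase with unimplemented methods swallows events, here is a
minimal sketch. It is a simplified model, not Flink's actual
{{StateTransitionManager}} code: the {{Phase}} interface, the {{Active}} class
and the {{deliver}} helper are hypothetical names, and only {{Finished}} is
taken from the description above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model; not Flink's actual StateTransitionManager. */
public class PhaseSketch {

    /** A phase reacts to change events; the default implementation ignores them. */
    interface Phase {
        default void onChange(List<String> log) {
            // no-op: a phase that does not override this silently drops the event
        }
    }

    /** Hypothetical non-terminal phase that reacts to every change event. */
    static final class Active implements Phase {
        @Override
        public void onChange(List<String> log) {
            log.add("change event handled");
        }
    }

    /** Terminal phase: inherits the no-op, so change events have no effect. */
    static final class Finished implements Phase {}

    /** Delivers the given number of change events and returns what was handled. */
    static List<String> deliver(Phase phase, int events) {
        List<String> log = new ArrayList<>();
        for (int i = 0; i < events; i++) {
            phase.onChange(log);
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println("Active handled:   " + deliver(new Active(), 2).size());
        System.out.println("Finished handled: " + deliver(new Finished(), 2).size());
    }
}
```

In this model nothing is logged or thrown when an event reaches {{Finished}},
which matches the symptom: the job simply stops reacting.
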
This was caused by {{createExecutionGraphWithAvailableResourcesAsync}} failing
in its synchronous part in
[AdaptiveScheduler:1360|https://github.com/apache/flink/blob/e0de06e45852b2348c6c80c2a9ed089da645f7cb/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveScheduler.java#L1360]
due to an {{OutOfMemoryError}}. Instead of continuing the state transition, the
method call failed and the error was forwarded to the calling code (an
{{offerSlots}} RPC request from a TM, where the error was logged).
As a consequence, subsequent scaling decisions triggered by resource change
events were ignored. All other jobs in the cluster continued running as normal.
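
The failure mode described above, a method that returns a future but throws in
its synchronous part, can be reproduced in isolation. The sketch below is not
Flink code: {{createGraphAsync}}, {{callSafely}} and {{transition}} are
hypothetical names, and an {{IllegalStateException}} stands in for the
{{OutOfMemoryError}}. It only shows that future-based error handling never runs
when the throw happens before a future exists, and one possible way to route
such a failure into the same handler. Whether catching an {{Error}} like this is
appropriate for the actual fix is left open here.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Hypothetical sketch; the names do not correspond to Flink's actual code. */
public class SyncFailureSketch {

    /** Async-looking method whose synchronous part may throw before a future exists. */
    static CompletableFuture<String> createGraphAsync(boolean failEarly) {
        if (failEarly) {
            // stand-in for the OutOfMemoryError from the report
            throw new IllegalStateException("simulated failure in synchronous part");
        }
        return CompletableFuture.completedFuture("graph");
    }

    /** Unguarded call: returns true if the exception escapes to the caller. */
    static boolean escapesToCaller() {
        try {
            createGraphAsync(true).handle((graph, error) -> "handled");
            return false;
        } catch (IllegalStateException e) {
            // the handle(...) callback above was never registered or invoked
            return true;
        }
    }

    /** Converts a synchronous throw into a failed future so one handler sees every outcome. */
    static <T> CompletableFuture<T> callSafely(Supplier<CompletableFuture<T>> call) {
        try {
            return call.get();
        } catch (Throwable t) {
            CompletableFuture<T> failed = new CompletableFuture<>();
            failed.completeExceptionally(t);
            return failed;
        }
    }

    /** Guarded call: the handler always runs and decides how to continue. */
    static String transition(boolean failEarly) {
        return callSafely(() -> createGraphAsync(failEarly))
                .handle((graph, error) -> error == null ? "graph created" : "failure handled")
                .join();
    }

    public static void main(String[] args) {
        System.out.println("unguarded call escapes: " + escapesToCaller());
        System.out.println("guarded, success: " + transition(false));
        System.out.println("guarded, failure: " + transition(true));
    }
}
```

In the unguarded variant the exception surfaces in whichever code happened to
trigger the call, which corresponds to the {{offerSlots}} RPC handler in the
report.
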