Matthias Pohl created FLINK-38997:
-------------------------------------

             Summary: Job can get stuck in state transition because unexpected 
error during ExecutionGraph creation
                 Key: FLINK-38997
                 URL: https://issues.apache.org/jira/browse/FLINK-38997
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 2.1.1, 2.2.0, 2.0.1
            Reporter: Matthias Pohl


We observed a job that no longer responded to rescale change events. The reason 
for this was that the job's {{AdaptiveScheduler}} instance was in the 
{{WaitingForResources}} state while its {{StateTransitionManager}} was in the 
{{Finished}} phase, in which it no longer processes incoming change events 
(because the phase's methods are not implemented).

This was caused by {{createExecutionGraphWithAvailableResourcesAsync}} failing 
in its synchronous part in 
[AdaptiveScheduler:1360|https://github.com/apache/flink/blob/e0de06e45852b2348c6c80c2a9ed089da645f7cb/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveScheduler.java#L1360]
 due to an {{OutOfMemoryError}}. Instead of continuing the state transition, the 
method call failed and the error was forwarded to the calling code (in this case an 
{{offerSlots}} RPC request from a TaskManager, where the error was logged).

Subsequent scaling decisions triggered by resource change events were ignored as a 
consequence. All other jobs in the cluster continued running as normal.
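
The failure mode described above can be illustrated with a minimal, self-contained sketch. This is not Flink code: {{SyncFailureSketch}}, {{createGraphAsync}}, {{guarded}}, {{transition}} and the returned state labels are hypothetical names chosen for illustration, and an {{IllegalStateException}} stands in for the {{OutOfMemoryError}}. It only shows that an exception thrown in the synchronous part of a future-returning method bypasses any failure handling attached to the future, and that one possible mitigation is to convert such a throw into an exceptionally completed future. Whether an {{OutOfMemoryError}} should be handled that way or escalated as a fatal error is left open here.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

final class SyncFailureSketch {

    /** Hypothetical stand-in for a method whose synchronous part can throw. */
    static CompletableFuture<String> createGraphAsync(boolean failSynchronously) {
        if (failSynchronously) {
            // Thrown before any future is returned to the caller.
            throw new IllegalStateException("simulated failure in synchronous part");
        }
        return CompletableFuture.supplyAsync(() -> "graph");
    }

    /** Turns a synchronous throw into an exceptionally completed future. */
    static <T> CompletableFuture<T> guarded(Supplier<CompletableFuture<T>> call) {
        try {
            return call.get();
        } catch (Throwable t) {
            CompletableFuture<T> failed = new CompletableFuture<>();
            failed.completeExceptionally(t);
            return failed;
        }
    }

    /** Returns a label describing where the (hypothetical) caller ends up. */
    static String transition(boolean useGuard) {
        CompletableFuture<String> future;
        try {
            if (useGuard) {
                future = guarded(() -> createGraphAsync(true));
            } else {
                future = createGraphAsync(true);
            }
        } catch (RuntimeException e) {
            // Unguarded case: the exception escapes before handle() is ever
            // attached, so no follow-up state transition is triggered.
            return "stuck";
        }
        // Guarded case: the failure arrives through the future and is handled.
        return future
                .handle((graph, error) -> error == null ? "graph-created" : "failure-handled")
                .join();
    }

    public static void main(String[] args) {
        System.out.println("unguarded: " + transition(false));
        System.out.println("guarded:   " + transition(true));
    }
}
```

In the unguarded path the caller never reaches the point where the failure callback is registered, which mirrors the scheduler remaining in {{WaitingForResources}} while the exception surfaces in the RPC handler instead.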



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
