Xintong Song created FLINK-13555:
------------------------------------

             Summary: Failures of slot requests requiring unfulfillable managed 
memory should not be ignored.
                 Key: FLINK-13555
                 URL: https://issues.apache.org/jira/browse/FLINK-13555
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination
    Affects Versions: 1.9.0
            Reporter: Xintong Song
             Fix For: 1.9.0
         Attachments: flink-unk-standalonesession-0-u-home.log, 
flink-unk-taskexecutor-0-u-home.log

Currently, SlotPool ignores failed slot requests to the ResourceManager (RM) 
for all batch slot requests. The idea is to let batch slot requests stay 
pending in the SlotPool while they wait for other tasks to finish and release 
slots. A slot request fails only if it is not fulfilled before its timeout.

However, a slot request to the RM can fail for two different reasons.
 # The RM does not have available slots. All slots are in use at the moment, 
but they might become available later when the currently running tasks finish.
 # The slot request requires more resources than any slot (available or not) 
in the cluster can provide. Such a request is not likely to be fulfilled 
later either.
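
The distinction between the two kinds can be sketched as follows. This is 
only an illustration: the class, enum, and method names are hypothetical and 
are not Flink APIs, and a slot is reduced to a single managed memory size in 
MB, which is an assumption made to keep the example small.

```java
import java.util.List;

/** Hypothetical sketch; none of these names are actual Flink classes. */
public class SlotFailureKinds {

    /** The two kinds of failure distinguished in the list above. */
    public enum FailureKind {
        /** Kind 1: some slot is large enough, but no such slot is free right now. */
        TEMPORARILY_UNAVAILABLE,
        /** Kind 2: no slot in the cluster, available or not, is large enough. */
        UNFULFILLABLE
    }

    /**
     * Classifies a slot request that has already failed.
     *
     * @param requestedManagedMemoryMb managed memory required by the request
     * @param slotManagedMemoryMb managed memory of every slot in the cluster,
     *     regardless of whether the slot is currently in use
     */
    public static FailureKind classify(
            long requestedManagedMemoryMb, List<Long> slotManagedMemoryMb) {
        for (long capacity : slotManagedMemoryMb) {
            if (capacity >= requestedManagedMemoryMb) {
                // At least one slot could serve the request once it is released.
                return FailureKind.TEMPORARILY_UNAVAILABLE;
            }
        }
        // Waiting cannot help: nothing in the cluster is large enough.
        return FailureKind.UNFULFILLABLE;
    }
}
```

With slots of 1024 MB and 2048 MB, a failed request for 512 MB would be 
classified as kind 1, while a request for 4096 MB would be classified as 
kind 2.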

For the second kind of failure, it doesn't make sense to wait for the timeout. 
We should fail the job immediately, with a proper error message describing the 
problem and suggesting that the user tune the job or cluster configuration.
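
A minimal sketch of the proposed behavior, assuming the pending batch request 
is represented by a CompletableFuture. The class name, the exception type, the 
method signature, and the message wording are all hypothetical. They 
illustrate the idea and do not reflect the actual SlotPool code.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch; not the actual SlotPool implementation. */
public class BatchSlotRequestHandling {

    /** Hypothetical exception type for requests that can never be fulfilled. */
    public static class UnfulfillableSlotRequestException extends RuntimeException {
        public UnfulfillableSlotRequestException(String message) {
            super(message);
        }
    }

    /**
     * Reacts to a failed request to the RM for a batch slot request.
     *
     * @param pendingRequest future that stands in for the pending slot request
     * @param unfulfillable true if no slot in the cluster can ever serve the request
     * @param requestedManagedMemoryMb managed memory required by the request
     * @param maxSlotManagedMemoryMb largest managed memory offered by any slot
     */
    public static void onRequestFailure(
            CompletableFuture<String> pendingRequest,
            boolean unfulfillable,
            long requestedManagedMemoryMb,
            long maxSlotManagedMemoryMb) {
        if (!unfulfillable) {
            // Current behavior: ignore the failure and keep the request pending
            // until another task releases a slot or the timeout fires.
            return;
        }
        // Proposed behavior: fail right away with a descriptive message.
        pendingRequest.completeExceptionally(
                new UnfulfillableSlotRequestException(String.format(
                        "Slot request requires %d MB of managed memory, but the largest "
                                + "slot in the cluster offers only %d MB. Reduce the job's "
                                + "managed memory requirement or increase the managed "
                                + "memory available per slot.",
                        requestedManagedMemoryMb, maxSlotManagedMemoryMb)));
    }
}
```

In this sketch a failure of the first kind leaves the future untouched, so 
the existing timeout still applies, while a failure of the second kind 
completes the future exceptionally without waiting.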



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
