zhijiang created FLINK-6325:
-------------------------------
Summary: Refinement of slot reuse for task manager failure
Key: FLINK-6325
URL: https://issues.apache.org/jira/browse/FLINK-6325
Project: Flink
Issue Type: Improvement
Components: JobManager
Reporter: zhijiang
Priority: Minor
After a task or TaskManager failure, the new execution attempt tries by default
to take the slot used by the prior execution. This improves state recovery
locality with the RocksDB backend, and it makes sense in the task failure
scenario.
In the TaskManager failure scenario, however, the slots inside the failed
{{TaskManager}} are recycled and cannot be reused. When an execution from that
{{TaskManager}} is reset and allocates a slot from the {{SlotPool}}, no slot
matches by {{ResourceID}}, so it falls back to matching any other available
slot by {{ResourceProfile}}. As a result, the execution from the failed
{{TaskManager}} occupies the slot of another parallel execution, and all the
following executions may no longer be able to reuse their previous slots. This
hurts state recovery.
To solve this problem, we would like to request a new slot for re-deployment
when the execution's prior location is unavailable, so that it no longer
occupies slots that are still alive and belong to other executions.
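
The difference between the current matching and the proposed one can be
sketched as follows. This is only an illustrative model, not Flink code:
{{Slot}}, {{allocateCurrent}}, {{allocateProposed}} and the plain string ids
are hypothetical stand-ins for the real {{SlotPool}}, {{ResourceID}} and
{{ResourceProfile}} types, and an empty result is assumed to mean "request a
new slot".

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Predicate;

public class SlotReuseSketch {

    /** Hypothetical slot: the id of the hosting TaskManager plus a profile tag. */
    static final class Slot {
        final String resourceId;
        final String profile;

        Slot(String resourceId, String profile) {
            this.resourceId = resourceId;
            this.profile = profile;
        }
    }

    /** Current behaviour: match by prior location first, then by profile only. */
    static Optional<Slot> allocateCurrent(
            List<Slot> pool, String priorResourceId, String profile) {
        Optional<Slot> byLocation =
                takeFirst(pool, s -> s.resourceId.equals(priorResourceId));
        if (byLocation.isPresent()) {
            return byLocation;
        }
        // Fallback: any available slot with a matching profile, even if it
        // is the previous slot of another parallel execution.
        return takeFirst(pool, s -> s.profile.equals(profile));
    }

    /**
     * Proposed behaviour (assumption): if the prior location is no longer
     * alive, skip the fallback and return empty so that a new slot is requested.
     */
    static Optional<Slot> allocateProposed(
            List<Slot> pool, Set<String> aliveTaskManagers,
            String priorResourceId, String profile) {
        if (priorResourceId != null && !aliveTaskManagers.contains(priorResourceId)) {
            return Optional.empty();
        }
        return allocateCurrent(pool, priorResourceId, profile);
    }

    /** Removes and returns the first slot in the pool that satisfies the predicate. */
    private static Optional<Slot> takeFirst(List<Slot> pool, Predicate<Slot> matches) {
        Iterator<Slot> it = pool.iterator();
        while (it.hasNext()) {
            Slot slot = it.next();
            if (matches.test(slot)) {
                it.remove();
                return Optional.of(slot);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // tm1 has failed; only the slot on tm2 is still available.
        Set<String> alive = new HashSet<>();
        alive.add("tm2");

        List<Slot> pool = new ArrayList<>();
        pool.add(new Slot("tm2", "default"));
        Optional<Slot> current = allocateCurrent(pool, "tm1", "default");
        System.out.println("current: execution from tm1 got "
                + current.map(s -> s.resourceId).orElse("a new slot request"));

        pool = new ArrayList<>();
        pool.add(new Slot("tm2", "default"));
        Optional<Slot> proposed = allocateProposed(pool, alive, "tm1", "default");
        System.out.println("proposed: execution from tm1 got "
                + proposed.map(s -> s.resourceId).orElse("a new slot request"));
    }
}
```

In this model the execution from the failed {{TaskManager}} leaves the slot on
the surviving one in the pool, so the execution that previously ran there can
still be matched by location.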
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)