[ https://issues.apache.org/jira/browse/FLINK-6325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhijiang reassigned FLINK-6325:
-------------------------------

    Assignee: zhijiang

> Refinement of slot reuse for task manager failure
> -------------------------------------------------
>
>                 Key: FLINK-6325
>                 URL: https://issues.apache.org/jira/browse/FLINK-6325
>             Project: Flink
>          Issue Type: Improvement
>          Components: JobManager
>            Reporter: zhijiang
>            Assignee: zhijiang
>            Priority: Minor
>
> After a task or TaskManager failure, the new execution attempt tries by
> default to take the slot used by the prior execution. This benefits state
> recovery locality with the RocksDB backend, and it makes sense for the task
> failure scenario.
> For the TaskManager failure scenario, however, the slots inside the failed
> {{TaskManager}} are recycled and cannot be reused. When an execution that ran
> there is reset and allocates a slot from {{SlotPool}}, no slot matches by
> {{ResourceID}}, so it tries to match any other available slot by
> {{ResourceProfile}}. As a result, a slot that belonged to another parallel
> execution is occupied by the execution from the failed {{TaskManager}}, and
> the following executions may no longer be able to reuse their previous slots.
> This hurts state recovery.
> To solve this problem, we would like to request a new slot for re-deployment
> when the prior location is unavailable, so that the execution does not occupy
> slots on TaskManagers that are still alive.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
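
The allocation behavior described in the issue, and the proposed refinement, can be sketched roughly as follows. This is only an illustration under stated assumptions: `SlotReuseSketch`, `Slot`, `allocateWithFallback`, and `allocateRefined` are hypothetical names, plain strings stand in for {{ResourceID}} and {{ResourceProfile}}, and none of this is Flink's actual {{SlotPool}} code.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch; these are NOT Flink's actual SlotPool classes. */
class SlotReuseSketch {

    /** A free slot: the TaskManager that owns it plus its resource profile. */
    static final class Slot {
        final String taskManagerId; // stands in for ResourceID (assumption)
        final String profile;       // stands in for ResourceProfile (assumption)

        Slot(String taskManagerId, String profile) {
            this.taskManagerId = taskManagerId;
            this.profile = profile;
        }
    }

    /**
     * Behavior described in the issue: prefer a slot on the prior TaskManager,
     * otherwise fall back to any free slot with a matching profile.
     * This is a pure lookup; it does not remove the slot from the list.
     */
    static Optional<Slot> allocateWithFallback(
            List<Slot> freeSlots, String priorTaskManagerId, String profile) {
        for (Slot slot : freeSlots) {
            if (slot.taskManagerId.equals(priorTaskManagerId)
                    && slot.profile.equals(profile)) {
                return Optional.of(slot);
            }
        }
        // Fallback: this is where another execution's previous slot can be taken.
        for (Slot slot : freeSlots) {
            if (slot.profile.equals(profile)) {
                return Optional.of(slot);
            }
        }
        return Optional.empty();
    }

    /**
     * Proposed refinement: if the prior TaskManager is no longer alive, return
     * empty so the caller requests a brand-new slot instead of taking a slot
     * that another execution may want to reuse.
     */
    static Optional<Slot> allocateRefined(
            List<Slot> freeSlots, Set<String> aliveTaskManagerIds,
            String priorTaskManagerId, String profile) {
        if (!aliveTaskManagerIds.contains(priorTaskManagerId)) {
            return Optional.empty();
        }
        return allocateWithFallback(freeSlots, priorTaskManagerId, profile);
    }
}
```

With a single free slot on `tm-2` and a failed `tm-1`, `allocateWithFallback` hands the `tm-2` slot to the execution that previously ran on `tm-1`, whereas `allocateRefined` returns an empty result, which the caller would treat as a signal to request a new slot.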