[ 
https://issues.apache.org/jira/browse/FLINK-6325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-6325:
-------------------------------

    Assignee: zhijiang

> Refinement of slot reuse for task manager failure
> -------------------------------------------------
>
>                 Key: FLINK-6325
>                 URL: https://issues.apache.org/jira/browse/FLINK-6325
>             Project: Flink
>          Issue Type: Improvement
>          Components: JobManager
>            Reporter: zhijiang
>            Assignee: zhijiang
>            Priority: Minor
>
> After a task or TaskManager failure, the new execution attempt tries by 
> default to take the slot of the prior execution. This improves state 
> recovery locality with the RocksDB backend, and it makes sense for the task 
> failure scenario.
> In the TaskManager failure scenario, however, the slots inside the failed 
> TaskManager are recycled and cannot be reused. When the affected execution 
> is reset and allocates a slot from the {{SlotPool}}, no slot matches by 
> {{ResourceID}}, so the allocation falls back to matching any other available 
> slot by {{ResourceProfile}}. As a result, the slot of another parallel 
> execution is occupied by the execution from the failed {{TaskManager}}, and 
> the following executions may in turn be unable to reuse their previous 
> slots. This hurts state recovery locality.
> To solve this problem, we would like to request a new slot for re-deployment 
> when the execution's prior location is unavailable, so that it no longer 
> occupies the slots of other live executions.
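
The allocation order described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToySlotPool`, `Slot`, `allocate`, the string-valued TaskManager ids and profiles, and the `requestNewIfLocationGone` flag are all hypothetical names invented for illustration and are not Flink's actual `SlotPool`, `ResourceID`, or `ResourceProfile` APIs. It assumes two executions A and B that previously ran on `tm-1` and `tm-2`, that `tm-1` fails, and that both executions are then re-deployed in order.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class SlotReuseSketch {

    /** Hypothetical slot: the id of the owning TaskManager plus a profile label. */
    static final class Slot {
        final String taskManagerId;
        final String profile;
        Slot(String taskManagerId, String profile) {
            this.taskManagerId = taskManagerId;
            this.profile = profile;
        }
    }

    /** Toy pool for illustration only; it is not Flink's SlotPool. */
    static final class ToySlotPool {
        private final List<Slot> available = new ArrayList<>();
        private final Set<String> liveTaskManagers = new HashSet<>();
        private int newSlotCounter = 0;

        void register(String tm, String profile) {
            liveTaskManagers.add(tm);
            available.add(new Slot(tm, profile));
        }

        /** A failed TaskManager takes its slots with it. */
        void failTaskManager(String tm) {
            liveTaskManagers.remove(tm);
            available.removeIf(s -> s.taskManagerId.equals(tm));
        }

        Slot allocate(String priorTm, String profile, boolean requestNewIfLocationGone) {
            // 1. Prefer the slot at the prior location (state recovery locality).
            Slot byLocation = take(priorTm, profile);
            if (byLocation != null) {
                return byLocation;
            }
            // 2. Proposed refinement: if the prior location is gone, request a new
            //    slot instead of taking one that another execution would prefer.
            if (requestNewIfLocationGone && !liveTaskManagers.contains(priorTm)) {
                return requestNew(profile);
            }
            // 3. Fallback described in the issue: match any available slot by profile.
            Slot byProfile = take(null, profile);
            return byProfile != null ? byProfile : requestNew(profile);
        }

        /** Removes and returns the first matching slot; a null tm matches any location. */
        private Slot take(String tm, String profile) {
            for (Iterator<Slot> it = available.iterator(); it.hasNext(); ) {
                Slot s = it.next();
                if ((tm == null || s.taskManagerId.equals(tm)) && s.profile.equals(profile)) {
                    it.remove();
                    return s;
                }
            }
            return null;
        }

        private Slot requestNew(String profile) {
            return new Slot("new-tm-" + (++newSlotCounter), profile);
        }
    }

    /** Returns the TaskManager ids that executions A and B end up on, in that order. */
    static List<String> simulate(boolean refined) {
        ToySlotPool pool = new ToySlotPool();
        pool.register("tm-1", "default"); // execution A ran here
        pool.register("tm-2", "default"); // execution B ran here
        pool.failTaskManager("tm-1");

        List<String> placement = new ArrayList<>();
        placement.add(pool.allocate("tm-1", "default", refined).taskManagerId); // A
        placement.add(pool.allocate("tm-2", "default", refined).taskManagerId); // B
        return placement;
    }

    public static void main(String[] args) {
        System.out.println(simulate(false)); // prints [tm-2, new-tm-1]
        System.out.println(simulate(true));  // prints [new-tm-1, tm-2]
    }
}
```

In this toy run, the fallback by profile lets A take B's former slot on `tm-2`, so B loses its local state as well. With the refinement, A gets a new slot and B keeps `tm-2`.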



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
