Hi everyone,

This post proposes a blacklist mechanism as an enhancement to the Flink scheduler. The motivation is as follows.
In our clusters, jobs occasionally run into hardware and software environment problems, such as missing software libraries, bad hardware, and resource shortages like running out of disk space. These problems lead to task failures, which the failover strategy handles by redeploying the affected tasks. However, because of location preference and limited total resources, a failed task can end up scheduled on the same host again, where it fails again, many times over. The root cause is a mismatch between the task and the resource, which the current resource allocation algorithm does not take into account.

We introduce the blacklist mechanism to solve this problem. The basic idea is that when a task fails too many times on some resource, the scheduler will no longer assign that resource to that task. We have implemented this feature in our internal version of Flink, and so far it works fine.

The design draft is linked below. We would really appreciate your review and comments.

https://docs.google.com/document/d/1Qfb_QPd7CLcGT-kJjWSCdO8xFeobSCHF0vNcfiO4Bkw

Best,
Yingjie
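P.S. To make the basic idea a bit more concrete, here is a minimal sketch in Java. It is only an illustration of "stop assigning a resource to a task after too many failures there", not the implementation from the design draft and not an existing Flink API. The class name TaskHostBlacklist, its methods, the use of a host name as the resource key, and the fixed failure threshold are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical per-task host blacklist; illustrative only, not a Flink class. */
public class TaskHostBlacklist {

    // Assumed policy: a host is blacklisted for a task once the task
    // has failed on it at least this many times.
    private final int maxFailuresPerHost;

    // taskId -> (host -> number of failures of that task on that host)
    private final Map<String, Map<String, Integer>> failures = new HashMap<>();

    public TaskHostBlacklist(int maxFailuresPerHost) {
        this.maxFailuresPerHost = maxFailuresPerHost;
    }

    /** Record one failure of the given task on the given host. */
    public void recordFailure(String taskId, String host) {
        failures.computeIfAbsent(taskId, k -> new HashMap<>())
                .merge(host, 1, Integer::sum);
    }

    /** True if the task has reached the failure threshold on this host. */
    public boolean isBlacklisted(String taskId, String host) {
        Map<String, Integer> perHost = failures.get(taskId);
        int count = (perHost == null) ? 0 : perHost.getOrDefault(host, 0);
        return count >= maxFailuresPerHost;
    }

    /**
     * Pick the first candidate host that is not blacklisted for the task.
     * Candidates are assumed to be ordered by preference (e.g. locality).
     */
    public Optional<String> pickHost(String taskId, List<String> candidates) {
        for (String host : candidates) {
            if (!isBlacklisted(taskId, host)) {
                return Optional.of(host);
            }
        }
        return Optional.empty();
    }
}
```

In this sketch the blacklist is scoped per task, so a host that is bad for one task stays available to others. A real design would also need to decide when entries expire and what to do when every candidate host is blacklisted; the sketch simply returns an empty result in that case.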