Hi all,

Flink job failures may happen due to cluster node issues (insufficient disk space, bad hardware, network abnormalities). Flink handles such failures by redeploying the affected tasks. However, due to data locality and limited resources, the new tasks are very likely to be redeployed to the same nodes, resulting in repeated task failures that block job progress.
Currently, Flink users need to manually identify the problematic nodes and take them offline to solve this problem. This approach has the following disadvantages:

1. Taking a node offline can be a heavy process. Users may need to contact cluster administrators to do it, and the operation can even be risky and disallowed during important business events.
2. Identifying and resolving this kind of problem manually is slow and wastes human effort.

To solve this problem, Zhu Zhu and I propose to introduce a blacklist mechanism for Flink to filter out problematic resources. You can find more details in FLIP-224 [1].

Looking forward to your feedback.

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism

Best,
Lijie