Hi all,

Flink jobs may fail due to cluster node issues (insufficient disk
space, bad hardware, network abnormalities). Flink will handle the
failures and redeploy the affected tasks. However, due to data locality
and limited resources, the new tasks are very likely to be redeployed to
the same nodes, resulting in repeated task failures that stall job
progress.

Currently, Flink users need to manually identify the problematic node and
take it offline to solve this problem. But this approach has the following
disadvantages:

1. Taking a node offline can be a heavyweight process. Users may need to
contact cluster administrators to do this, and the operation may even be
risky and disallowed during important business events.

2. Identifying and resolving such problems manually is slow and a waste of
human effort.

To solve this problem, Zhu Zhu and I propose to introduce a blacklist
mechanism for Flink to filter out problematic resources.
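To give a flavor of the idea, here is a minimal, purely illustrative
sketch (not the FLIP-224 API; all class and method names here are
hypothetical) of a tracker that blacklists a node once it accumulates
too many task failures within a time window:

import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative sketch only: tracks task failures per node and marks a
 * node as blacklisted once failures exceed a threshold within a time
 * window. Not the FLIP-224 design; names are made up for illustration.
 */
public class NodeBlacklistSketch {
    private static final int MAX_FAILURES = 3;
    private static final Duration WINDOW = Duration.ofMinutes(10);

    private final Map<String, FailureRecord> failures = new ConcurrentHashMap<>();

    /**
     * Records a task failure on the given node; returns true if the
     * node should now be filtered out during scheduling.
     */
    public boolean onTaskFailure(String nodeId) {
        FailureRecord record = failures.compute(nodeId, (id, old) -> {
            Instant now = Instant.now();
            if (old == null || now.isAfter(old.windowStart.plus(WINDOW))) {
                // No record yet, or the window expired: start a new one.
                return new FailureRecord(now, 1);
            }
            return new FailureRecord(old.windowStart, old.count + 1);
        });
        return record.count >= MAX_FAILURES;
    }

    /** Returns true while the node is within an active blacklist window. */
    public boolean isBlacklisted(String nodeId) {
        FailureRecord record = failures.get(nodeId);
        return record != null
                && record.count >= MAX_FAILURES
                && Instant.now().isBefore(record.windowStart.plus(WINDOW));
    }

    private static final class FailureRecord {
        final Instant windowStart;
        final int count;

        FailureRecord(Instant windowStart, int count) {
            this.windowStart = windowStart;
            this.count = count;
        }
    }
}

The actual proposal covers much more than this sketch, but the core
idea is the same: avoid redeploying tasks onto resources that keep
failing.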


You can find more details in FLIP-224[1]. Looking forward to your feedback.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism


Best,

Lijie
