Hi Konstantin,

Thanks for your feedback. I will respond to your 4 remarks:


1) Thanks for reminding me of the controversy. I think “BlockList” is good
enough, and I will change it in the FLIP.


2) Your suggestion for the REST API is a good idea. Based on the above, I
would change the REST API as follows:

POST/GET <host>/blocklist/nodes

POST/GET <host>/blocklist/taskmanagers

DELETE <host>/blocklist/node/<identifier>

DELETE <host>/blocklist/taskmanager/<identifier>
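
To make the node endpoints above a bit more concrete, here is a rough sketch of how a client could build the requests. The host address, the JSON body and the class name are assumptions for illustration only and are not part of the FLIP; the requests are only built, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class BlocklistRequests {
    // Hypothetical address of the Flink REST endpoint (assumption).
    static final String HOST = "http://localhost:8081";

    // GET <host>/blocklist/nodes
    static HttpRequest listNodes() {
        return HttpRequest.newBuilder(URI.create(HOST + "/blocklist/nodes"))
                .GET()
                .build();
    }

    // POST <host>/blocklist/nodes
    // The body format is not specified in this thread; jsonBody is a placeholder.
    static HttpRequest addNode(String jsonBody) {
        return HttpRequest.newBuilder(URI.create(HOST + "/blocklist/nodes"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    // DELETE <host>/blocklist/node/<identifier>
    static HttpRequest removeNode(String identifier) {
        return HttpRequest.newBuilder(URI.create(HOST + "/blocklist/node/" + identifier))
                .DELETE()
                .build();
    }
}
```

The task manager endpoints would follow the same pattern with the taskmanagers/taskmanager paths.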


3) If a node is blocked (blocklisted), all task managers on that node are
blocklisted and none of the slots on these TMs are available. This is
somewhat like losing the TMs, but they are not really lost: they are in an
unavailable state and are still registered in the Flink cluster. They
become available again once the corresponding blocklist item is removed.
This behavior is the same in active and non-active clusters. However, in
active clusters these TMs may be released due to idle timeouts.
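
The semantics described in 3) can be sketched roughly as follows. This is only a toy model under my own assumptions (the class and method names are made up and do not exist in Flink); it just shows that a blocked node makes its TMs unavailable without unregistering them.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class BlockedNodeSketch {
    // Registered task managers: TM id -> id of the node it runs on.
    private final Map<String, String> tmToNode = new HashMap<>();
    private final Set<String> blockedNodes = new HashSet<>();
    private final Set<String> blockedTms = new HashSet<>();

    void register(String tmId, String nodeId) { tmToNode.put(tmId, nodeId); }

    void blockNode(String nodeId) { blockedNodes.add(nodeId); }
    void unblockNode(String nodeId) { blockedNodes.remove(nodeId); }

    void blockTaskManager(String tmId) { blockedTms.add(tmId); }
    void unblockTaskManager(String tmId) { blockedTms.remove(tmId); }

    // Blocking never touches the registration itself.
    boolean isRegistered(String tmId) { return tmToNode.containsKey(tmId); }

    // A TM's slots are usable only if neither the TM nor its node is blocked.
    boolean isAvailable(String tmId) {
        String nodeId = tmToNode.get(tmId);
        return nodeId != null
                && !blockedTms.contains(tmId)
                && !blockedNodes.contains(nodeId);
    }
}
```

For example, after blocking a node, every TM registered on it reports isAvailable == false but isRegistered == true, and unblocking the node makes those TMs available again.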


4) For the item timeout, I prefer to keep it. The reasons are as follows:

a) The timeout will not affect users adding or removing items via the REST
API, and users can disable it by configuring it to Long.MAX_VALUE.

b) Some node problems resolve on their own after a period of time (such as
machine hotspots), in which case users may prefer that Flink lifts the
block automatically instead of requiring them to do it manually.
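
To illustrate the two points above, here is a minimal sketch of how an item timeout could coexist with manual add/remove. The class, its methods and the millisecond clock are assumptions for illustration, not the actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

final class TimedBlocklistSketch {
    private final long timeoutMs;
    // Item id -> absolute deadline in ms; Long.MAX_VALUE means "never expires".
    private final Map<String, Long> deadlines = new HashMap<>();

    TimedBlocklistSketch(long timeoutMs) { this.timeoutMs = timeoutMs; }

    // Manual add (nowMs is assumed to be non-negative).
    void add(String id, long nowMs) {
        // Saturate instead of overflowing, so a timeout of Long.MAX_VALUE
        // effectively disables expiry.
        long deadline = timeoutMs > Long.MAX_VALUE - nowMs
                ? Long.MAX_VALUE
                : nowMs + timeoutMs;
        deadlines.put(id, deadline);
    }

    // Manual removal works regardless of the timeout.
    boolean remove(String id) { return deadlines.remove(id) != null; }

    // Expired items are dropped lazily when they are looked up.
    boolean isBlocked(String id, long nowMs) {
        Long deadline = deadlines.get(id);
        if (deadline == null) {
            return false;
        }
        if (deadline != Long.MAX_VALUE && nowMs >= deadline) {
            deadlines.remove(id);
            return false;
        }
        return true;
    }
}
```

With a finite timeout an item disappears on its own once the deadline passes; with Long.MAX_VALUE it stays until the user removes it.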


Best,

Lijie

On Wed, Apr 27, 2022 at 19:23, Konstantin Knauf <kna...@apache.org> wrote:

> Hi Lijie,
>
> I think, this makes sense and +1 to only support manually blocking
> taskmanagers and nodes. Maybe the different strategies can also be
> maintained outside of Apache Flink.
>
> A few remarks:
>
> 1) Can we use another term than "bla.cklist" due to the controversy around
> the term? [1] There was also a Jira Ticket about this topic a while back
> and there was generally a consensus to avoid the term blacklist & whitelist
> [2]. We could use "blocklist", "denylist" or "quarantined".
> 2) For the REST API, I'd prefer a slightly different design as verbs like
> add/remove are often considered an anti-pattern for REST APIs. POST on a list
> item is generally the standard to add items. DELETE on the individual
> resource is standard to remove an item.
>
> POST <host>/quarantine/items
> DELETE <host>/quarantine/items/<itemidentifier>
>
> We could also consider to separate taskmanagers and nodes in the REST API
> (and internal data structures). Any opinion on this?
>
> POST/GET <host>/quarantine/nodes
> POST/GET <host>/quarantine/taskmanager
> DELETE <host>/quarantine/nodes/<identifier>
> DELETE <host>/quarantine/taskmanager/<identifier>
>
> 3) How would blocking nodes behave with non-active resource managers, i.e.
> standalone or reactive mode?
>
> 4) To keep the implementation even more minimal, do we need the timeout
> behavior? If items are added/removed manually we could delegate this to the
> user easily. In my opinion the timeout behavior would better fit into
> specific strategies at a later point.
>
> Looking forward to your thoughts.
>
> Cheers and thank you,
>
> Konstantin
>
> [1]
>
> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> [2] https://issues.apache.org/jira/browse/FLINK-18209
>
> On Wed, Apr 27, 2022 at 04:04, Lijie Wang <
> wangdachui9...@gmail.com> wrote:
>
> > Hi all,
> >
> > Flink job failures may happen due to cluster node issues (insufficient
> > disk space, bad hardware, network abnormalities). Flink will take care
> > of the failures and redeploy the tasks. However, due to data locality
> > and limited resources, the new tasks are very likely to be redeployed to
> > the same nodes, which will result in continuous task abnormalities and
> > affect job progress.
> >
> > Currently, Flink users need to manually identify the problematic node and
> > take it offline to solve this problem. But this approach has following
> > disadvantages:
> >
> > 1. Taking a node offline can be a heavy process. Users may need to
> > contact cluster administrators to do this. The operation can even be
> > dangerous and not allowed during some important business events.
> >
> > 2. Identifying and solving this kind of problem manually would be slow
> > and a waste of human resources.
> >
> > To solve this problem, Zhu Zhu and I propose to introduce a blacklist
> > mechanism for Flink to filter out problematic resources.
> >
> >
> > You can find more details in FLIP-224[1]. Looking forward to your
> > feedback.
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> >
> >
> > Best,
> >
> > Lijie
> >
>
