Thanks for your feedback, Jiangang and Martijn.

@Jiangang

> For auto-detecting, I wonder how to make the strategy and mark a node blocked?

We do not plan to support auto-detection in this FLIP; it may be addressed in a separate FLIP in the future. Several others share your concern, and the correctness and necessity of auto-detection may require further discussion.

> In session mode, multi jobs can fail on the same bad node and the node should be marked blocked.

By design, the blocklist information will be shared among all jobs in a cluster/session. The JM will sync the blocklist information with the RM.

@Martijn

> I agree with Yang Wang on this.

As Zhu Zhu and I mentioned above, we think MARK_BLOCKLISTED (which just limits the load on the node and does not kill all the processes on it) is also important, and external systems (*yarn rmadmin or kubectl taint*) cannot support it. So we think the mechanism makes sense even if it is *manual* only.

> I also agree with Chesnay that magical mechanisms are indeed super hard to get right.

Yes, as you see, Jiangang (and a few others) have the same concern. However, we currently plan to support only *manual* blocking in this FLIP, not auto-detection. In addition, I'd like to point out that the FLIP provides a mechanism supporting both MARK_BLOCKLISTED and MARK_BLOCKLISTED_AND_EVACUATE_TASKS; auto-detection may be done by external systems.

Best,
Lijie

On Fri, May 6, 2022 at 19:04, Martijn Visser <mart...@ververica.com> wrote:

> > If we only support to block nodes manually, then I could not see the obvious advantages compared with current SRE's approach(via *yarn rmadmin or kubectl taint*).
>
> I agree with Yang Wang on this.
>
> > To me this sounds yet again like one of those magical mechanisms that will rarely work just right.
>
> I also agree with Chesnay that magical mechanisms are indeed super hard to get right.
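To make the difference between the two actions concrete, here is a minimal illustrative sketch. Only the two enum value names come from the FLIP; the surrounding class and methods are hypothetical, not the proposed implementation:

```java
// Illustrative only: the enum values are FLIP-224's action names,
// everything else is a hypothetical sketch of their semantics.
enum BlocklistAction {
    MARK_BLOCKLISTED,                    // block future allocations, keep running tasks
    MARK_BLOCKLISTED_AND_EVACUATE_TASKS  // block future allocations AND restart current tasks elsewhere
}

public class BlocklistActionDemo {

    /** Both actions forbid new slot allocations on the blocked TM/node. */
    static boolean allowsNewAllocations(BlocklistAction action) {
        return false;
    }

    /** Only the second action evacuates (restarts) tasks already running there. */
    static boolean evacuatesRunningTasks(BlocklistAction action) {
        return action == BlocklistAction.MARK_BLOCKLISTED_AND_EVACUATE_TASKS;
    }

    public static void main(String[] args) {
        System.out.println(evacuatesRunningTasks(BlocklistAction.MARK_BLOCKLISTED));                     // false
        System.out.println(evacuatesRunningTasks(BlocklistAction.MARK_BLOCKLISTED_AND_EVACUATE_TASKS));  // true
    }
}
```

The point of the sketch: MARK_BLOCKLISTED is purely about future scheduling decisions, which is exactly the part an external evict-based tool cannot express.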
>
> Best regards,
>
> Martijn
>
> On Fri, 6 May 2022 at 12:03, Jiangang Liu <liujiangangp...@gmail.com> wrote:
>
>> Thanks for the valuable design. The auto-detection can save us a great deal of work. We have implemented a similar feature in our internal Flink version. Below is what I care about:
>>
>> 1. For auto-detecting, I wonder how to make the strategy and mark a node blocked? Sometimes the blocked node is hard to detect; for example, the upstream or downstream node may be blocked when the network is unreachable.
>> 2. I see that the strategy is made on the JobMaster side. How about implementing similar logic in the resource manager? In session mode, multiple jobs can fail on the same bad node, and the node should be marked blocked. If each job makes the strategy, the node may not be marked blocked if its failure count doesn't exceed the threshold.
>>
>> On Thu, May 5, 2022 at 23:35, Zhu Zhu <reed...@gmail.com> wrote:
>>
>>> Thank you for all your feedback!
>>>
>>> Besides the answers from Lijie, I'd like to share some of my thoughts:
>>>
>>> 1. Whether to enable automatic blocklisting. Generally speaking, it is not a goal of FLIP-224. The automatic way should be something built upon the blocklist mechanism and well decoupled. It was designed to be a configurable blocklist strategy, but I think we can further decouple it by introducing an abnormal node detector, as Becket suggested, which just uses the blocklist mechanism once bad nodes are detected. However, it should be a separate FLIP with further dev discussion and feedback from users. I also agree with Becket that different users have different requirements, and we should listen to them.
>>>
>>> 2. Is it enough to just take away abnormal nodes externally? My answer is no. As Lijie has mentioned, we need a way to avoid deploying tasks to temporary hot nodes.
>>> In this case, users may just want to limit the load on the node and do not want to kill all the processes on it. Another case is speculative execution[1], which may also leverage this feature to avoid starting mirror tasks on slow nodes.
>>>
>>> Thanks,
>>> Zhu
>>>
>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
>>>
>>> On Thu, May 5, 2022 at 15:56, Lijie Wang <wangdachui9...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> Thanks for your feedback.
>>>>
>>>> There's one detail that I'd like to re-emphasize here, because it can affect the value and design of the blocklist mechanism (perhaps I should highlight it in the FLIP). We propose two actions in the FLIP:
>>>>
>>>> 1) MARK_BLOCKLISTED: Just mark the task manager or node as blocked. Future slots should not be allocated from the blocked task manager or node, but slots that are already allocated will not be affected. A typical application scenario is to mitigate machine hotspots: we hope that subsequent resource allocations will not land on the hot machine, but tasks currently running on it should not be affected.
>>>>
>>>> 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the task manager or node as blocked, and evacuate all tasks on it. Evacuated tasks will be restarted on non-blocked task managers.
>>>>
>>>> Of the two actions, the former better highlights the value of this FLIP, because an external system cannot do that.
>>>>
>>>> Regarding *Manually* and *Automatically*, I basically agree with @Becket Qin: different users will give different answers. Not all users' deployment environments have a special external system that can perform anomaly detection. In addition, adding pluggable/optional auto-detection doesn't require much extra work on top of manual specification.
>>>>
>>>> I will answer your other questions one by one.
>>>>
>>>> @Yangze
>>>>
>>>> a) I think you are right; we do not need to expose `cluster.resource-blocklist.item.timeout-check-interval` to users.
>>>>
>>>> b) We can abstract `notifyException` into a separate interface (maybe BlocklistExceptionListener), and the ResourceManagerBlocklistHandler can implement it in the future.
>>>>
>>>> @Martijn
>>>>
>>>> a) I also think the manual blocking should be done by cluster operators.
>>>>
>>>> b) I think manual blocking makes sense because, in my experience, users are often the first to notice machine problems (through job failover or delay), and they will contact cluster operators to solve them, or even tell the cluster operators which machine is problematic. From this point of view, the people who really need manual blocking are the users; it is merely performed by the cluster operators. So I think manual blocking makes sense.
>>>>
>>>> @Chesnay
>>>>
>>>> We need to touch the logic of the JM/SlotPool because, for MARK_BLOCKLISTED, we need to know whether the slot is blocklisted when a task reaches FINISHED/CANCELLED/FAILED. If so, the SlotPool should release the slot directly to avoid assigning other tasks (of this job) to it. If we only maintained the blocklist information on the RM, the JM would need to retrieve it by RPC, and I think the performance overhead of that is relatively large. So I think it's worth maintaining the blocklist information on the JM side and syncing it.
>>>>
>>>> @Роман
>>>>
>>>> a) "Probably storing inside Zookeeper/Configmap might be helpful here." Can you explain it in detail? I don't fully understand that. In my opinion, non-active and active are the same, and no special treatment is required.
>>>>
>>>> b) I agree with you; the `endTimestamp` makes sense. I will add it to the FLIP.
>>>>
>>>> @Yang
>>>>
>>>> As mentioned above, AFAIK, an external system cannot support the MARK_BLOCKLISTED action.
>>>>
>>>> Looking forward to your further feedback.
>>>>
>>>> Best,
>>>> Lijie
>>>>
>>>> On Tue, May 3, 2022 at 21:09, Yang Wang <danrtsey...@gmail.com> wrote:
>>>>
>>>>> Thanks Lijie and Zhu for creating the proposal.
>>>>>
>>>>> I want to share some thoughts about Flink cluster operations.
>>>>>
>>>>> In the production environment, the SREs (Site Reliability Engineers) already have many tools to detect unstable nodes, which can take system logs/metrics into consideration. They then use graceful decommission in YARN and taints in K8s to prevent new allocations on these unstable nodes. Finally, they evict all the containers and pods running on these nodes. This mechanism also works for planned maintenance. So I am afraid this is not the typical use case for FLIP-224.
>>>>>
>>>>> If we only support to block nodes manually, then I could not see the obvious advantages compared with current SRE's approach (via *yarn rmadmin or kubectl taint*). At least, we need a pluggable component which could expose potentially unstable nodes automatically and block them if enabled explicitly.
>>>>>
>>>>> Best,
>>>>> Yang
>>>>>
>>>>> On Mon, May 2, 2022 at 16:36, Becket Qin <becket....@gmail.com> wrote:
>>>>>
>>>>>> Thanks for the proposal, Lijie.
>>>>>>
>>>>>> This is an interesting feature and discussion, and somewhat related to the design principle of how people should operate Flink.
>>>>>>
>>>>>> I think there are three things involved in this FLIP:
>>>>>> a) Detect and report the unstable node.
>>>>>> b) Collect the information about the unstable node and form a blocklist.
>>>>>> c) Take the action to block nodes.
>>>>>>
>>>>>> My two cents:
>>>>>>
>>>>>> 1. It looks like people all agree that Flink should have c). It is not only useful in cases of node failure, but also handy for some planned maintenance.
>>>>>>
>>>>>> 2. People have different opinions on b), i.e. who should be the brain that decides to block a node. I think this largely depends on who we talk to; different users would probably give different answers. For people who do have a centralized node health management service, letting Flink do just a) and c) would be preferred. Essentially, Flink would be one of the sources that may detect unstable nodes, report them to that service, and then take the command from that service to block the problematic nodes. On the other hand, for users who do not have such a service, simply letting Flink be clever by itself and block suspicious nodes might be desired, to ensure the jobs run smoothly.
>>>>>>
>>>>>> So that indicates a) and b) here should be pluggable / optional.
>>>>>>
>>>>>> In light of this, maybe it would make sense to have something pluggable like an UnstableNodeReporter which exposes unstable nodes actively. (A more general interface would be JobInfoReporter<T>, which can be used to report any information of type <T>, but I'll keep the scope relevant to this FLIP here.) Personally speaking, I think it is OK to have a default implementation of a reporter which just tells Flink to block problematic nodes and also unblocks them after a timeout.
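A minimal sketch of what such a pluggable reporter could look like. Every name below (interface, class, methods) is a hypothetical illustration of the idea above, not an actual Flink interface:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the pluggable reporter idea; not an actual Flink interface.
interface UnstableNodeReporter {
    /** Invoked by a detector once a node is judged unstable. */
    void reportUnstableNode(String nodeId);
}

// A default implementation could simply block reported nodes via the blocklist
// mechanism and rely on the item timeout to unblock them later.
class BlockingNodeReporter implements UnstableNodeReporter {
    private final Set<String> blockedNodes = new HashSet<>();

    @Override
    public void reportUnstableNode(String nodeId) {
        // In a real implementation this would call into the blocklist mechanism.
        blockedNodes.add(nodeId);
    }

    public boolean isBlocked(String nodeId) {
        return blockedNodes.contains(nodeId);
    }
}

public class UnstableNodeReporterDemo {
    public static void main(String[] args) {
        BlockingNodeReporter reporter = new BlockingNodeReporter();
        reporter.reportUnstableNode("node-1");
        System.out.println(reporter.isBlocked("node-1")); // prints "true"
    }
}
```

The interface keeps detection (a/b) decoupled from the blocking action (c): a user with an external health service plugs in a reporter that forwards to that service, while others use a default that blocks directly.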
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jiangjie (Becket) Qin
>>>>>>
>>>>>> On Mon, May 2, 2022 at 3:27 PM, Роман Бойко <ro.v.bo...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for the good initiative, Lijie and Zhu!
>>>>>>>
>>>>>>> If possible, I'd like to participate in the development.
>>>>>>>
>>>>>>> I agree with the 3rd point of Konstantin's reply: we should consider how to move the information about blocklisted nodes/TMs from the active ResourceManager to non-active ones. Probably storing inside Zookeeper/Configmap might be helpful here.
>>>>>>>
>>>>>>> And I agree with Martijn that a lot of organizations don't want to expose such an API to the cluster user group. But I think it's necessary to have a mechanism for unblocking nodes/TMs anyway, to avoid incorrect automatic behaviour.
>>>>>>>
>>>>>>> And another small suggestion: I think it would be better to extend the *BlocklistedItem* class with an *endTimestamp* field and fill it at item creation. This simple addition will allow us to:
>>>>>>>
>>>>>>> - Provide users the ability to set the exact end time of a blocklist entry through the REST API
>>>>>>> - Avoid being tied to a single value of *cluster.resource-blacklist.item.timeout*
>>>>>>>
>>>>>>> On Mon, 2 May 2022 at 14:17, Chesnay Schepler <ches...@apache.org> wrote:
>>>>>>>
>>>>>>>> I do share the concern about blurring the lines a bit.
>>>>>>>>
>>>>>>>> That said, I'd prefer to not have any auto-detection and only have an opt-in mechanism to manually block processes/nodes.
>>>>>>>> To me this sounds yet again like one of those magical mechanisms that will rarely work just right. An external system can leverage way more information, after all.
>>>>>>>>
>>>>>>>> Moreover, I'm quite concerned about the complexity of this proposal: tracking on both the RM and JM side, syncing between components, and adjustments to the slot and resource protocols. In a way it seems overly complicated.
>>>>>>>>
>>>>>>>> If we look at it purely from an active resource management perspective, then there isn't really a need to touch the slot protocol at all (or in fact anything in the JobMaster), because there isn't any point in keeping blocked TMs around in the first place. They'd just be idling, potentially being shut down after a while by the RM because of it (unless we _also_ touch that logic). Here, blocking a process (be it by blocking the process or the node) is equivalent to shutting down the blocked process(es). Once the block is lifted we can just spin it back up.
>>>>>>>>
>>>>>>>> And I do wonder whether we couldn't apply the same line of thinking to standalone resource management. Here, being able to stop/restart a process/node manually should be a core requirement for a Flink deployment anyway.
>>>>>>>>
>>>>>>>> On 02/05/2022 08:49, Martijn Visser wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> Thanks for creating this FLIP. I can understand the problem, and I see value in the automatic detection and blocklisting.
>>>>>>>>> I do have two concerns with the ability to manually specify resources to be blocked:
>>>>>>>>>
>>>>>>>>> * Most organizations explicitly have a separation of concerns, meaning that there's one group responsible for managing a cluster and a user group that uses the cluster. With the introduction of this mechanism, the latter group can now influence the responsibility of the first group. So it is possible that someone from the user group blocks something and causes an outage (which could trigger paging mechanisms etc.) that impacts the first group.
>>>>>>>>> * How big is the group of people who can go through the process of manually identifying a node that isn't behaving as it should? I think this group is relatively limited. Does it then make sense to introduce such a feature, which would only be used by a really small user group of Flink? We still have to maintain, test and support such a feature.
>>>>>>>>>
>>>>>>>>> I'm +1 for the auto-detection features, but I'm leaning towards not exposing this to the user group and instead having it available strictly for cluster operators. They could then also set up their paging/metrics/logging systems to take this into account.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>>
>>>>>>>>> Martijn Visser
>>>>>>>>> https://twitter.com/MartijnVisser82
>>>>>>>>> https://github.com/MartijnVisser
>>>>>>>>>
>>>>>>>>> On Fri, 29 Apr 2022 at 09:39, Yangze Guo <karma...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for driving this, Zhu and Lijie.
>>>>>>>>>>
>>>>>>>>>> +1 for the overall proposal. Just sharing some cents here:
>>>>>>>>>>
>>>>>>>>>> - Why do we need to expose cluster.resource-blacklist.item.timeout-check-interval to the user? I think the semantics of `cluster.resource-blacklist.item.timeout` are sufficient for the user. How the timeout is enforced is Flink's internal implementation; I think exposing it would be very confusing, and we do not need to expose it to users.
>>>>>>>>>>
>>>>>>>>>> - The ResourceManager could notify the `BlacklistHandler` of task manager exceptions as well. For example, slot allocation might fail because the target task manager is busy or has network jitter. I don't mean we need to cover this case in this version, but we could also open a `notifyException` in `ResourceManagerBlacklistHandler`.
>>>>>>>>>>
>>>>>>>>>> - Before the blocklist is synced to the ResourceManager, will the slots of a blocked task manager continue to be released and allocated?
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Yangze Guo
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 28, 2022 at 3:11 PM, Lijie Wang <wangdachui9...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Konstantin,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your feedback.
>>>>>>>>>>> I will respond to your 4 remarks:
>>>>>>>>>>>
>>>>>>>>>>> 1) Thanks for reminding me of the controversy. I think "blocklist" is good enough, and I will change it in the FLIP.
>>>>>>>>>>>
>>>>>>>>>>> 2) Your suggestion for the REST API is a good idea. Based on the above, I would change the REST API to the following:
>>>>>>>>>>>
>>>>>>>>>>> POST/GET <host>/blocklist/nodes
>>>>>>>>>>> POST/GET <host>/blocklist/taskmanagers
>>>>>>>>>>> DELETE <host>/blocklist/node/<identifier>
>>>>>>>>>>> DELETE <host>/blocklist/taskmanager/<identifier>
>>>>>>>>>>>
>>>>>>>>>>> 3) If a node is blocklisted, all task managers on that node are blocklisted, and all slots on those TMs are unavailable. This is a bit like the TMs being lost, except that they are not really lost: they are in an unavailable state but still registered in the Flink cluster. They become available again once the corresponding blocklist item is removed. This behavior is the same in active and non-active clusters; however, in active clusters, these TMs may be released due to idle timeouts.
>>>>>>>>>>>
>>>>>>>>>>> 4) I prefer to keep the item timeout, for the following reasons:
>>>>>>>>>>>
>>>>>>>>>>> a) The timeout does not affect users adding or removing items via the REST API, and users can disable it by configuring it to Long.MAX_VALUE.
>> > >> > > > >>> >> > >> > > > >>> b) Some node problems can recover after a period of time >> > (such as >> > >> > > > machine >> > >> > > > >>> hotspots), in which case users may prefer that Flink can do >> > this >> > >> > > > >>> automatically instead of requiring the user to do it >> manually. >> > >> > > > >>> >> > >> > > > >>> >> > >> > > > >>> Best, >> > >> > > > >>> >> > >> > > > >>> Lijie >> > >> > > > >>> >> > >> > > > >>> Konstantin Knauf <kna...@apache.org> 于2022年4月27日周三 >> 19:23写道: >> > >> > > > >>> >> > >> > > > >>>> Hi Lijie, >> > >> > > > >>>> >> > >> > > > >>>> I think, this makes sense and +1 to only support manually >> > blocking >> > >> > > > >>>> taskmanagers and nodes. Maybe the different strategies can >> > also be >> > >> > > > >>>> maintained outside of Apache Flink. >> > >> > > > >>>> >> > >> > > > >>>> A few remarks: >> > >> > > > >>>> >> > >> > > > >>>> 1) Can we use another term than "bla.cklist" due to the >> > >> > controversy >> > >> > > > >> around >> > >> > > > >>>> the term? [1] There was also a Jira Ticket about this >> topic a >> > >> > while >> > >> > > > >> back >> > >> > > > >>>> and there was generally a consensus to avoid the term >> > blacklist & >> > >> > > > >> whitelist >> > >> > > > >>>> [2]? We could use "blocklist" "denylist" or "quarantined" >> > >> > > > >>>> 2) For the REST API, I'd prefer a slightly different >> design >> > as >> > >> > verbs >> > >> > > > >> like >> > >> > > > >>>> add/remove often considered an anti-pattern for REST APIs. >> > POST >> > >> > on a >> > >> > > > >> list >> > >> > > > >>>> item is generally the standard to add items. DELETE on the >> > >> > > individual >> > >> > > > >>>> resource is standard to remove an item. 
>>>>>>>>>>>>
>>>>>>>>>>>> POST <host>/quarantine/items
>>>>>>>>>>>> DELETE <host>/quarantine/items/<itemidentifier>
>>>>>>>>>>>>
>>>>>>>>>>>> We could also consider separating task managers and nodes in the REST API (and in the internal data structures). Any opinion on this?
>>>>>>>>>>>>
>>>>>>>>>>>> POST/GET <host>/quarantine/nodes
>>>>>>>>>>>> POST/GET <host>/quarantine/taskmanager
>>>>>>>>>>>> DELETE <host>/quarantine/nodes/<identifier>
>>>>>>>>>>>> DELETE <host>/quarantine/taskmanager/<identifier>
>>>>>>>>>>>>
>>>>>>>>>>>> 3) How would blocking nodes behave with non-active resource managers, i.e. standalone or reactive mode?
>>>>>>>>>>>>
>>>>>>>>>>>> 4) To keep the implementation even more minimal, do we need the timeout behavior? If items are added/removed manually, we could easily delegate this to the user. In my opinion the timeout behavior would fit better into specific strategies at a later point.
>>>>>>>>>>>>
>>>>>>>>>>>> Looking forward to your thoughts.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers and thank you,
>>>>>>>>>>>>
>>>>>>>>>>>> Konstantin
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
>>>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-18209
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Apr 27, 2022 at 04:04, Lijie Wang <wangdachui9...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Flink job failures may happen due to cluster node issues (insufficient disk space, bad hardware, network abnormalities). Flink will take care of the failures and redeploy the tasks. However, due to data locality and limited resources, the new tasks are very likely to be redeployed to the same nodes, which will result in continuous task abnormalities and affect job progress.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Currently, Flink users need to manually identify the problematic node and take it offline to solve this problem. But this approach has the following disadvantages:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Taking a node offline can be a heavy process. Users may need to contact cluster administrators to do this. The operation can even be dangerous and not allowed during some important business events.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. Identifying and solving this kind of problem manually would be slow and a waste of human resources.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To solve this problem, Zhu Zhu and I propose to introduce a blacklist mechanism for Flink to filter out problematic resources.
>>>>>>>>>>>>>
>>>>>>>>>>>>> You can find more details in FLIP-224[1].
>>>>>>>>>>>>> Looking forward to your feedback.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Lijie