Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Lijie Wang Tue, 10 May 2022 02:15:27 -0700

Hi Becket Qin,

Thanks for your suggestions. I have moved the description of the
configurations, metrics and REST API into the "Public Interface" section, and
made a few updates according to your suggestions. In this FLIP, there are
no public Java interfaces or pluggables that users need to implement
themselves.


Answers to your questions:
1. Yes, there are 2 block actions: MARK_BLOCKED and
MARK_BLOCKED_AND_EVACUATE_TASKS (they have been renamed). Currently, block
items can only be added through the REST API, so these 2 actions are mentioned
in the REST API part (which has now been moved to the public interface section).
2. I agree with you. I have changed the "Cause" field to String, and allow
users to specify it via REST API.
3. Yes, it is useful to allow different timeouts. As mentioned above, we
will introduce 2 fields, *timeout* and *endTimestamp*, into the ADD REST
API to specify when to remove the blocked item. Both fields are
optional. If neither is specified, the blocked item is permanent and will
not be removed. If only one is specified, it determines the removal time.
If both are specified, the minimum of *currentTimestamp + timeout* and
*endTimestamp* will be used as the time to remove the blocked item. To keep
the configuration minimal, we have removed the
*cluster.resource-blocklist.item.timeout* configuration option.
4. Yes, the block item will be overridden if the specified item already
exists. The ADD operation is *ADD or UPDATE*.
5. Yes. On the JM/RM side, all the blocklist information is maintained in
JMBlocklistHandler/RMBlocklistHandler. The blocklist handler (or an
abstraction of it behind other interfaces) will be propagated to the
different components.
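
To make answers 3 and 4 concrete, here is a minimal sketch of how the removal
time could be resolved and how ADD could behave as ADD or UPDATE. All names
(BlocklistSketch, resolveRemovalTime, addOrUpdate, isBlocked, PERMANENT) are
hypothetical and not part of the FLIP, and the sketch assumes *timeout* is a
duration in milliseconds and timestamps are epoch milliseconds.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

/** Illustrative only: every name here is an assumption, not the FLIP's API. */
final class BlocklistSketch {

    /** Hypothetical sentinel meaning "never remove" (a permanent item). */
    static final long PERMANENT = Long.MAX_VALUE;

    /** Node or task manager identifier mapped to its removal time (ms). */
    private final Map<String, Long> removalTimeById = new HashMap<>();

    /**
     * Neither field set: permanent. One field set: that field decides.
     * Both set: the earlier of currentTimestamp + timeout and endTimestamp.
     */
    static long resolveRemovalTime(
            long currentTimestamp, OptionalLong timeout, OptionalLong endTimestamp) {
        long byTimeout =
                timeout.isPresent() ? currentTimestamp + timeout.getAsLong() : PERMANENT;
        long byEnd = endTimestamp.isPresent() ? endTimestamp.getAsLong() : PERMANENT;
        return Math.min(byTimeout, byEnd);
    }

    /** ADD or UPDATE: an existing item with the same identifier is overridden. */
    void addOrUpdate(
            String id, long currentTimestamp, OptionalLong timeout, OptionalLong endTimestamp) {
        removalTimeById.put(id, resolveRemovalTime(currentTimestamp, timeout, endTimestamp));
    }

    /** True while an item exists for the identifier and has not yet expired. */
    boolean isBlocked(String id, long now) {
        Long removalTime = removalTimeById.get(id);
        return removalTime != null && now < removalTime;
    }
}
```

With this shape, extending the timeout of a blocked item is just a second
addOrUpdate call with the same identifier and a later removal time.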

Best,
Lijie

Becket Qin <[email protected]> wrote on Tue, May 10, 2022 at 11:26:

> Hi Lijie,
>
> Thanks for updating the FLIP. It looks like the public interface section
> did not fully reflect all the user sensible behavior and API. Can you put
> everything that users may be aware of there? That would include the REST
> API, metrics, configurations, public java Interfaces or pluggables that
> users may see or implement by themselves, as well as a brief summary of the
> behavior of the public API.
>
> Besides that, I have a few questions:
>
> 1. According to the conversation in the discussion thread, it looks like
> the BlockAction will have "MARK_BLOCKLISTED" and
> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If so, can you add
> that to the public interface as well?
>
> 2. At this point, the "Cause" field in the BlockingItem is a Throwable and
> is not reflected in the REST API. Should that be included in the query
> response? And should we change that field to be a String so users may
> specify the cause via the REST API when they block some nodes / TMs?
>
> 3. Would it be useful to allow users to have different timeouts for
> different blocked items? So while there is a default timeout, users can
> also override it via the REST API when they block an entity.
>
> 4. Regarding the ADD operation, if the specified item is already there,
> will the block item be overridden? For example, if the user wants to extend
> the timeout of a blocked item, can they just  issue an ADD command again?
>
> 5. I am not quite familiar with the details of this, but is there a source
> of truth for the blocked list? I think it might be good to have a single
> source of truth for the blocked list and just propagate that list to
> different components to take the action of actually blocking the resource.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <[email protected]>
> wrote:
>
> > Hi everyone,
> >
> > Based on the discussion in the mailing list, I updated the FLIP doc, the
> > changes include:
> > 1. Changed the description of the motivation section to more clearly
> > describe the problem this FLIP is trying to solve.
> > 2. Only *manual* blocking is supported (no auto-detection).
> > 3. Adopted some suggestions, such as *endTimestamp*.
> >
> > Best,
> > Lijie
> >
> >
> > Roman Boyko <[email protected]> wrote on Sat, May 7, 2022 at 19:25:
> >
> > > Hi Lijie!
> > >
> > >
> > >
> > >
> > > *a) “Probably storing inside Zookeeper/Configmap might be helpful here.”
> > > Can you explain it in detail? I don't fully understand that. In my
> > > opinion, non-active and active are the same, and no special treatment
> > > is required.*
> > >
> > > Sorry this was a misunderstanding from my side. I thought we were
> > > talking about the HA mode (but not about Active and Standalone
> > > ResourceManager). And the original question was - how to handle the
> > > blacklisted nodes list at the moment of leader change? Should we simply
> > > forget about them or try to pre-save that list on the remote storage?
> > >
> > > On Sat, 7 May 2022 at 10:51, Yang Wang <[email protected]> wrote:
> > >
> > > > Thanks Lijie and ZhuZhu for the explanation.
> > > >
> > > > > I just overlooked "MARK_BLOCKLISTED". At the task level, it is
> > > > > indeed functionality that external tools (e.g. kubectl taint)
> > > > > cannot support.
> > > >
> > > > Best,
> > > > Yang
> > > >
> > > > > Lijie Wang <[email protected]> wrote on Fri, May 6, 2022 at 22:18:
> > > >
> > > > > Thanks for your feedback, Jiangang and Martijn.
> > > > >
> > > > > @Jiangang
> > > > >
> > > > >
> > > > > > > For auto-detecting, I wonder how to make the strategy and mark a
> > > > > > > node blocked?
> > > > > >
> > > > > > In fact, we currently plan not to support auto-detection in this
> > > > > > FLIP. The part about auto-detection may be continued in a separate
> > > > > > FLIP in the future. Some people have the same concerns as you, and
> > > > > > the correctness and necessity of auto-detection may require further
> > > > > > discussion in the future.
> > > > > >
> > > > > > > In session mode, multi jobs can fail on the same bad node and the
> > > > > > > node should be marked blocked.
> > > > > > By design, the blocklist information will be shared among all jobs
> > > > > > in a cluster/session. The JM will sync blocklist information with
> > > > > > the RM.
> > > > >
> > > > > @Martijn
> > > > >
> > > > > > > I agree with Yang Wang on this.
> > > > > > As Zhu Zhu and I mentioned above, we think the MARK_BLOCKLISTED
> > > > > > action (which just limits the load of the node and does not kill
> > > > > > all the processes on it) is also important, and we think that
> > > > > > external systems (*yarn rmadmin or kubectl taint*) cannot support
> > > > > > it. So we think it makes sense even if it is only *manual*.
> > > > > >
> > > > > > > I also agree with Chesnay that magical mechanisms are indeed
> > > > > > > super hard to get right.
> > > > > > Yes, as you see, Jiangang (and a few others) have the same concern.
> > > > > > However, we currently plan not to support auto-detection in this
> > > > > > FLIP, only *manual* blocking. In addition, I'd like to say that the
> > > > > > FLIP provides a mechanism to support MARK_BLOCKLISTED and
> > > > > > MARK_BLOCKLISTED_AND_EVACUATE_TASKS, so the auto-detection may be
> > > > > > done by external systems.
> > > > >
> > > > > Best,
> > > > > Lijie
> > > > >
> > > > > > Martijn Visser <[email protected]> wrote on Fri, May 6, 2022 at 19:04:
> > > > >
> > > > > > > > If we only support to block nodes manually, then I could not
> > > > > > > > see the obvious advantages compared with current SRE's
> > > > > > > > approach(via *yarn rmadmin or kubectl taint*).
> > > > > > >
> > > > > > > I agree with Yang Wang on this.
> > > > > > >
> > > > > > > >  To me this sounds yet again like one of those magical
> > > > > > > > mechanisms that will rarely work just right.
> > > > > > >
> > > > > > > I also agree with Chesnay that magical mechanisms are indeed
> > > > > > > super hard to get right.
> > > > > >
> > > > > > Best regards,
> > > > > >
> > > > > > Martijn
> > > > > >
> > > > > > On Fri, 6 May 2022 at 12:03, Jiangang Liu <
> > [email protected]
> > > >
> > > > > > wrote:
> > > > > >
> > > > > >> Thanks for the valuable design. The auto-detecting can decrease
> > > great
> > > > > work
> > > > > >> for us. We have implemented the similar feature in our inner
> flink
> > > > > >> version.
> > > > > >> Below is something that I care about:
> > > > > >>
> > > > > >>    1. For auto-detecting, I wonder how to make the strategy and
> > > mark a
> > > > > >> node
> > > > > >>    blocked? Sometimes the blocked node is hard to be detected,
> for
> > > > > >> example,
> > > > > >>    the upper node or the down node will be blocked when network
> > > > > >> unreachable.
> > > > > >>    2. I see that the strategy is made in JobMaster side. How
> about
> > > > > >>    implementing the similar logic in resource manager? In
> session
> > > > mode,
> > > > > >> multi
> > > > > >>    jobs can fail on the same bad node and the node should be
> > marked
> > > > > >> blocked.
> > > > > >>    If the job makes the strategy, the node may be not marked
> > blocked
> > > > if
> > > > > >> the
> > > > > >>    fail times don't exceed the threshold.
> > > > > >>
> > > > > >>
> > > > > > >> Zhu Zhu <[email protected]> wrote on Thu, May 5, 2022 at 23:35:
> > > > > >>
> > > > > >> > Thank you for all your feedback!
> > > > > >> >
> > > > > >> > Besides the answers from Lijie, I'd like to share some of my
> > > > thoughts:
> > > > > >> > 1. Whether to enable automatical blocklist
> > > > > >> > Generally speaking, it is not a goal of FLIP-224.
> > > > > >> > The automatical way should be something built upon the
> blocklist
> > > > > >> > mechanism and well decoupled. It was designed to be a
> > configurable
> > > > > >> > blocklist strategy, but I think we can further decouple it by
> > > > > >> > introducing a abnormal node detector, as Becket suggested,
> which
> > > > just
> > > > > >> > uses the blocklist mechanism once bad nodes are detected.
> > However,
> > > > it
> > > > > >> > should be a separate FLIP with further dev discussions and
> > > feedback
> > > > > >> > from users. I also agree with Becket that different users have
> > > > > different
> > > > > >> > requirements, and we should listen to them.
> > > > > >> >
> > > > > >> > 2. Is it enough to just take away abnormal nodes externally
> > > > > >> > My answer is no. As Lijie has mentioned, we need a way to
> avoid
> > > > > >> > deploying tasks to temporary hot nodes. In this case, users
> may
> > > just
> > > > > >> > want to limit the load of the node and do not want to kill all
> > the
> > > > > >> > processes on it. Another case is the speculative execution[1]
> > > which
> > > > > >> > may also leverage this feature to avoid starting mirror tasks
> on
> > > > slow
> > > > > >> > nodes.
> > > > > >> >
> > > > > >> > Thanks,
> > > > > >> > Zhu
> > > > > >> >
> > > > > >> > [1]
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> > > > > >> >
> > > > > > >> > Lijie Wang <[email protected]> wrote on Thu, May 5, 2022 at 15:56:
> > > > > >> >
> > > > > >> > >
> > > > > >> > > Hi everyone,
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > Thanks for your feedback.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > There's one detail that I'd like to re-emphasize here
> because
> > it
> > > > can
> > > > > >> > affect the value and design of the blocklist mechanism
> (perhaps
> > I
> > > > > should
> > > > > >> > highlight it in the FLIP). We propose two actions in FLIP:
> > > > > >> > >
> > > > > >> > > 1) MARK_BLOCKLISTED: Just mark the task manager or node as
> > > > blocked.
> > > > > >> > Future slots should not be allocated from the blocked task
> > manager
> > > > or
> > > > > >> node.
> > > > > >> > But slots that are already allocated will not be affected. A
> > > typical
> > > > > >> > application scenario is to mitigate machine hotspots. In this
> > > case,
> > > > we
> > > > > >> hope
> > > > > >> > that subsequent resource allocations will not be on the hot
> > > machine,
> > > > > but
> > > > > >> > tasks currently running on it should not be affected.
> > > > > >> > >
> > > > > >> > > 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the task
> manager
> > or
> > > > > node
> > > > > >> as
> > > > > >> > blocked, and evacuate all tasks on it. Evacuated tasks will be
> > > > > >> restarted on
> > > > > >> > non-blocked task managers.
> > > > > >> > >
> > > > > >> > > For the above 2 actions, the former may more highlight the
> > > meaning
> > > > > of
> > > > > >> > this FLIP, because the external system cannot do that.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > Regarding *Manually* and *Automatically*, I basically agree
> > with
> > > > > >> @Becket
> > > > > >> > Qin: different users have different answers. Not all users’
> > > > deployment
> > > > > >> > environments have a special external system that can perform
> the
> > > > > anomaly
> > > > > >> > detection. In addition, adding pluggable/optional
> auto-detection
> > > > > doesn't
> > > > > >> > require much extra work on top of manual specification.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > I will answer your other questions one by one.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > @Yangze
> > > > > >> > >
> > > > > >> > > a) I think you are right, we do not need to expose the
> > > > > >> > `cluster.resource-blocklist.item.timeout-check-interval` to
> > users.
> > > > > >> > >
> > > > > >> > > b) We can abstract the `notifyException` to a separate
> > interface
> > > > > >> (maybe
> > > > > >> > BlocklistExceptionListener), and the
> > > ResourceManagerBlocklistHandler
> > > > > can
> > > > > >> > implement it in the future.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > @Martijn
> > > > > >> > >
> > > > > >> > > a) I also think the manual blocking should be done by
> cluster
> > > > > >> operators.
> > > > > >> > >
> > > > > >> > > b) I think manual blocking makes sense, because according to
> > my
> > > > > >> > experience, users are often the first to perceive the machine
> > > > problems
> > > > > >> > (because of job failover or delay), and they will contact
> > cluster
> > > > > >> operators
> > > > > >> > to solve it, or even tell the cluster operators which machine
> is
> > > > > >> > problematic. From this point of view, I think the people who
> > > really
> > > > > need
> > > > > >> > the manual blocking are the users, and it’s just performed by
> > the
> > > > > >> cluster
> > > > > >> > operator, so I think the manual blocking makes sense.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > @Chesnay
> > > > > >> > >
> > > > > >> > > We need to touch the logic of JM/SlotPool, because for
> > > > > >> MARK_BLOCKLISTED
> > > > > >> > , we need to know whether the slot is blocklisted when the
> task
> > is
> > > > > >> > FINISHED/CANCELLED/FAILED. If so,  SlotPool should release the
> > > slot
> > > > > >> > directly to avoid assigning other tasks (of this job) on it.
> If
> > we
> > > > > only
> > > > > >> > maintain the blocklist information on the RM, JM needs to
> > retrieve
> > > > it
> > > > > by
> > > > > >> > RPC. I think the performance overhead of that is relatively
> > large,
> > > > so
> > > > > I
> > > > > >> > think it's worth maintaining the blocklist information on the
> JM
> > > > side
> > > > > >> and
> > > > > >> > syncing them.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > @Роман
> > > > > >> > >
> > > > > >> > >     a) “Probably storing inside Zookeeper/Configmap might be
> > > > helpful
> > > > > >> > here.”  Can you explain it in detail? I don't fully understand
> > > that.
> > > > > In
> > > > > >> my
> > > > > >> > opinion, non-active and active are the same, and no special
> > > > treatment
> > > > > is
> > > > > >> > required.
> > > > > >> > >
> > > > > >> > > b) I agree with you, the `endTimestamp` makes sense, I will
> > add
> > > it
> > > > > to
> > > > > >> > FLIP.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > @Yang
> > > > > >> > >
> > > > > > >> > > As mentioned above, AFAIK, the external system cannot support
> > the
> > > > > >> > MARK_BLOCKLISTED action.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > Looking forward to your further feedback.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > Best,
> > > > > >> > >
> > > > > >> > > Lijie
> > > > > >> > >
> > > > > >> > >
> > > > > > >> > > Yang Wang <[email protected]> wrote on Tue, May 3, 2022 at 21:09:
> > > > > >> > >>
> > > > > >> > >> Thanks Lijie and Zhu for creating the proposal.
> > > > > >> > >>
> > > > > >> > >> I want to share some thoughts about Flink cluster
> operations.
> > > > > >> > >>
> > > > > >> > >> In the production environment, the SRE(aka Site Reliability
> > > > > Engineer)
> > > > > >> > >> already has many tools to detect the unstable nodes, which
> > > could
> > > > > take
> > > > > >> > the
> > > > > >> > >> system logs/metrics into consideration.
> > > > > > >> > >> Then they use graceful-decommission in YARN and taint in K8s
> > to
> > > > > >> prevent
> > > > > >> > new
> > > > > >> > >> allocations on these unstable nodes.
> > > > > >> > >> At last, they will evict all the containers and pods
> running
> > on
> > > > > these
> > > > > >> > nodes.
> > > > > >> > >> This mechanism also works for planned maintenance. So I am
> > > afraid
> > > > > >> this
> > > > > >> > is
> > > > > >> > >> not the typical use case for FLIP-224.
> > > > > >> > >>
> > > > > >> > >> If we only support to block nodes manually, then I could
> not
> > > see
> > > > > >> > >> the obvious advantages compared with current SRE's
> > approach(via
> > > > > *yarn
> > > > > >> > >> rmadmin or kubectl taint*).
> > > > > >> > >> At least, we need to have a pluggable component which could
> > > > expose
> > > > > >> the
> > > > > >> > >> potential unstable nodes automatically and block them if
> > > enabled
> > > > > >> > explicitly.
> > > > > >> > >>
> > > > > >> > >>
> > > > > >> > >> Best,
> > > > > >> > >> Yang
> > > > > >> > >>
> > > > > >> > >>
> > > > > >> > >>
> > > > > > >> > >> Becket Qin <[email protected]> wrote on Mon, May 2, 2022 at 16:36:
> > > > > >> > >>
> > > > > >> > >> > Thanks for the proposal, Lijie.
> > > > > >> > >> >
> > > > > >> > >> > This is an interesting feature and discussion, and
> somewhat
> > > > > related
> > > > > >> > to the
> > > > > >> > >> > design principle about how people should operate Flink.
> > > > > >> > >> >
> > > > > >> > >> > I think there are three things involved in this FLIP.
> > > > > >> > >> >      a) Detect and report the unstable node.
> > > > > >> > >> >      b) Collect the information of the unstable node and
> > > form a
> > > > > >> > blocklist.
> > > > > >> > >> >      c) Take the action to block nodes.
> > > > > >> > >> >
> > > > > >> > >> > My two cents:
> > > > > >> > >> >
> > > > > >> > >> > 1. It looks like people all agree that Flink should have
> > c).
> > > It
> > > > > is
> > > > > >> > not only
> > > > > >> > >> > useful for cases of node failures, but also handy for
> some
> > > > > planned
> > > > > >> > >> > maintenance.
> > > > > >> > >> >
> > > > > >> > >> > 2. People have different opinions on b), i.e. who should
> be
> > > the
> > > > > >> brain
> > > > > >> > to
> > > > > >> > >> > make the decision to block a node. I think this largely
> > > depends
> > > > > on
> > > > > >> > who we
> > > > > >> > >> > talk to. Different users would probably give different
> > > answers.
> > > > > For
> > > > > >> > people
> > > > > >> > >> > who do have a centralized node health management service,
> > let
> > > > > Flink
> > > > > >> > do just
> > > > > >> > >> > do a) and c) would be preferred. So essentially Flink
> would
> > > be
> > > > > one
> > > > > >> of
> > > > > >> > the
> > > > > >> > >> > sources that may detect unstable nodes, report it to that
> > > > > service,
> > > > > >> > and then
> > > > > >> > >> > take the command from that service to block the
> problematic
> > > > > nodes.
> > > > > >> On
> > > > > >> > the
> > > > > >> > >> > other hand, for users who do not have such a service,
> > simply
> > > > > >> letting
> > > > > >> > Flink
> > > > > >> > >> > be clever by itself to block the suspicious nodes might
> be
> > > > > desired
> > > > > >> to
> > > > > >> > >> > ensure the jobs are running smoothly.
> > > > > >> > >> >
> > > > > >> > >> > So that indicates a) and b) here should be pluggable /
> > > > optional.
> > > > > >> > >> >
> > > > > >> > >> > In light of this, maybe it would make sense to have
> > something
> > > > > >> > pluggable
> > > > > >> > >> > like a UnstableNodeReporter which exposes unstable nodes
> > > > > actively.
> > > > > >> (A
> > > > > >> > more
> > > > > >> > >> > general interface should be JobInfoReporter<T> which can
> be
> > > > used
> > > > > to
> > > > > >> > report
> > > > > >> > >> > any information of type <T>. But I'll just keep the scope
> > > > > relevant
> > > > > >> to
> > > > > >> > this
> > > > > >> > >> > FLIP here). Personally speaking, I think it is OK to
> have a
> > > > > default
> > > > > >> > >> > implementation of a reporter which just tells Flink to
> take
> > > > > action
> > > > > >> to
> > > > > >> > block
> > > > > >> > >> > problematic nodes and also unblocks them after timeout.
> > > > > >> > >> >
> > > > > >> > >> > Thanks,
> > > > > >> > >> >
> > > > > >> > >> > Jiangjie (Becket) Qin
> > > > > >> > >> >
> > > > > >> > >> >
> > > > > >> > >> > On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
> > > > [email protected]
> > > > > >
> > > > > >> > wrote:
> > > > > >> > >> >
> > > > > >> > >> > > Thanks for good initiative, Lijie and Zhu!
> > > > > >> > >> > >
> > > > > >> > >> > > If it's possible I'd like to participate in
> development.
> > > > > >> > >> > >
> > > > > >> > >> > > I agree with 3rd point of Konstantin's reply - we
> should
> > > > > consider
> > > > > >> > to move
> > > > > >> > >> > > somehow the information of blocklisted nodes/TMs from
> > > active
> > > > > >> > >> > > ResourceManager to non-active ones. Probably storing
> > inside
> > > > > >> > >> > > Zookeeper/Configmap might be helpful here.
> > > > > >> > >> > >
> > > > > >> > >> > > And I agree with Martijn that a lot of organizations
> > don't
> > > > want
> > > > > >> to
> > > > > >> > expose
> > > > > >> > >> > > such API for a cluster user group. But I think it's
> > > necessary
> > > > > to
> > > > > >> > have the
> > > > > >> > >> > > mechanism for unblocking the nodes/TMs anyway for
> > avoiding
> > > > > >> incorrect
> > > > > >> > >> > > automatic behaviour.
> > > > > >> > >> > >
> > > > > >> > >> > > And another one small suggestion - I think it would be
> > > better
> > > > > to
> > > > > >> > extend
> > > > > >> > >> > the
> > > > > >> > >> > > *BlocklistedItem* class with the *endTimestamp* field
> and
> > > > fill
> > > > > it
> > > > > >> > at the
> > > > > >> > >> > > item creation. This simple addition will allow to:
> > > > > >> > >> > >
> > > > > >> > >> > >    -
> > > > > >> > >> > >
> > > > > >> > >> > >    Provide the ability to users to setup the exact time
> > of
> > > > > >> > blocklist end
> > > > > >> > >> > >    through RestAPI
> > > > > >> > >> > >    -
> > > > > >> > >> > >
> > > > > >> > >> > >    Not being tied to a single value of
> > > > > >> > >> > >    *cluster.resource-blacklist.item.timeout*
> > > > > >> > >> > >
> > > > > >> > >> > >
> > > > > >> > >> > > On Mon, 2 May 2022 at 14:17, Chesnay Schepler <
> > > > > >> [email protected]>
> > > > > >> > >> > wrote:
> > > > > >> > >> > >
> > > > > >> > >> > > > I do share the concern between blurring the lines a
> > bit.
> > > > > >> > >> > > >
> > > > > >> > >> > > > That said, I'd prefer to not have any auto-detection
> > and
> > > > only
> > > > > >> > have an
> > > > > >> > >> > > > opt-in mechanism
> > > > > >> > >> > > > to manually block processes/nodes. To me this sounds
> > yet
> > > > > again
> > > > > >> > like one
> > > > > >> > >> > > > of those
> > > > > >> > >> > > > magical mechanisms that will rarely work just right.
> > > > > >> > >> > > > An external system can leverage way more information
> > > after
> > > > > all.
> > > > > >> > >> > > >
> > > > > >> > >> > > > Moreover, I'm quite concerned about the complexity of
> > > this
> > > > > >> > proposal.
> > > > > >> > >> > > > Tracking on both the RM/JM side; syncing between
> > > > components;
> > > > > >> > >> > adjustments
> > > > > >> > >> > > > to the
> > > > > >> > >> > > > slot and resource protocol.
> > > > > >> > >> > > >
> > > > > >> > >> > > > In a way it seems overly complicated.
> > > > > >> > >> > > >
> > > > > >> > >> > > > If we look at it purely from an active resource
> > > management
> > > > > >> > perspective,
> > > > > >> > >> > > > then there
> > > > > >> > >> > > > isn't really a need to touch the slot protocol at all
> > (or
> > > > in
> > > > > >> fact
> > > > > >> > to
> > > > > >> > >> > > > anything in the JobMaster),
> > > > > >> > >> > > > because there isn't any point in keeping around
> blocked
> > > TMs
> > > > > in
> > > > > >> the
> > > > > >> > >> > first
> > > > > >> > >> > > > place.
> > > > > >> > >> > > > They'd just be idling, potentially shutting down
> after
> > a
> > > > > while
> > > > > >> by
> > > > > >> > the
> > > > > >> > >> > RM
> > > > > >> > >> > > > because of
> > > > > >> > >> > > > it (unless we _also_ touch that logic).
> > > > > >> > >> > > > Here the blocking of a process (be it by blocking the
> > > > process
> > > > > >> or
> > > > > >> > node)
> > > > > >> > >> > is
> > > > > >> > >> > > > equivalent with shutting down the blocked
> process(es).
> > > > > >> > >> > > > Once the block is lifted we can just spin it back up.
> > > > > >> > >> > > >
> > > > > >> > >> > > > And I do wonder whether we couldn't apply the same
> line
> > > of
> > > > > >> > thinking to
> > > > > >> > >> > > > standalone resource management.
> > > > > >> > >> > > > Here being able to stop/restart a process/node
> manually
> > > > > should
> > > > > >> be
> > > > > >> > a
> > > > > >> > >> > core
> > > > > >> > >> > > > requirement for a Flink deployment anyway.
> > > > > >> > >> > > >
> > > > > >> > >> > > >
> > > > > >> > >> > > > On 02/05/2022 08:49, Martijn Visser wrote:
> > > > > >> > >> > > > > Hi everyone,
> > > > > >> > >> > > > >
> > > > > >> > >> > > > > Thanks for creating this FLIP. I can understand the
> > > > problem
> > > > > >> and
> > > > > >> > I see
> > > > > >> > >> > > > value
> > > > > >> > >> > > > > in the automatic detection and blocklisting. I do
> > have
> > > > some
> > > > > >> > concerns
> > > > > >> > >> > > with
> > > > > >> > >> > > > > the ability to manually specify to be blocked
> > > resources.
> > > > I
> > > > > >> have
> > > > > >> > two
> > > > > >> > >> > > > > concerns;
> > > > > >> > >> > > > >
> > > > > >> > >> > > > > * Most organizations explicitly have a separation
> of
> > > > > >> concerns,
> > > > > >> > >> > meaning
> > > > > >> > >> > > > that
> > > > > >> > >> > > > > there's a group who's responsible for managing a
> > > cluster
> > > > > and
> > > > > >> > there's
> > > > > >> > >> > a
> > > > > >> > >> > > > user
> > > > > >> > >> > > > > group who uses that cluster. With the introduction
> of
> > > > this
> > > > > >> > mechanism,
> > > > > >> > >> > > the
> > > > > >> > >> > > > > latter group now can influence the responsibility
> of
> > > the
> > > > > >> first
> > > > > >> > group.
> > > > > >> > >> > > So
> > > > > >> > >> > > > it
> > > > > >> > >> > > > > can be possible that someone from the user group
> > blocks
> > > > > >> > something,
> > > > > >> > >> > > which
> > > > > >> > >> > > > > causes an outage (which could result in paging
> > > mechanism
> > > > > >> > triggering
> > > > > >> > >> > > etc)
> > > > > >> > >> > > > > which impacts the first group.
> > > > > >> > >> > > > > * How big is the group of people who can go through
> > the
> > > > > >> process
> > > > > >> > of
> > > > > >> > >> > > > manually
> > > > > >> > >> > > > > identifying a node that isn't behaving as it should
> > > be? I
> > > > > do
> > > > > >> > think
> > > > > >> > >> > this
> > > > > >> > >> > > > > group is relatively limited. Does it then make
> sense
> > to
> > > > > >> > introduce
> > > > > >> > >> > such
> > > > > >> > >> > > a
> > > > > >> > >> > > > > feature, which would only be used by a really small
> > > user
> > > > > >> group
> > > > > >> > of
> > > > > >> > >> > > Flink?
> > > > > >> > >> > > > We
> > > > > >> > >> > > > > still have to maintain, test and support such a
> > > feature.
> > > > > >> > >> > > > >
> > > > > >> > >> > > > > I'm +1 for the autodetection features, but I'm
> > leaning
> > > > > >> towards
> > > > > >> > not
> > > > > >> > >> > > > exposing
> > > > > >> > >> > > > > this to the user group but having this available
> > > strictly
> > > > > for
> > > > > >> > cluster
> > > > > >> > >> > > > > operators. They could then also set up their
> > > > > >> > paging/metrics/logging
> > > > > >> > >> > > > system
> > > > > >> > >> > > > > to take this into account.
> > > > > >> > >> > > > >
> > > > > >> > >> > > > > Best regards,
> > > > > >> > >> > > > >
> > > > > >> > >> > > > > Martijn Visser
> > > > > >> > >> > > > > https://twitter.com/MartijnVisser82
> > > > > >> > >> > > > > https://github.com/MartijnVisser
> > > > > >> > >> > > > >
> > > > > >> > >> > > > >
> > > > > >> > >> > > > > On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
> > > > > [email protected]
> > > > > >> >
> > > > > >> > wrote:
> > > > > >> > >> > > > >
> > > > > >> > >> > > > >> Thanks for driving this, Zhu and Lijie.
> > > > > >> > >> > > > >>
> > > > > >> > >> > > > >> +1 for the overall proposal. Just share some cents
> > > here:
> > > > > >> > >> > > > >>
> > > > > >> > >> > > > >> - Why do we need to expose
> > > > > >> > >> > > > >>
> > cluster.resource-blacklist.item.timeout-check-interval
> > > > to
> > > > > >> the
> > > > > >> > user?
> > > > > >> > >> > > > >> I think the semantics of
> > > > > >> > `cluster.resource-blacklist.item.timeout`
> > > > > >> > >> > is
> > > > > >> > >> > > > >> sufficient for the user. How to guarantee the
> > timeout
> > > > > >> > mechanism is
> > > > > >> > >> > > > >> Flink's internal implementation. I think it will
> be
> > > very
> > > > > >> > confusing
> > > > > >> > >> > and
> > > > > >> > >> > > > >> we do not need to expose it to users.
> > > > > >> > >> > > > >>
> > > > > >> > >> > > > >> - ResourceManager can notify the exception of a
> task
> > > > > >> manager to
> > > > > >> > >> > > > >> `BlacklistHandler` as well.
> > > > > >> > >> > > > >> For example, the slot allocation might fail in
> case
> > > the
> > > > > >> target
> > > > > >> > task
> > > > > >> > >> > > > >> manager is busy or has a network jitter. I don't
> > mean
> > > we
> > > > > >> need
> > > > > >> > to
> > > > > >> > >> > cover
> > > > > >> > >> > > > >> this case in this version, but we can also open a
> > > > > >> > `notifyException`
> > > > > >> > >> > in
> > > > > >> > >> > > > >> `ResourceManagerBlacklistHandler`.
> > > > > >> > >> > > > >>
> > > > > >> > >> > > > >> - Before we sync the blocklist to ResourceManager,
> > > will
> > > > > the
> > > > > >> > slot of
> > > > > >> > >> > a
> > > > > >> > >> > > > >> blocked task manager continues to be released and
> > > > > allocated?
> > > > > >> > >> > > > >>
> > > > > >> > >> > > > >> Best,
> > > > > >> > >> > > > >> Yangze Guo
> > > > > >> > >> > > > >>
> > > > > >> > >> > > > >> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <
> > > > > >> > >> > [email protected]>
> > > > > >> > >> > > > >> wrote:
> > > > > >> > >> > > > >>> Hi Konstantin,
> > > > > >> > >> > > > >>>
> > > > > > >> > >> > > > >>> Thanks for your feedback. I will respond to your 4
> > > > remarks:
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>> 1) Thanks for reminding me of the controversy. I
> > > think
> > > > > >> > “BlockList”
> > > > > >> > >> > is
> > > > > >> > >> > > > >> good
> > > > > >> > >> > > > >>> enough, and I will change it in FLIP.
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>>
> > > > > > >> > >> > > > >>> 2) Your suggestion for the REST API is a good idea. Based on
> > > > > > >> > >> > > > >>> the above, I would change the REST API as follows:
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>> POST/GET <host>/blocklist/nodes
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>> POST/GET <host>/blocklist/taskmanagers
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>> DELETE <host>/blocklist/node/<identifier>
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>> DELETE <host>/blocklist/taskmanager/<identifier>
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>>
> > > > > > >> > >> > > > >>> 3) If a node is blocked/blocklisted, it means that all task
> > > > > > >> > >> > > > >>> managers on this node are blocklisted. All slots on these TMs
> > > > > > >> > >> > > > >>> are not available. This is actually a bit like losing the TMs,
> > > > > > >> > >> > > > >>> but these TMs are not really lost: they are in an unavailable
> > > > > > >> > >> > > > >>> status, and they are still registered in this Flink cluster.
> > > > > > >> > >> > > > >>> They will be available again once the corresponding blocklist
> > > > > > >> > >> > > > >>> item is removed. This behavior is the same in active/non-active
> > > > > > >> > >> > > > >>> clusters. However, in active clusters, these TMs may be
> > > > > > >> > >> > > > >>> released due to idle timeouts.
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>>
> > > > > > >> > >> > > > >>> 4) For the item timeout, I prefer to keep it. The reasons are
> > > > > > >> > >> > > > >>> as follows:
> > > > > > >> > >> > > > >>> a) The timeout will not affect users adding or removing items
> > > > > > >> > >> > > > >>> via the REST API, and users can disable it by configuring it to
> > > > > > >> > >> > > > >>> Long.MAX_VALUE.
> > > > > >> > >> > > > >>>
> > > > > > >> > >> > > > >>> b) Some node problems can recover after a period of time (such
> > > > > > >> > >> > > > >>> as machine hotspots), in which case users may prefer that Flink
> > > > > > >> > >> > > > >>> remove the item automatically instead of requiring the user to
> > > > > > >> > >> > > > >>> do it manually.
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>> Best,
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>> Lijie
> > > > > >> > >> > > > >>>
> > > > > > >> > >> > > > >>> Konstantin Knauf <[email protected]> wrote on Wed, Apr 27,
> > > > > > >> > >> > > > >>> 2022 at 19:23:
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>>> Hi Lijie,
> > > > > >> > >> > > > >>>>
> > > > > > >> > >> > > > >>>> I think this makes sense, and +1 to only supporting manually
> > > > > > >> > >> > > > >>>> blocking taskmanagers and nodes. Maybe the different strategies
> > > > > > >> > >> > > > >>>> can also be maintained outside of Apache Flink.
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>> A few remarks:
> > > > > >> > >> > > > >>>>
> > > > > > >> > >> > > > >>>> 1) Can we use another term than "blacklist" due to the
> > > > > > >> > >> > > > >>>> controversy around the term? [1] There was also a Jira ticket
> > > > > > >> > >> > > > >>>> about this topic a while back, and there was generally a
> > > > > > >> > >> > > > >>>> consensus to avoid the terms blacklist & whitelist [2]. We
> > > > > > >> > >> > > > >>>> could use "blocklist", "denylist" or "quarantined".
> > > > > > >> > >> > > > >>>> 2) For the REST API, I'd prefer a slightly different design, as
> > > > > > >> > >> > > > >>>> verbs like add/remove are often considered an anti-pattern for
> > > > > > >> > >> > > > >>>> REST APIs. POST on a list resource is generally the standard
> > > > > > >> > >> > > > >>>> way to add items. DELETE on the individual resource is the
> > > > > > >> > >> > > > >>>> standard way to remove an item.
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>> POST <host>/quarantine/items
> > > > > >> > >> > > > >>>> DELETE <host>/quarantine/items/<itemidentifier>
> > > > > >> > >> > > > >>>>
> > > > > > >> > >> > > > >>>> We could also consider separating taskmanagers and nodes in the
> > > > > > >> > >> > > > >>>> REST API (and internal data structures). Any opinion on this?
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>> POST/GET <host>/quarantine/nodes
> > > > > >> > >> > > > >>>> POST/GET <host>/quarantine/taskmanager
> > > > > >> > >> > > > >>>> DELETE <host>/quarantine/nodes/<identifier>
> > > > > > >> > >> > > > >>>> DELETE <host>/quarantine/taskmanager/<identifier>
> > > > > >> > >> > > > >>>>
> > > > > > >> > >> > > > >>>> 3) How would blocking nodes behave with non-active resource
> > > > > > >> > >> > > > >>>> managers, i.e. standalone or reactive mode?
> > > > > >> > >> > > > >>>>
> > > > > > >> > >> > > > >>>> 4) To keep the implementation even more minimal, do we need the
> > > > > > >> > >> > > > >>>> timeout behavior? If items are added/removed manually, we could
> > > > > > >> > >> > > > >>>> delegate this to the user easily. In my opinion, the timeout
> > > > > > >> > >> > > > >>>> behavior would better fit into specific strategies at a later
> > > > > > >> > >> > > > >>>> point.
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>> Looking forward to your thoughts.
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>> Cheers and thank you,
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>> Konstantin
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>> [1]
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>
> > > > > >> > >> > > >
> > > > > >> > >> > >
> > > > > >> > >> >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> > > > > >> > >> > > > >>>> [2]
> > > https://issues.apache.org/jira/browse/FLINK-18209
> > > > > >> > >> > > > >>>>
> > > > > > >> > >> > > > >>>> On Wed, Apr 27, 2022 at 04:04, Lijie Wang
> > > > > > >> > >> > > > >>>> <[email protected]> wrote:
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>>> Hi all,
> > > > > >> > >> > > > >>>>>
> > > > > > >> > >> > > > >>>>> Flink job failures may happen due to cluster node issues
> > > > > > >> > >> > > > >>>>> (insufficient disk space, bad hardware, network abnormalities).
> > > > > > >> > >> > > > >>>>> Flink will take care of the failures and redeploy the tasks.
> > > > > > >> > >> > > > >>>>> However, due to data locality and limited resources, the new
> > > > > > >> > >> > > > >>>>> tasks are very likely to be redeployed to the same nodes, which
> > > > > > >> > >> > > > >>>>> will result in continuous task abnormalities and affect job
> > > > > > >> > >> > > > >>>>> progress.
> > > > > >> > >> > > > >>>>>
> > > > > > >> > >> > > > >>>>> Currently, Flink users need to manually identify the
> > > > > > >> > >> > > > >>>>> problematic node and take it offline to solve this problem. But
> > > > > > >> > >> > > > >>>>> this approach has the following disadvantages:
> > > > > >> > >> > > > >>>>>
> > > > > > >> > >> > > > >>>>> 1. Taking a node offline can be a heavy process. Users may need
> > > > > > >> > >> > > > >>>>> to contact cluster administrators to do this. The operation can
> > > > > > >> > >> > > > >>>>> even be dangerous and not allowed during some important
> > > > > > >> > >> > > > >>>>> business events.
> > > > > >> > >> > > > >>>>>
> > > > > > >> > >> > > > >>>>> 2. Identifying and solving this kind of problem manually would
> > > > > > >> > >> > > > >>>>> be slow and a waste of human resources.
> > > > > >> > >> > > > >>>>>
> > > > > > >> > >> > > > >>>>> To solve this problem, Zhu Zhu and I propose to introduce a
> > > > > > >> > >> > > > >>>>> blacklist mechanism for Flink to filter out problematic
> > > > > > >> > >> > > > >>>>> resources.
> > > > > >> > >> > > > >>>>>
> > > > > >> > >> > > > >>>>>
> > > > > > >> > >> > > > >>>>> You can find more details in FLIP-224 [1]. Looking forward to
> > > > > > >> > >> > > > >>>>> your feedback.
> > > > > > >> > >> > > > >>>>>
> > > > > > >> > >> > > > >>>>> [1]
> > > > > > >> > >> > > > >>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> > > > > >> > >> > > > >>>>>
> > > > > >> > >> > > > >>>>> Best,
> > > > > >> > >> > > > >>>>>
> > > > > >> > >> > > > >>>>> Lijie
> > > > > >> > >> > > > >>>>>
> > > > > >> > >> > > >
> > > > > >> > >> > > >
> > > > > >> > >> > >
> > > > > >> > >> >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > > Roman Boyko
> > > e.: [email protected]
> > >
> >
>
