Re: [Discuss] FLIP-362: Support minimum resource limitation

xiangyu feng Wed, 04 Oct 2023 20:37:21 -0700

Hi David,

Glad to hear back from you!


> Agreed; in my mind, this boils down to the ability to quickly allocate new
> slots (TMs). This might differ between environments though.

Yes, for interactive queries cold start is a very tricky issue to deal with.
We should consider not only allocating new resources ASAP but also
warming up the newly added TaskManagers.
Internally, we have done a lot of work to address this problem.
The minimum resource limitation is the first step, and we will
keep working on this.

Appreciate your feedback again.

Regards,
Xiangyu


On Wed, Oct 4, 2023 at 10:58 PM David Morávek <[email protected]> wrote:

> > If not, what is the difference between the spare resources and redundant
> > taskmanagers?
>
> I wasn't aware of this one; good catch! The main difference is that you
> don't express the spare resources in terms of slots but in terms of task
> managers. Also, those options serve slightly different purposes, and users
> configuring the slot manager might not look for another option somewhere else.
>
> > Secondly, IMHO the difference between min-reserved resource and spare
> > resources is that we could configure a rather large min-reserved resource
>
> Agreed; in my mind, this boils down to the ability to quickly allocate new
> slots (TMs). This might differ between environments though. In most cases,
> there should be some time between interactive queries unless they're
> submitted programmatically. I can see the value of having both (min + slots
> to keep around).
>
> All in all, I don't have a strong opinion here; it's a significant
> improvement either way. This was just the first thing that I thought about
> after reading the FLIP.
>
> Best,
> D.
>
> On Tue, Oct 3, 2023 at 2:10 PM xiangyu feng <[email protected]> wrote:
>
> > Hi David,
> >
> > Thx for your feedback.
> >
> > First of all, for keeping some spare resources around, do you mean
> > 'Redundant TaskManagers'[1]? If not, what is the difference between the
> > spare resources and redundant taskmanagers?
> >
> > Secondly, IMHO the difference between min-reserved resource and spare
> > resources is that we could configure a rather large min-reserved resource
> > for use cases that submit lots of short-lived jobs concurrently, but we
> > don't want to configure a large spare resource since this might double the
> > total resource usage and lead to resource waste.
> >
> > Looking forward to hearing from you.
> >
> > Regards,
> > Xiangyu
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-18625
> >
> > On Tue, Oct 3, 2023 at 5:00 AM David Morávek <[email protected]> wrote:
> >
> > > Hi Xiangyu,
> > >
> > > The sentiment of the FLIP makes sense, but I keep wondering whether this
> > > is the best way to think about the problem. I assume that "interactive
> > > session cluster" users always want to keep some spare resources around
> > > (up to a configured threshold) to reduce cold start instead of statically
> > > configuring the minimum.
> > >
> > > It's just a tiny change from the original proposal, but it could make all
> > > the difference (eliminate overprovisioning, maintain latencies with a
> > > growing # of jobs, ..)
> > >
> > > WDYT?
> > >
> > > Best,
> > > D.
> > >
> > > On Mon, Sep 25, 2023 at 5:11 PM Jing Ge <[email protected]>
> > > wrote:
> > >
> > >> Hi Yangze,
> > >>
> > >> Thanks for the clarification. The example of two batch jobs teaming up
> > >> with one streaming job is interesting.
> > >>
> > >> Best regards,
> > >> Jing
> > >>
> > >> On Wed, Sep 20, 2023 at 7:19 PM Yangze Guo <[email protected]>
> > >> wrote:
> > >>
> > >> > Thanks for the comments, Jing.
> > >> >
> > >> > > Will the minimum resource configuration also take effect for
> > >> > > streaming jobs in application mode?
> > >> > > Since it is not recommended to configure
> > >> > > slotmanager.number-of-slots.max for streaming jobs, does it make
> > >> > > sense to disable it for common streaming jobs? At least disable the
> > >> > > check for avoiding the oscillation?
> > >> >
> > >> > Yes. The minimum resource configuration will only be disabled in
> > >> > standalone clusters atm. I agree it makes sense to disable it for a
> > >> > pure streaming job, however:
> > >> > - By default, the minimum resource is configured to 0. If users do not
> > >> > proactively set it, both the oscillation check and the minimum
> > >> > restriction can be considered disabled.
> > >> > - The minimum resource is a cluster-level configuration rather than a
> > >> > job-level configuration. If a user has an application with two batch
> > >> > jobs preceding the streaming job, they may also require this
> > >> > configuration to accelerate the execution of batch jobs.
> > >> >
> > >> > WDYT?
> > >> >
> > >> > Best,
> > >> > Yangze Guo
> > >> >
> > >> > On Thu, Sep 21, 2023 at 4:49 AM Jing Ge <[email protected]>
> > >> > wrote:
> > >> > >
> > >> > > Hi Xiangyu,
> > >> > >
> > >> > > Thanks for driving it! There is one thing I am not sure I
> > >> > > understand correctly.
> > >> > >
> > >> > > According to the FLIP: "The minimum resource limitation will be
> > >> > > implemented in the DefaultResourceAllocationStrategy of
> > >> > > FineGrainedSlotManager.
> > >> > >
> > >> > > Each time when SlotManager needs to reconcile the cluster resources
> > >> > > or fulfill job resource requirements, the
> > >> > > DefaultResourceAllocationStrategy will check if the minimum resource
> > >> > > requirement has been fulfilled. If it is not,
> > >> > > DefaultResourceAllocationStrategy will request new
> > >> > > PendingTaskManagers and FineGrainedSlotManager will allocate new
> > >> > > worker resources accordingly."
> > >> > >
> > >> > > "To avoid this oscillation, we need to check the worker number
> > >> > > derived from minimum and maximum resource configuration is
> > >> > > consistent before starting SlotManager."
> > >> > >
> > >> > > Will the minimum resource configuration also take effect for
> > >> > > streaming jobs in application mode? Since it is not recommended to
> > >> > > configure slotmanager.number-of-slots.max for streaming jobs, does
> > >> > > it make sense to disable it for common streaming jobs? At least
> > >> > > disable the check for avoiding the oscillation?
> > >> > >
> > >> > > Best regards,
> > >> > > Jing
> > >> > >
> > >> > >
> > >> > > On Tue, Sep 19, 2023 at 4:58 PM Chen Zhanghao <[email protected]>
> > >> > > wrote:
> > >> > >
> > >> > > > Thanks for driving this, Xiangyu. We use Session clusters for
> > >> > > > quick SQL debugging internally, and found cold-start job
> > >> > > > submission slow due to the lack of exactly the minimum resource
> > >> > > > reservation feature proposed here. This should improve the
> > >> > > > experience a lot for running short-lived jobs in session clusters.
> > >> > > >
> > >> > > > Best,
> > >> > > > Zhanghao Chen
> > >> > > > ________________________________
> > >> > > > From: Yangze Guo <[email protected]>
> > >> > > > Sent: September 19, 2023 13:10
> > >> > > > To: xiangyu feng <[email protected]>
> > >> > > > Cc: [email protected] <[email protected]>
> > >> > > > Subject: Re: [Discuss] FLIP-362: Support minimum resource limitation
> > >> > > >
> > >> > > > Thanks for driving this @Xiangyu. This is a feature that many
> > users
> > >> > > > have requested for a long time. +1 for the overall proposal.
> > >> > > >
> > >> > > > Best,
> > >> > > > Yangze Guo
> > >> > > >
> > >> > > > On Tue, Sep 19, 2023 at 11:48 AM xiangyu feng <[email protected]>
> > >> > > > wrote:
> > >> > > > >
> > >> > > > > Hi Devs,
> > >> > > > >
> > >> > > > > I'm opening this thread to discuss FLIP-362: Support minimum
> > >> > > > > resource limitation. The design doc can be found at:
> > >> > > > > FLIP-362: Support minimum resource limitation
> > >> > > > >
> > >> > > > > Currently, the Flink cluster only requests Task Managers (TMs)
> > >> > > > > when there is a resource requirement, and idle TMs are released
> > >> > > > > after a certain period of time. However, in certain scenarios,
> > >> > > > > such as running short-lived jobs in a session cluster and
> > >> > > > > scheduling batch jobs stage by stage, we need to improve the
> > >> > > > > efficiency of job execution by maintaining a certain number of
> > >> > > > > available workers in the cluster at all times.
> > >> > > > >
> > >> > > > > After discussing with Yangze, we introduced this new feature.
> > >> > > > > The newly added public options and proposed changes are
> > >> > > > > described in this FLIP.
> > >> > > > >
> > >> > > > > Looking forward to your feedback, thanks.
> > >> > > > >
> > >> > > > > Best regards,
> > >> > > > > Xiangyu
> > >> > > > >
> > >> > > >
> > >> >
> > >>
> > >
> >
>
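
For readers following the thread, the two FLIP passages quoted in Jing's message (requesting workers until the minimum is fulfilled, and checking that the worker numbers derived from the minimum and maximum are consistent) can be illustrated with a small sketch. This is not Flink code: `SlotLimits`, `check_limits` and `workers_to_request` are hypothetical names, limits are expressed in slots only, and the explanation of the oscillation (whole workers overshooting the maximum) is one plausible reading of the FLIP text rather than a confirmed description of the implementation.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SlotLimits:
    """Hypothetical cluster-level limits, expressed in slots."""
    min_slots: int = 0                # 0 means no minimum (the default in the thread)
    max_slots: Optional[int] = None   # None means no maximum
    slots_per_worker: int = 1


def workers_for(slots: int, slots_per_worker: int) -> int:
    """Whole workers needed to provide at least `slots` slots."""
    return math.ceil(slots / slots_per_worker)


def check_limits(limits: SlotLimits) -> None:
    """Reject limits under which the minimum cannot be met without exceeding the maximum.

    Assumption: workers are allocated whole, so meeting the minimum may
    overshoot it. If the overshoot exceeds the maximum, the cluster could keep
    adding a worker to reach the minimum and releasing it to respect the maximum.
    """
    if limits.min_slots == 0 or limits.max_slots is None:
        return  # nothing to check when either limit is unset
    needed = workers_for(limits.min_slots, limits.slots_per_worker)
    if needed * limits.slots_per_worker > limits.max_slots:
        raise ValueError(
            f"minimum of {limits.min_slots} slots needs {needed} workers, "
            f"which exceeds the maximum of {limits.max_slots} slots"
        )


def workers_to_request(limits: SlotLimits, registered: int, pending: int) -> int:
    """Extra workers to request so that the minimum is fulfilled."""
    needed = workers_for(limits.min_slots, limits.slots_per_worker)
    return max(0, needed - registered - pending)
```

With 4 slots per worker, a minimum of 5 slots needs 2 workers (8 slots), so a maximum of 8 slots passes the check while a maximum of 6 slots is rejected. With the default minimum of 0, `workers_to_request` always returns 0, which matches Yangze's remark that an unset minimum effectively disables the feature.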
