Hi Shammon,

IIUC, you want more flexibility in controlling the two-phase strategy,
right?

> I want this because we would like to add a new slot-to-TM strategy, such
as SLOTS_NUM, in the future for OLAP to improve the performance of OLAP
jobs, which will use the TASKS strategy for task-to-slot. cc Guoyangze

Actually, a single option can achieve your requirement and control both
phases. We can add a new enum value to this option; it would use the new
strategy for slot-to-TM and the task-balanced strategy for task-to-slot.

Of course, two options would be more flexible. If there are too many
strategies, two separate options are easier for users to understand.
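For concreteness, here is a rough sketch of what the two alternatives could
look like in flink-conf.yaml. All option and enum names below are
hypothetical placeholders for this discussion, not an agreed-upon API:

```yaml
# Alternative 1: one option whose enum value controls both phases.
# A hypothetical TASKS value would use the task-balanced strategy for
# task-to-slot and the new balanced strategy for slot-to-TM.
taskmanager.load-balance.mode: TASKS        # NONE / SLOTS / TASKS (hypothetical)

# Alternative 2: two independent options, one per phase.
resourcemanager.slot-to-tm.strategy: SLOTS  # slot -> TM   (hypothetical name)
slotpool.task-to-slot.strategy: TASKS       # task -> slot (hypothetical name)
```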

Also, I have a question: what is the SLOTS_NUM strategy? Isn't it slot
balancing at the TM level?
I want to check whether it's similar to `cluster.evenly-spread-out-slots`.
If they are similar or the same, there aren't too many strategies, and one
option may be enough.
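To make the comparison concrete, my understanding of the intuition behind
`cluster.evenly-spread-out-slots` is that the ResourceManager prefers the
TaskManager with the lowest slot utilization when fulfilling a request. The
following is a simplified illustration of that idea, not Flink's actual code:

```java
import java.util.Comparator;
import java.util.List;

public class EvenlySpreadSketch {

    // Minimal stand-in for a TaskManager's slot bookkeeping.
    record Tm(String name, int usedSlots, int totalSlots) {
        double utilization() {
            return (double) usedSlots / totalSlots;
        }
    }

    // Pick the TM with the lowest slot utilization, mimicking the
    // "evenly spread out" preference.
    static String pickTaskManager(List<Tm> tms) {
        return tms.stream()
                .min(Comparator.comparingDouble(Tm::utilization))
                .orElseThrow()
                .name();
    }

    public static void main(String[] args) {
        List<Tm> tms = List.of(
                new Tm("tm-1", 3, 4),   // 75% utilized
                new Tm("tm-2", 1, 4),   // 25% utilized
                new Tm("tm-3", 2, 4));  // 50% utilized
        System.out.println(pickTaskManager(tms)); // tm-2: lowest utilization
    }
}
```

If SLOTS_NUM boils down to the same "least-utilized TM first" preference,
then it overlaps with the existing behavior.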

Best,
Rui

On Sat, Oct 7, 2023 at 11:29 AM Shammon FY <zjur...@gmail.com> wrote:

> Thanks Rui, I checked the code and you're right.
>
> As you described above, the entire process actually consists of two
> independent steps: slot to TM and task to slot. Currently we use the option
> `cluster.evenly-spread-out-slots` for both of them. Can we provide
> different options for the two steps, such as ANY/SLOTS for the RM and
> ANY/TASKS for the slot pool?
>
> I want this because we would like to add a new slot-to-TM strategy, such
> as SLOTS_NUM, in the future for OLAP to improve the performance of OLAP
> jobs, which will use the TASKS strategy for task-to-slot. cc Guoyangze
>
> Best,
> Shammon FY
>
> On Fri, Oct 6, 2023 at 6:19 PM xiangyu feng <xiangyu...@gmail.com> wrote:
>
>> Thanks Yuepeng and Rui for driving this Discussion.
>>
>> Internally when we try to use Flink 1.17.1 in production, we are also
>> suffering from the unbalanced task distribution problem for jobs with high
>> qps and complex dag. So +1 for the overall proposal.
>>
>> Some questions about the details:
>>
>> 1. About the waiting mechanism: will the waiting mechanism happen only in
>> the second level, 'assigning slots to TMs'? IIUC, the first level,
>> 'assigning tasks to slots', only needs the asynchronous slot result from
>> the slot pool.
>>
>> 2. About the slot LoadingWeight: it is reasonable to use the number of
>> tasks by default in the beginning, but it would be better if this could be
>> easily extended in the future to distinguish between CPU-intensive and
>> IO-intensive workloads. In some cases, TMs may have IO bottlenecks while
>> others have CPU bottlenecks.
>>
>> Regards,
>> Xiangyu
>>
>>
>> Yuepeng Pan <panyuep...@apache.org> wrote on Thu, Oct 5, 2023, at 18:34:
>>
>> > Hi, Zhu Zhu,
>> >
>> > Thanks for your feedback!
>> >
>> > > I think we can introduce a new config option
>> > > `taskmanager.load-balance.mode`,
>> > > which accepts "None"/"Slots"/"Tasks".
>> `cluster.evenly-spread-out-slots`
>> > > can be superseded by the "Slots" mode and get deprecated. In the
>> future
>> > > it can support more mode, e.g. "CpuCores", to work better for jobs
>> with
>> > > fine-grained resources. The proposed config option
>> > > `slot.request.max-interval`
>> > > then can be renamed to
>> > `taskmanager.load-balance.request-stabilizing-timeout`
>> > > to show its relation with the feature. The proposed
>> > `slot.sharing-strategy`
>> > > is not needed, because the configured "Tasks" mode will do the work.
>> >
>> > The new proposed configuration option sounds good to me.
>> >
>> > I have a small question. If we set the configuration value to 'Tasks',
>> > it will initiate two processes: balancing the number of tasks at the
>> > slot level and balancing the number of tasks across TaskManagers (TMs).
>> > Alternatively, if we configure it as 'Slots', the system will employ
>> > the LocalPreferred allocation policy (the default) when assigning
>> > tasks to slots, and it will ensure that the number of slots used
>> > across TMs is balanced.
>> > Does this configuration essentially combine the balanced selection
>> > strategies across the two dimensions into fixed configuration items?
>> >
>> > I would appreciate it if you could correct me if I've made any errors.
>> >
>> > Best,
>> > Yuepeng.
>> >
>>
>
