Thanks for driving this FLIP, Yuepeng Pan. +1 for the overall proposal
to support balanced scheduling.

I have some questions about the waiting mechanism and the strategy for
allocating slots to TMs:

1. Is it possible for the SlotPool to get the slot allocation results
from the SlotManager in advance and perform pre-allocation, instead of
waiting for the actual physical slots to be registered? This would make
the task deployment process smoother, especially when the job has a
large number of tasks.

2. If users enable cluster.evenly-spread-out-slots, the issue in
example 2 of section 2.2.3 can be resolved. Is my understanding
correct?
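
To make question 2 easier to discuss, here is a toy Java sketch of the
difference between spreading by slot count and spreading by task count.
It is only my own simplification, not code from the FLIP or from Flink:
the class SpreadSketch, its methods, and the request sizes are all
hypothetical, and it may not match example 2 of section 2.2.3 exactly.

```java
import java.util.Arrays;

/** Toy model only; none of these names are Flink classes. */
class SpreadSketch {

    /** Returns the TM index with the smallest load among TMs that still have a free slot. */
    static int pick(int[] load, int[] usedSlots, int slotsPerTm) {
        int best = -1;
        for (int tm = 0; tm < load.length; tm++) {
            if (usedSlots[tm] < slotsPerTm && (best < 0 || load[tm] < load[best])) {
                best = tm;
            }
        }
        if (best < 0) {
            throw new IllegalStateException("no free slot left");
        }
        return best;
    }

    /** Spread by slot count: each request goes to the TM with the fewest used slots. */
    static int[] bySlotCount(int numTms, int slotsPerTm, int[] tasksPerRequest) {
        int[] usedSlots = new int[numTms];
        int[] tasks = new int[numTms];
        for (int t : tasksPerRequest) {
            int tm = pick(usedSlots, usedSlots, slotsPerTm);
            usedSlots[tm]++;
            tasks[tm] += t;
        }
        return tasks;
    }

    /** Spread by task count: largest requests first, each to the TM with the fewest tasks. */
    static int[] byTaskCount(int numTms, int slotsPerTm, int[] tasksPerRequest) {
        int[] sorted = tasksPerRequest.clone();
        Arrays.sort(sorted); // ascending; walk from the end for descending order
        int[] usedSlots = new int[numTms];
        int[] tasks = new int[numTms];
        for (int i = sorted.length - 1; i >= 0; i--) {
            int tm = pick(tasks, usedSlots, slotsPerTm);
            usedSlots[tm]++;
            tasks[tm] += sorted[i];
        }
        return tasks;
    }

    public static void main(String[] args) {
        int[] requests = {3, 1, 3, 1}; // tasks per slot request (made-up numbers)
        System.out.println(Arrays.toString(bySlotCount(2, 2, requests)));
        System.out.println(Arrays.toString(byTaskCount(2, 2, requests)));
    }
}
```

With 2 TMs of 2 slots each and the made-up requests above, the first
call yields [6, 2] and the second yields [4, 4]: both TMs hold two slots
in either case, but only the second assignment evens out the tasks.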




Best,
Yangze Guo

On Wed, Sep 27, 2023 at 9:12 PM Zhu Zhu <reed...@gmail.com> wrote:
>
> Thanks Yuepeng and Rui for creating this FLIP.
>
> +1 in general
> The idea is straightforward: on a best-effort basis, gather all the slot
> requests and offered slots to form an overview before assigning slots,
> and try to balance the load of the task managers during assignment.
>
> I have one comment regarding the configuration for ease of use:
>
> IIUC, this FLIP uses an existing config 'cluster.evenly-spread-out-slots'
> as the main switch of the new feature. That is, from the user's perspective,
> with this improvement, the 'cluster.evenly-spread-out-slots' feature not
> only balances the number of slots on task managers, but also balances the
> number of tasks. This is a behavior change anyway. Besides that, it also
> requires users to set 'slot.sharing-strategy' to 'TASK_BALANCED_PREFERRED'
> to balance the tasks in each slot.
>
> I think we can introduce a new config option `taskmanager.load-balance.mode`,
> which accepts "None"/"Slots"/"Tasks". `cluster.evenly-spread-out-slots`
> can be superseded by the "Slots" mode and get deprecated. In the future
> it can support more modes, e.g. "CpuCores", to work better for jobs with
> fine-grained resources. The proposed config option `slot.request.max-interval`
> can then be renamed to `taskmanager.load-balance.request-stablizing-timeout`
> to show its relation to the feature. The proposed `slot.sharing-strategy`
> is not needed, because the configured "Tasks" mode will do the work.
>
> WDYT?
>
> Thanks,
> Zhu Zhu
>
> Yuepeng Pan <panyuep...@apache.org> wrote on Mon, Sep 25, 2023 at 16:26:
>
> > Hi all,
> >
> >
> > Fan Rui (CC’ed) and I created FLIP-370 [1] to support balanced task
> > scheduling.
> >
> >
> > Flink's current task deployment strategy sometimes leaves some TMs
> > (TaskManagers) with more tasks than others. The TMs that hold more tasks
> > see excessive resource utilization and become a bottleneck for the whole
> > job. A strategy that balances the task load across TMs and reduces such
> > bottlenecks would therefore be valuable.
> >
> >
> > The initial design and discussions can be found in the Flink JIRA [2] and
> > Google doc [3]. We really appreciate Zhu Zhu (CC’ed) for providing
> > valuable help and suggestions in advance.
> >
> >
> > Please refer to the FLIP[1] document for more details about the proposed
> > design and implementation. We welcome any feedback and opinions on this
> > proposal.
> >
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-370%3A+Support+Balanced+Tasks+Scheduling
> >
> > [2] https://issues.apache.org/jira/browse/FLINK-31757
> >
> > [3]
> > https://docs.google.com/document/d/14WhrSNGBdcsRl3IK7CZO-RaZ5KXU2X1dWqxPEFr3iS8
> >
> >
> > Best,
> >
> > Yuepeng Pan
> >
