Hi Yuepeng,

Thanks for your feedback. I agree with you; both approaches can achieve
the goal.
As long as we can easily extend the balancing strategy to consider more
than one factor without changing the interface, the solution is OK for me.
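
To make the "without changing the interface" point concrete, here is a
minimal sketch of the variant where the loads stay collapsed inside the
LoadingWeight implementation. It is only an illustration under assumptions:
LoadType, MultiFactorLoadingWeight, the coefficient map, and expanded() are
hypothetical names, and LoadingWeight / WeightLoadable below are simplified
stand-ins, not the actual interfaces from the FLIP.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.Map;

class LoadingWeightSketch {

    /** Hypothetical load dimensions; the thread only confirms task count today. */
    enum LoadType { TASK_COUNT, CPU, IO }

    /** Simplified stand-in for LoadingWeight; the real interface may differ. */
    interface LoadingWeight extends Comparable<LoadingWeight> {
        double getLoading();

        @Override
        default int compareTo(LoadingWeight other) {
            return Double.compare(getLoading(), other.getLoading());
        }
    }

    /** Simplified stand-in for WeightLoadable; it still returns a single weight. */
    interface WeightLoadable {
        LoadingWeight getLoading();
    }

    /** Keeps per-factor loads collapsed in a map and exposes one scalar. */
    static final class MultiFactorLoadingWeight implements LoadingWeight {
        private final Map<LoadType, Double> loads = new EnumMap<>(LoadType.class);
        private final Map<LoadType, Double> coefficients = new EnumMap<>(LoadType.class);

        MultiFactorLoadingWeight(Map<LoadType, Double> loads,
                                 Map<LoadType, Double> coefficients) {
            this.loads.putAll(loads);
            this.coefficients.putAll(coefficients);
        }

        /** Weighted sum of all factors; a factor without a coefficient counts as 0. */
        @Override
        public double getLoading() {
            double total = 0.0;
            for (Map.Entry<LoadType, Double> e : loads.entrySet()) {
                total += e.getValue() * coefficients.getOrDefault(e.getKey(), 0.0);
            }
            return total;
        }

        /** Read-only per-factor view for callers that need the expanded values. */
        Map<LoadType, Double> expanded() {
            return Collections.unmodifiableMap(loads);
        }
    }

    public static void main(String[] args) {
        Map<LoadType, Double> coefficients =
                Map.of(LoadType.TASK_COUNT, 1.0, LoadType.CPU, 2.0);
        LoadingWeight busy = new MultiFactorLoadingWeight(
                Map.of(LoadType.TASK_COUNT, 4.0, LoadType.CPU, 0.5), coefficients);
        LoadingWeight idle = new MultiFactorLoadingWeight(
                Map.of(LoadType.TASK_COUNT, 2.0, LoadType.CPU, 1.0), coefficients);
        System.out.println("busy is heavier: " + (busy.compareTo(idle) > 0));
    }
}
```

With this shape, callers keep comparing a single LoadingWeight, and adding a
new factor only touches the enum and the coefficients. The alternative of
returning Set<LoadingWeight> from WeightLoadable would move the combination
logic to the calling site instead.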

Regards,
Xiangyu

Yuepeng Pan <panyuep...@apache.org> wrote on Wed, Oct 11, 2023 at 17:38:

> Hi, xiangyu.
> Thanks for your quick reply.
>
> >interface currently only includes a description of the number of tasks.
> So,
> >IIUC, If there is a need to further expand
> >current interface and its implementations, right?
>
> Yes, that's indeed the case.
>
> >I checked the interface design of LoadingWeight and WeightLoadable, AFAIK
> >currently it only supports comparing the load for one factor. If we want
> to
> >add more loading factors, LoadingWeight might need to add a 'LoadType'
> >field for distinction, WeightLoadable might need to return
> >Set<LoadingWeight>.
>
> Thank you for the clarification, I think I roughly understand your
> description:
> In fact, regarding the specific implementation and extension of this
> LoadingWeight, we can extend it based on this interface and its
> implementation as mentioned above.
> If making frequent changes to the interface and its implementation is
> really tiresome, we can also consider introducing a built-in collapsible
> Map or other type of attribute, like the SlotSharingGroup class in the
> org.apache.flink.api.common.operators package, to describe the specific
> collection of load values and types. This way, these loads are collapsed
> within the LoadingWeight's implementation and can be expanded when needed
> for use. Of course, we can also consider an implementation like the one you
> mentioned, introducing a method in WeightLoadable that returns a collection
> as the return type, so the load values are expanded at the calling site and
> then used. As I understand it, both approaches can achieve the goal.
>
> Of course, I also look forward to hearing others' suggestions. If there
> are any mistakes in my statement, please correct me.
> Looking forward to your reply.
>
> Best regards.
> Yuepeng Pan
>
> On 2023/10/11 08:44:51 xiangyu feng wrote:
> > Hi Yuepeng,
> >
> > Thanks for your reply.
> >
> > > Nice feedback. In fact, as mentioned in the Google Doc, the
> LoadingWeight
> > interface currently only includes a description of the number of tasks.
> So,
> > IIUC, If there is a need to further expand
> > > descriptions of other resource loads, we just extend it based on the
> > current interface and its implementations, right?
> >
> > I checked the interface design of LoadingWeight and WeightLoadable, AFAIK
> > currently it only supports comparing the load for one factor. If we want
> to
> > add more loading factors, LoadingWeight might need to add a 'LoadType'
> > field for distinction, WeightLoadable might need to return
> > Set<LoadingWeight>.
> >
> > I'm not sure I understand this correctly, WDYT?
> >
> > Regards,
> > Xiangyu
> >
> > Yuepeng Pan <panyuep...@apache.org> wrote on Wed, Oct 11, 2023 at 13:53:
> >
> > > Hi, xiangyu,
> > > Thanks for your attention as well.
> > >
> > > >1, About the waiting mechanism: Will the waiting mechanism happen
> only in
> > > >the second level 'assigning slots to TM'? IIUC, the first level
> 'assigning
> > > >Tasks to Slots' needs only the asynchronous slot result from slotpool.
> > >
> > > As described in the latest FLIP, the introduction of the waiting
> mechanism
> > > at the second level is to ensure that, in all deployment modes such as
> > > application, session, etc., we do not fall into a local greedy state
> when
> > > selecting the optimal slot position. This requires obtaining a global
> > > resource view to get the best result.
> > > IIUC, The allocation process from Task to Slot is the generation of a
> > > mapping relationship between two abstract descriptions, and at this
> point,
> > > there is no coupling of information between tasks/slots and Task
> Managers
> > > (TMs).
> > >
> > >
> > > >2, About the slot LoadingWeight: it is reasonable to use the number of
> > > >tasks by default in the beginning, but it would be better if this
> could be
> > > >easily extended in future to distinguish between CPU-intensive and
> > > >IO-intensive workloads. In some cases, TMs may have IO bottlenecks but
> > > >others have CPU bottlenecks.
> > >
> > > Nice feedback. In fact, as mentioned in the Google Doc, the
> LoadingWeight
> > > interface currently only includes a description of the number of
> tasks. So,
> > > IIUC, If there is a need to further expand descriptions of other
> resource
> > > loads, we just extend it based on the current interface and its
> > > implementations, right?
> > > Please correct me if I have misunderstood. Thanks a lot~
> > >
> > > Best,
> > > Yuepeng.
> > >
> > > On 2023/10/06 10:19:21 xiangyu feng wrote:
> > > > Thanks Yuepeng and Rui for driving this Discussion.
> > > >
> > > > Internally when we try to use Flink 1.17.1 in production, we are also
> > > > suffering from the unbalanced task distribution problem for jobs with
> > > high
> > > > qps and complex dag. So +1 for the overall proposal.
> > > >
> > > > Some questions about the details:
> > > >
> > > > 1, About the waiting mechanism: Will the waiting mechanism happen
> only in
> > > > the second level 'assigning slots to TM'?  IIUC, the first level
> > > 'assigning
> > > > Tasks to Slots' needs only the asynchronous slot result from
> slotpool.
> > > >
> > > > 2, About the slot LoadingWeight: it is reasonable to use the number
> of
> > > > tasks by default in the beginning, but it would be better if this
> could
> > > be
> > > > easily extended in future to distinguish between CPU-intensive and
> > > > IO-intensive workloads. In some cases, TMs may have IO bottlenecks
> but
> > > > others have CPU bottlenecks.
> > > >
> > > > Regards,
> > > > Xiangyu
> > > >
> > > >
> > > > Yuepeng Pan <panyuep...@apache.org> wrote on Thu, Oct 5, 2023 at 18:34:
> > > >
> > > > > Hi, Zhu Zhu,
> > > > >
> > > > > Thanks for your feedback!
> > > > >
> > > > > > I think we can introduce a new config option
> > > > > > `taskmanager.load-balance.mode`,
> > > > > > which accepts "None"/"Slots"/"Tasks".
> > > `cluster.evenly-spread-out-slots`
> > > > > > can be superseded by the "Slots" mode and get deprecated. In the
> > > future
> > > > > > it can support more mode, e.g. "CpuCores", to work better for
> jobs
> > > with
> > > > > > fine-grained resources. The proposed config option
> > > > > > `slot.request.max-interval`
> > > > > > then can be renamed to
> > > > > `taskmanager.load-balance.request-stablizing-timeout`
> > > > > > to show its relation with the feature. The proposed
> > > > > `slot.sharing-strategy`
> > > > > > is not needed, because the configured "Tasks" mode will do the
> work.
> > > > >
> > > > > The new proposed configuration option sounds good to me.
> > > > >
> > > > > > I have a small question. If we set the configuration value to
> > > > > > 'Tasks', it will initiate two processes: balancing the number of
> > > > > > tasks at the slot level and balancing the number of tasks across
> > > > > > TaskManagers (TMs).
> > > > > > Alternatively, if we configure it as 'Slots', the system will employ
> > > > > > the LocalPreferred allocation policy (which is the default) when
> > > > > > assigning tasks to slots, and it will ensure that the number of slots
> > > > > > used across TMs is balanced.
> > > > > > So does this configuration essentially combine a balanced selection
> > > > > > strategy across two dimensions into fixed configuration items?
> > > > >
> > > > > I would appreciate it if you could correct me if I've made any
> errors.
> > > > >
> > > > > Best,
> > > > > Yuepeng.
> > > > >
> > > >
> > >
> >
>
