That makes sense. I suggest we add a note to the FLIP to avoid confusion.

On Wed, Sep 18, 2019 at 9:51 AM Xintong Song <tonysong...@gmail.com> wrote:
> @tao
>
> I think we cannot limit the CPU usage of a slot, nor isolate usage between slots. We do have CPU limits for the task executor in some scenarios, such as on YARN with strict cgroup mode.
>
> The purpose of bookkeeping and dynamically allocating CPU cores is to prevent scheduling tasks with too much computational load onto a task executor, rather than to limit the CPU usage of each slot.
>
> Thank you~
>
> Xintong Song
>
> On Wed, Sep 18, 2019 at 12:18 AM tao xiao <xiaotao...@gmail.com> wrote:
>
> > Sorry if I ask a question that has been addressed before; please point me to the reference.
> >
> > How do we limit the CPU usage of a slot? Does the thread that executes the slot get paused when it uses more CPU cycles than it requested?
> >
> > On Tue, Sep 17, 2019 at 10:23 PM Xintong Song <tonysong...@gmail.com> wrote:
> >
> > > Thanks for the feedback, Andrey.
> > >
> > > I'll start the vote.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > > On Tue, Sep 17, 2019 at 10:09 PM Andrey Zagrebin <azagre...@apache.org> wrote:
> > >
> > > > Thanks for the update @Xintong.
> > > > I would be ok with starting the vote.
> > > >
> > > > Best,
> > > > Andrey
> > > >
> > > > On Tue, Sep 17, 2019 at 6:12 AM Xintong Song <tonysong...@gmail.com> wrote:
> > > >
> > > > > The implementation plan [1] is updated, with the following changes:
> > > > >
> > > > > - Add default slot resource profile to ResourceManagerGateway#registerTaskExecutor rather than #sendSlotReport.
> > > > > - Swap 'TaskExecutor derive and register with default slot resource profile' and 'Extend TaskExecutor to support dynamic slot allocation'
> > > > > - Add step for updating REST API / Web UI
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
> > > > >
> > > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation
> > > > >
> > > > > On Tue, Sep 17, 2019 at 11:49 AM Xintong Song <tonysong...@gmail.com> wrote:
> > > > >
> > > > > > @Till
> > > > > > Thanks for the reminder. I'll add a step for updating the web UI. I'll try to involve Lining to help us with this step.
> > > > > >
> > > > > > @Andrey
> > > > > > I was thinking that after we define the RM-TM interfaces in step 2, it would be good to work on the RM and TM sides concurrently. But yes, if we finish step 4 early, then it would make step 6 easier. We can start to have some IT/E2E tests once the default slot resource profiles are available.
> > > > > >
> > > > > > Thank you~
> > > > > >
> > > > > > Xintong Song
> > > > > >
> > > > > > On Mon, Sep 16, 2019 at 9:50 PM Andrey Zagrebin <and...@ververica.com> wrote:
> > > > > >
> > > > > > > @Xintong
> > > > > > >
> > > > > > > Thanks for the feedback.
> > > > > > >
> > > > > > > Just to clarify step 6:
> > > > > > > If the first point is done before step 5 (e.g. as part of 4) then it is just keeping the info about the default slot in the RM's data structure associated with the TM, with no real change in behaviour.
> > > > > > > When this info is available, I think it can be used straightforwardly during step 5, where we get either a concrete slot requirement or the unknown one (step 6, point 2), which simply grabs some of the concrete default ones (btw, it is not clear which one; it seems to be just a random one?).
> > > > > > >
> > > > > > > For steps 5 and 7, true, it is not quite clear whether we can avoid some split, e.g. after step 5 before doing step 7. I agree that we should introduce the feature flag if we clearly see that it would be a bigger effort without the flag.
> > > > > > >
> > > > > > > Best,
> > > > > > > Andrey
> > > > > > >
> > > > > > > On Mon, Sep 16, 2019 at 3:21 PM Till Rohrmann <trohrm...@apache.org> wrote:
> > > > > > >
> > > > > > > > One thing which was briefly mentioned in the FLIP but not in the implementation plan is the update of the web UI. I think it is worth adding an extra item for updating the web UI to properly display the resources a TM still has to offer with dynamic slot allocation. I guess we need to pull in some JavaScript help in order to implement this step.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Till
> > > > > > > >
> > > > > > > > On Mon, Sep 16, 2019 at 2:15 PM Xintong Song <tonysong...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Thanks for the comments, Andrey.
> > > > > > > > >
> > > > > > > > > - I agree that instead of ResourceManagerGateway#sendSlotReport, we should add the default slot resource profile to ResourceManagerGateway#registerTaskExecutor.
> > > > > > > > >
> > > > > > > > > - If I understand correctly, the reason you suggest doing the default slot resource profile first, and then doing step 3 in a way that supports both TaskExecutorGateway#requestSlot and TaskExecutorGateway#requestResource, is to avoid splitting code paths with the feature option? I think we can do that, but I also want to point out that this can only reduce the code split by the feature option (which is good), not eliminate it. We still need the feature option for the fundamental differences, e.g. creating new SlotIDs on allocation vs. allocating to free slots with existing SlotIDs.
> > > > > > > > >
> > > > > > > > > - I don't really think we can do steps 5, 6 and 7 independently. Basically they all make changes to the same component. We can probably do steps 6 and 7 independently, but I think they both depend on step 5.
> > > > > > > > >
> > > > > > > > > In general, I would say it's good to have as little code as possible split by the feature option, which makes the later clean-up easier. But if that cannot be done easily, I would rather not put too much effort into a good abstraction and deduplication between the new code path and the original one that we are removing soon.
> > > > > > > > >
> > > > > > > > > What do you think?
> > > > > > > > >
> > > > > > > > > Thank you~
> > > > > > > > >
> > > > > > > > > Xintong Song
> > > > > > > > >
> > > > > > > > > On Mon, Sep 16, 2019 at 5:59 PM Andrey Zagrebin <and...@ververica.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Xintong,
> > > > > > > > > >
> > > > > > > > > > Thanks for sharing the implementation steps. I also think they make sense with the feature option.
> > > > > > > > > >
> > > > > > > > > > I was wondering if we could order the steps so that each change does not affect other components too much and we always have a working system; then maybe the feature option does not always need to split the code. Here are some thoughts.
> > > > > > > > > >
> > > > > > > > > > - We could do the default slot profile first and include it in the TM registration. I would suggest adding it to ResourceManagerGateway#registerTaskExecutor, not sendSlotReport. This way the RM knows about it but does not use it at this point. (parts of steps 4, 6)
> > > > > > > > > >
> > > > > > > > > > - We could try to do step 3 first, in a way that also supports the current way of allocation in TaskExecutorGateway#requestSlot with the default slot profile and sends reports both with available resources and with free default slots which correspond to the available resources. We can just remove free default slots later.
> > > > > > > > > >   The new way, TaskExecutorGateway#requestResource, could also be implemented here but not used yet.
> > > > > > > > > >
> > > > > > > > > > - Then step 5 can use the new TaskExecutorGateway#requestResource and the default slot profile.
> > > > > > > > > >
> > > > > > > > > > - Not sure steps 5 and 7 can be implemented independently without regressing what we have. Maybe if we do step 7 first, it will initially have only default slots, which will simplify step 5 later.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Andrey
> > > > > > > > > >
> > > > > > > > > > On Mon, Sep 16, 2019 at 5:53 AM Xintong Song <tonysong...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks for the comments, Till and Wenlong.
> > > > > > > > > > >
> > > > > > > > > > > @Wenlong
> > > > > > > > > > > Regarding slot sharing, the general idea is to request a slot with resources for the tasks of the entire slot sharing group. Details can be found in FLIP-53 [1], regarding how to decide the slot sharing groups and how to manage task resources within the shared slots.
> > > > > > > > > > >
> > > > > > > > > > > Thank you~
> > > > > > > > > > >
> > > > > > > > > > > Xintong Song
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Sep 16, 2019 at 10:42 AM wenlong.lwl <wenlong88....@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi, Xintong, thanks for the great proposal. Big +1 for the feature!
> > > > > > > > > > > > It is something like the move from mapreduce-1.0 to mapreduce-2.0.
> > > > > > > > > > > >
> > > > > > > > > > > > I like the design on the whole. One point may need to be included in the proposal: how do we deal with slot sharing groups under dynamic slot allocation? It can be quite different with dynamic slot allocation.
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, 13 Sep 2019 at 16:42, Till Rohrmann <trohrm...@apache.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the update Xintong. From a high level perspective the implementation plan looks good to me.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > Till
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Sep 12, 2019 at 11:04 AM Xintong Song <tonysong...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Added implementation steps for this FLIP on the wiki page [1].
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Aug 20, 2019 at 3:43 PM Xintong Song <tonysong...@gmail.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > @Zili
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > As far as I know, Timo is drafting a FLIP that has taken the number 55. There is a counter maintained on the FLIP wiki page [1] that shows which number should be used for the next FLIP; whoever takes a number for a new FLIP should increase it.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > [1] https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Aug 20, 2019 at 3:28 AM Zili Chen <wander4...@gmail.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > We suddenly skipped FLIP-55 lol.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Mon, Aug 19, 2019 at 10:23 PM Xintong Song <tonysong...@gmail.com> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > We would like to start a discussion thread on "FLIP-56: Dynamic Slot Allocation" [1]. This was originally part of the discussion thread for "FLIP-53: Fine Grained Resource Management" [2]. As Till suggested, we would like to split the original discussion into two topics, and start a separate new discussion thread as well as a FLIP process for this one.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > [2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-53-Fine-Grained-Resource-Management-td31831.html
> >
> > --
> > Regards,
> > Tao

--
Regards,
Tao
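
For readers of this thread, here is a minimal sketch of the CPU bookkeeping Xintong describes above: the task executor only tracks how many CPU cores have been promised to slots and rejects requests that would exceed its total, and it does not throttle or isolate the threads that run inside a slot. The class and method names (CpuBookkeeper, tryAllocate, free, availableCpuCores) are hypothetical and chosen for illustration only; they are not Flink's actual API or the FLIP-56 implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical bookkeeping sketch; these names are assumptions, not Flink classes. */
class CpuBookkeeper {
    private final double totalCpuCores;
    private final Map<String, Double> allocatedBySlot = new HashMap<>();
    private double allocatedCpuCores = 0.0;

    CpuBookkeeper(double totalCpuCores) {
        this.totalCpuCores = totalCpuCores;
    }

    /**
     * Records a CPU reservation for a new slot. Returns false if the slot id is
     * already in use, the request is not positive, or the request does not fit.
     * Nothing here limits what the slot's threads actually consume at runtime.
     */
    boolean tryAllocate(String slotId, double requestedCpuCores) {
        if (requestedCpuCores <= 0 || allocatedBySlot.containsKey(slotId)) {
            return false;
        }
        if (allocatedCpuCores + requestedCpuCores > totalCpuCores) {
            return false; // scheduling this slot would overload the task executor
        }
        allocatedBySlot.put(slotId, requestedCpuCores);
        allocatedCpuCores += requestedCpuCores;
        return true;
    }

    /** Returns a slot's reserved CPU to the pool; unknown slot ids are ignored. */
    void free(String slotId) {
        Double released = allocatedBySlot.remove(slotId);
        if (released != null) {
            allocatedCpuCores -= released;
        }
    }

    double availableCpuCores() {
        return totalCpuCores - allocatedCpuCores;
    }
}

class CpuBookkeepingDemo {
    public static void main(String[] args) {
        CpuBookkeeper bookkeeper = new CpuBookkeeper(4.0);
        System.out.println("slot-1 accepted: " + bookkeeper.tryAllocate("slot-1", 2.5));
        System.out.println("slot-2 accepted: " + bookkeeper.tryAllocate("slot-2", 1.0));
        System.out.println("slot-3 accepted: " + bookkeeper.tryAllocate("slot-3", 1.0));
        System.out.println("available cores: " + bookkeeper.availableCpuCores());
    }
}
```

In this sketch the third request is refused purely because the bookkept total would be exceeded, which matches the point made in the thread: the accounting guards scheduling decisions, while any hard CPU limit would have to come from the environment (for example YARN with strict cgroup mode) at the task executor level.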