Thanks for the replies, Till and Xintong!

I've updated the FLIP, including:
- Edited the JavaDoc of the proposed
  StreamGraphGenerator#setSlotSharingGroupResource.
- Added a "Future Plan" section, which contains the potential follow-up
  issues and the limitations to be documented when fine-grained resource
  management is exposed to users.

I'll start a vote in another thread.

Best,
Yangze Guo

On Fri, Jan 29, 2021 at 10:07 PM Till Rohrmann <trohrm...@apache.org> wrote:
>
> Thanks for summarizing the discussion, Yangze. I agree that setting
> resource requirements per operator is not very user friendly. Moreover, I
> couldn't come up with a different proposal which would be as easy to use
> and wouldn't expose internal scheduling details. In fact, following this
> argument, we shouldn't have exposed the slot sharing groups in the
> first place.
>
> What is important for the user is that we properly document the limitations
> and constraints the fine-grained resource specification has. For example,
> we should explain how optimizations like chaining are affected by it and
> how different execution modes (batch vs. streaming) affect the execution of
> operators which have specified resources. These things shouldn't become
> part of the contract of this feature and are caused more by internal
> implementation details, but it will be important to understand them
> properly in order to use this feature effectively.
>
> Hence, +1 for starting the vote for this FLIP.
>
> Cheers,
> Till
>
> On Tue, Jan 26, 2021 at 4:37 AM Xintong Song <tonysong...@gmail.com> wrote:
>
> > Thanks for the summary, Yangze.
> >
> > The changes and follow-up issues LGTM. Let's wait for responses from the
> > others before starting a vote.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> > On Tue, Jan 26, 2021 at 11:08 AM Yangze Guo <karma...@gmail.com> wrote:
> >
> > > Thanks everyone for the lively discussion. I'd like to try to
> > > summarize the current convergence in the discussion. Please let me
> > > know if I got things wrong or missed something crucial here.
> > >
> > > Change of this FLIP:
> > > - Treat the SSG resource requirements as a hint instead of a
> > > restriction for the runtime. That should be explicitly explained in
> > > the JavaDocs.
> > >
> > > Potential follow-up issues if needed:
> > > - Provide operator-level resource configuration interface.
> > > - Provide multiple options for deciding resources for SSGs whose
> > > requirement is not specified:
> > > ** Default slot resource.
> > > ** Default operator resource times number of operators.
> > >
> > > If there are no other issues, I'll update the FLIP accordingly and
> > > start a vote thread. Thanks all for the valuable feedback again.
> > >
> > > Best,
> > > Yangze Guo
> > >
> > >
> > > On Fri, Jan 22, 2021 at 11:30 AM Xintong Song <tonysong...@gmail.com>
> > > wrote:
> > > >
> > > > FGRuntimeInterface.png
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > > On Fri, Jan 22, 2021 at 11:11 AM Xintong Song <tonysong...@gmail.com>
> > > > wrote:
> > > >>
> > > >> I think Chesnay's proposal could actually work. IIUC, the key point
> > > >> is to derive operator requirements from SSG requirements on the API
> > > >> side, so that the runtime only deals with operator requirements.
> > > >> It's debatable how the deriving should be done though. E.g., an
> > > >> alternative could be to evenly divide the SSG requirement into
> > > >> requirements of operators in the group.
> > > >>
> > > >> However, I'm not entirely sure which option is more desirable.
> > > >> Illustrating my understanding in the following figure, in which on
> > > >> the top is Chesnay's proposal and on the bottom is the SSG-based
> > > >> proposal in this FLIP.
> > > >>
> > > >> I think the major difference between the two approaches is where
> > > >> deriving operator requirements from SSG requirements happens.
> > > >>
> > > >> - Chesnay's proposal simplifies the runtime logic and the interface
> > > >> to expose, at the price of moving more complexity (i.e. the
> > > >> deriving) to the API side. The question is, where do we prefer to
> > > >> keep the complexity? I'm slightly leaning towards having a thin API
> > > >> and keeping the complexity in the runtime if possible.
> > > >>
> > > >> - Notice that the dashed arrows represent optional steps that are
> > > >> needed only for schedulers that do not respect SSGs, which we don't
> > > >> have at the moment. If we only look at the solid arrows, then the
> > > >> SSG-based approach is much simpler, without needing to derive and
> > > >> aggregate the requirements back and forth. I'm not sure about
> > > >> complicating the current design only for potential future needs.
> > > >>
> > > >>
> > > >> Thank you~
> > > >>
> > > >> Xintong Song
> > > >>
> > > >>
> > > >> On Fri, Jan 22, 2021 at 7:35 AM Chesnay Schepler <ches...@apache.org>
> > > >> wrote:
> > > >>>
> > > >>> You're raising a good point, but I think I can rectify that with a
> > > >>> minor adjustment.
> > > >>>
> > > >>> Default requirements are whatever the default requirements are;
> > > >>> setting the requirements for one operator has no effect on other
> > > >>> operators.
> > > >>>
> > > >>> With these rules, and some API enhancements, the following mockup
> > > >>> would replicate the SSG-based behavior:
> > > >>>
> > > >>> Map<SlotSharingGroupId, Requirements> requirements = ...
> > > >>> for slotSharingGroup in env.getSlotSharingGroups() {
> > > >>>      vertices = slotSharingGroup.getVertices()
> > > >>>      vertices.first().setRequirements(requirements.get(slotSharingGroup.getID()))
> > > >>>      vertices.remaining().setRequirements(ZERO)
> > > >>> }
> > > >>>
> > > >>> We could even allow setting requirements on slotsharing-groups or
> > > >>> colocation-groups and internally translate them accordingly.
> > > >>> I can't help but feel this is a plain API issue.
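
The derivation in the mockup above (the whole SSG requirement goes to the first vertex, ZERO to the rest) can be sketched in a few lines. This is only an illustration under stated assumptions: `Requirements`, `first_takes_all` and `aggregate` are hypothetical names invented for this sketch and are not Flink classes or APIs. The vertex and group names are made up as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    """Hypothetical resource spec for illustration; not a Flink class."""
    cpu_cores: float = 0.0
    memory_mb: int = 0

    def __add__(self, other):
        return Requirements(self.cpu_cores + other.cpu_cores,
                            self.memory_mb + other.memory_mb)


ZERO = Requirements()


def first_takes_all(group_requirements, groups):
    """Give each group's whole requirement to its first vertex, ZERO to the rest."""
    per_vertex = {}
    for group_id, vertices in groups.items():
        for index, vertex in enumerate(vertices):
            per_vertex[vertex] = group_requirements[group_id] if index == 0 else ZERO
    return per_vertex


def aggregate(per_vertex, groups):
    """Sum the vertex requirements back into one requirement per group."""
    return {
        group_id: sum((per_vertex[v] for v in vertices), ZERO)
        for group_id, vertices in groups.items()
    }


# Made-up job: two slot sharing groups with two vertices each.
groups = {"ssg_1": ["source", "map"], "ssg_2": ["agg", "sink"]}
group_requirements = {
    "ssg_1": Requirements(cpu_cores=2.0, memory_mb=1024),
    "ssg_2": Requirements(cpu_cores=1.0, memory_mb=512),
}

per_vertex = first_takes_all(group_requirements, groups)
print(per_vertex["source"], per_vertex["map"])
# Aggregating per group recovers the original SSG requirements.
print(aggregate(per_vertex, groups) == group_requirements)
```

In this sketch, a scheduler that still respects SSGs would see the same group totals after aggregation, which is the sense in which the mockup replicates the SSG-based behavior. Any other derivation that preserves the group sum, such as the even split mentioned earlier in the thread, would satisfy the same check.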
> > > >>>
> > > >>> On 1/21/2021 9:44 AM, Till Rohrmann wrote:
> > > >>> > If I understand you correctly Chesnay, then you want to decouple
> > > >>> > the resource requirement specification from the slot sharing
> > > >>> > group assignment. Hence, per default all operators would be in
> > > >>> > the same slot sharing group. If there is no operator with a
> > > >>> > resource specification, then the system would allocate a default
> > > >>> > slot for it. If there is at least one operator, then the system
> > > >>> > would sum up all the specified resources and allocate a slot of
> > > >>> > this size. This effectively means that all unspecified operators
> > > >>> > will implicitly have a zero resource requirement. Did I
> > > >>> > understand your idea correctly?
> > > >>> >
> > > >>> > I am wondering whether this wouldn't lead to a surprising
> > > >>> > behaviour for the user. If the user specifies the resource
> > > >>> > requirements for a single operator, then he probably will assume
> > > >>> > that the other operators will get the default share of resources
> > > >>> > and not nothing.
> > > >>> >
> > > >>> > Cheers,
> > > >>> > Till
> > > >>> >
> > > >>> > On Thu, Jan 21, 2021 at 3:25 AM Chesnay Schepler
> > > >>> > <ches...@apache.org> wrote:
> > > >>> >
> > > >>> >     Is there even a functional difference between specifying the
> > > >>> >     requirements for an SSG vs specifying the same requirements
> > > >>> >     on a single operator within that group (ideally a colocation
> > > >>> >     group to avoid this whole hint business)?
> > > >>> >
> > > >>> >     Wouldn't we get the best of both worlds in the latter case?
> > > >>> >
> > > >>> >     Users can take shortcuts to define shared requirements,
> > > >>> >     but refine them further as needed on a per-operator basis,
> > > >>> >     without changing semantics of slotsharing groups
> > > >>> >     nor the runtime being locked into SSG-based requirements.
> > > >>> > > > > >>> > (And before anyone argues what happens if slotsharing groups > > > >>> > change or > > > >>> > whatnot, that's a plain API issue that we could surely solve. > > (A > > > >>> > plain > > > >>> > iteration over slotsharing groups and therein contained > > operators > > > >>> > would > > > >>> > suffice)). > > > >>> > > > > >>> > On 1/20/2021 6:48 PM, Till Rohrmann wrote: > > > >>> > > Maybe a different minor idea: Would it be possible to treat > > > the SSG > > > >>> > > resource requirements as a hint for the runtime similar to > > how > > > >>> > slot sharing > > > >>> > > groups are designed at the moment? Meaning that we don't give > > > >>> > the guarantee > > > >>> > > that Flink will always deploy this set of tasks together no > > > >>> > matter what > > > >>> > > comes. If, for example, the runtime can derive by some means > > > the > > > >>> > resource > > > >>> > > requirements for each task based on the requirements for the > > > >>> > SSG, this > > > >>> > > could be possible. One easy strategy would be to give every > > > task > > > >>> > the same > > > >>> > > resources as the whole slot sharing group. Another one could > > be > > > >>> > > distributing the resources equally among the tasks. This does > > > >>> > not even have > > > >>> > > to be implemented but we would give ourselves the freedom to > > > change > > > >>> > > scheduling if need should arise. > > > >>> > > > > > >>> > > Cheers, > > > >>> > > Till > > > >>> > > > > > >>> > > On Wed, Jan 20, 2021 at 7:04 AM Yangze Guo < > > karma...@gmail.com > > > >>> > <mailto:karma...@gmail.com>> wrote: > > > >>> > > > > > >>> > >> Thanks for the responses, Till and Xintong. > > > >>> > >> > > > >>> > >> I second Xintong's comment that SSG-based runtime interface > > > >>> > will give > > > >>> > >> us the flexibility to achieve op/task-based approach. That's > > > one of > > > >>> > >> the most important reasons for our design choice. 
> > > >>> > >> > > > >>> > >> Some cents regarding the default operator resource: > > > >>> > >> - It might be good for the scenario of DataStream jobs. > > > >>> > >> ** For light-weight operators, the accumulative > > > >>> > configuration error > > > >>> > >> will not be significant. Then, the resource of a task used > > is > > > >>> > >> proportional to the number of operators it contains. > > > >>> > >> ** For heavy operators like join and window or operators > > > >>> > using the > > > >>> > >> external resources, user will turn to the fine-grained > > > resource > > > >>> > >> configuration. > > > >>> > >> - It can increase the stability for the standalone cluster > > > >>> > where task > > > >>> > >> executors registered are heterogeneous(with different > > default > > > slot > > > >>> > >> resources). > > > >>> > >> - It might not be good for SQL users. The operators that SQL > > > >>> > will be > > > >>> > >> transferred to is a black box to the user. We also do not > > > guarantee > > > >>> > >> the cross-version of consistency of the transformation so > > far. > > > >>> > >> > > > >>> > >> I think it can be treated as a follow-up work when the > > > fine-grained > > > >>> > >> resource management is end-to-end ready. > > > >>> > >> > > > >>> > >> Best, > > > >>> > >> Yangze Guo > > > >>> > >> > > > >>> > >> > > > >>> > >> On Wed, Jan 20, 2021 at 11:16 AM Xintong Song > > > >>> > <tonysong...@gmail.com <mailto:tonysong...@gmail.com>> > > > >>> > >> wrote: > > > >>> > >>> Thanks for the feedback, Till. > > > >>> > >>> > > > >>> > >>> ## I feel that what you proposed (operator-based + default > > > >>> > value) might > > > >>> > >> be > > > >>> > >>> subsumed by the SSG-based approach. > > > >>> > >>> Thinking of op_1 -> op_2, there are the following 4 cases, > > > >>> > categorized by > > > >>> > >>> whether the resource requirements are known to the users. > > > >>> > >>> > > > >>> > >>> 1. 
*Both known.* As previously mentioned, there's no > > > >>> > reason to put > > > >>> > >>> multiple operators whose individual resource > > requirements > > > >>> > are already > > > >>> > >> known > > > >>> > >>> into the same group in fine-grained resource > > management. > > > >>> > And if op_1 > > > >>> > >> and > > > >>> > >>> op_2 are in different groups, there should be no > > problem > > > >>> > switching > > > >>> > >> data > > > >>> > >>> exchange mode from pipelined to blocking. This is > > > >>> > equivalent to > > > >>> > >> specifying > > > >>> > >>> operator resource requirements in your proposal. > > > >>> > >>> 2. *op_1 known, op_2 unknown.* Similar to 1), except > > that > > > >>> > op_2 is in a > > > >>> > >>> SSG whose resource is not specified thus would have the > > > >>> > default slot > > > >>> > >>> resource. This is equivalent to having default operator > > > >>> > resources in > > > >>> > >> your > > > >>> > >>> proposal. > > > >>> > >>> 3. *Both unknown*. The user can either set op_1 and > > op_2 > > > >>> > to the same > > > >>> > >> SSG > > > >>> > >>> or separate SSGs. > > > >>> > >>> - If op_1 and op_2 are in the same SSG, it will be > > > >>> > equivalent to > > > >>> > >> the > > > >>> > >>> coarse-grained resource management, where op_1 and > > > op_2 > > > >>> > share a > > > >>> > >> default > > > >>> > >>> size slot no matter which data exchange mode is > > used. > > > >>> > >>> - If op_1 and op_2 are in different SSGs, then each > > of > > > >>> > them will > > > >>> > >> use > > > >>> > >>> a default size slot. This is equivalent to setting > > > them > > > >>> > with > > > >>> > >> default > > > >>> > >>> operator resources in your proposal. > > > >>> > >>> 4. 
*Total (pipeline) or max (blocking) of op_1 and op_2 > > > is > > > >>> > known.* > > > >>> > >>> - It is possible that the user learns the total / > > max > > > >>> > resource > > > >>> > >>> requirement from executing and monitoring the job, > > > >>> > while not > > > >>> > >>> being aware of > > > >>> > >>> individual operator requirements. > > > >>> > >>> - I believe this is the case your proposal does not > > > >>> > cover. And TBH, > > > >>> > >>> this is probably how most users learn the resource > > > >>> > requirements, > > > >>> > >>> according > > > >>> > >>> to my experiences. > > > >>> > >>> - In this case, the user might need to specify > > > >>> > different resources > > > >>> > >> if > > > >>> > >>> he wants to switch the execution mode, which should > > > not > > > >>> > be worse > > > >>> > >> than not > > > >>> > >>> being able to use fine-grained resource management. > > > >>> > >>> > > > >>> > >>> > > > >>> > >>> ## An additional idea inspired by your proposal. > > > >>> > >>> We may provide multiple options for deciding resources for > > > >>> > SSGs whose > > > >>> > >>> requirement is not specified, if needed. > > > >>> > >>> > > > >>> > >>> - Default slot resource (current design) > > > >>> > >>> - Default operator resource times number of operators > > > >>> > (equivalent to > > > >>> > >>> your proposal) > > > >>> > >>> > > > >>> > >>> > > > >>> > >>> ## Exposing internal runtime strategies > > > >>> > >>> Theoretically, yes. Tying to the SSGs, the resource > > > >>> > requirements might be > > > >>> > >>> affected if how SSGs are internally handled changes in > > > future. > > > >>> > >> Practically, > > > >>> > >>> I do not concretely see at the moment what kind of changes > > we > > > >>> > may want in > > > >>> > >>> future that might conflict with this FLIP proposal, as the > > > >>> > question of > > > >>> > >>> switching data exchange mode answered above. 
I'd suggest to > > > >>> > not give up > > > >>> > >> the > > > >>> > >>> user friendliness we may gain now for the future problems > > > that > > > >>> > may or may > > > >>> > >>> not exist. > > > >>> > >>> > > > >>> > >>> Moreover, the SSG-based approach has the flexibility to > > > >>> > achieve the > > > >>> > >>> equivalent behavior as the operator-based approach, if we > > > set each > > > >>> > >> operator > > > >>> > >>> (or task) to a separate SSG. We can even provide a shortcut > > > >>> > option to > > > >>> > >>> automatically do that for users, if needed. > > > >>> > >>> > > > >>> > >>> > > > >>> > >>> Thank you~ > > > >>> > >>> > > > >>> > >>> Xintong Song > > > >>> > >>> > > > >>> > >>> > > > >>> > >>> > > > >>> > >>> On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann > > > >>> > <trohrm...@apache.org <mailto:trohrm...@apache.org>> > > > >>> > >> wrote: > > > >>> > >>>> Thanks for the responses Xintong and Stephan, > > > >>> > >>>> > > > >>> > >>>> I agree that being able to define the resource > > requirements > > > for a > > > >>> > >> group of > > > >>> > >>>> operators is more user friendly. However, my concern is > > that > > > >>> > we are > > > >>> > >>>> exposing thereby internal runtime strategies which might > > > >>> > limit our > > > >>> > >>>> flexibility to execute a given job. Moreover, the > > semantics > > > of > > > >>> > >> configuring > > > >>> > >>>> resource requirements for SSGs could break if switching > > from > > > >>> > streaming > > > >>> > >> to > > > >>> > >>>> batch execution. If one defines the resource requirements > > > for > > > >>> > op_1 -> > > > >>> > >> op_2 > > > >>> > >>>> which run in pipelined mode when using the streaming > > > >>> > execution, then > > > >>> > >> how do > > > >>> > >>>> we interpret these requirements when op_1 -> op_2 are > > > >>> > executed with a > > > >>> > >>>> blocking data exchange in batch execution mode? 
> > > Consequently, > > > >>> > I am > > > >>> > >> still > > > >>> > >>>> leaning towards Stephan's proposal to set the resource > > > >>> > requirements per > > > >>> > >>>> operator. > > > >>> > >>>> > > > >>> > >>>> Maybe the following proposal makes the configuration > > easier: > > > >>> > If the > > > >>> > >> user > > > >>> > >>>> wants to use fine-grained resource requirements, then she > > > >>> > needs to > > > >>> > >> specify > > > >>> > >>>> the default size which is used for operators which have no > > > >>> > explicit > > > >>> > >>>> resource annotation. If this holds true, then every > > operator > > > >>> > would > > > >>> > >> have a > > > >>> > >>>> resource requirement and the system can try to execute the > > > >>> > operators > > > >>> > >> in the > > > >>> > >>>> best possible manner w/o being constrained by how the user > > > >>> > set the SSG > > > >>> > >>>> requirements. > > > >>> > >>>> > > > >>> > >>>> Cheers, > > > >>> > >>>> Till > > > >>> > >>>> > > > >>> > >>>> On Tue, Jan 19, 2021 at 9:09 AM Xintong Song > > > >>> > <tonysong...@gmail.com <mailto:tonysong...@gmail.com>> > > > >>> > >>>> wrote: > > > >>> > >>>> > > > >>> > >>>>> Thanks for the feedback, Stephan. > > > >>> > >>>>> > > > >>> > >>>>> Actually, your proposal has also come to my mind at some > > > >>> > point. And I > > > >>> > >>>> have > > > >>> > >>>>> some concerns about it. > > > >>> > >>>>> > > > >>> > >>>>> > > > >>> > >>>>> 1. It does not give users the same control as the > > SSG-based > > > >>> > approach. > > > >>> > >>>>> > > > >>> > >>>>> > > > >>> > >>>>> While both approaches do not require specifying for each > > > >>> > operator, > > > >>> > >>>>> SSG-based approach supports the semantic that "some > > > operators > > > >>> > >> together > > > >>> > >>>> use > > > >>> > >>>>> this much resource" while the operator-based approach > > > doesn't. 
> > > >>> > >>>>>
> > > >>> > >>>>> Think of a long pipeline with m operators (o_1, o_2, ..., o_m),
> > > >>> > >>>>> and at some point there's an agg o_n (1 < n < m) which
> > > >>> > >>>>> significantly reduces the data amount. One can separate the
> > > >>> > >>>>> pipeline into 2 groups SSG_1 (o_1, ..., o_n) and SSG_2
> > > >>> > >>>>> (o_n+1, ..., o_m), so that configuring much higher parallelisms
> > > >>> > >>>>> for operators in SSG_1 than for operators in SSG_2 won't lead
> > > >>> > >>>>> to too much wasting of resources. If the two SSGs end up
> > > >>> > >>>>> needing different resources, with the SSG-based approach one
> > > >>> > >>>>> can directly specify resources for the two groups. However,
> > > >>> > >>>>> with the operator-based approach, the user will have to
> > > >>> > >>>>> specify resources for each operator in one of the two groups,
> > > >>> > >>>>> and tune the default slot resource via configurations to fit
> > > >>> > >>>>> the other group.
> > > >>> > >>>>>
> > > >>> > >>>>>
> > > >>> > >>>>> 2. It increases the chance of breaking operator chains.
> > > >>> > >>>>>
> > > >>> > >>>>>
> > > >>> > >>>>> Setting chainable operators into different slot sharing groups
> > > >>> > >>>>> will prevent them from being chained. In the current
> > > >>> > >>>>> implementation, downstream operators, if the SSG is not
> > > >>> > >>>>> explicitly specified, will be set to the same group as the
> > > >>> > >>>>> chainable upstream operators (unless there are multiple
> > > >>> > >>>>> upstream operators in different groups), to reduce the chance
> > > >>> > >>>>> of breaking chains.
> > > >>> > >>>>>
> > > >>> > >>>>>
> > > >>> > >>>>> Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_4,
> > > >>> > >>>>> deciding SSGs based on whether resource is specified we will
> > > >>> > >>>>> easily get groups like (o_1, o_3) & (o_2, o_4), where none of
> > > >>> > >>>>> the operators can be chained. This is also possible for the
> > > >>> > >>>>> SSG-based approach, but I believe the chance is much smaller
> > > >>> > >>>>> because there's no strong reason for users to specify the
> > > >>> > >>>>> groups with alternate operators like that. We are more likely
> > > >>> > >>>>> to get groups like (o_1, o_2) & (o_3, o_4), where the chain
> > > >>> > >>>>> breaks only between o_2 and o_3.
> > > >>> > >>>>>
> > > >>> > >>>>>
> > > >>> > >>>>> 3. It complicates the system by having two different
> > > >>> > >>>>> mechanisms for sharing managed memory in a slot.
> > > >>> > >>>>>
> > > >>> > >>>>>
> > > >>> > >>>>> - In FLIP-141, we introduced the intra-slot managed memory
> > > >>> > >>>>> sharing mechanism, where managed memory is first distributed
> > > >>> > >>>>> according to the consumer type, then further distributed
> > > >>> > >>>>> across operators of that consumer type.
> > > >>> > >>>>>
> > > >>> > >>>>> - With the operator-based approach, managed memory size
> > > >>> > >>>>> specified for an operator should account for all the consumer
> > > >>> > >>>>> types of that operator. That means the managed memory is
> > > >>> > >>>>> first distributed across operators, then distributed to
> > > >>> > >>>>> different consumer types of each operator.
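
To make the two distribution orders in point 3 concrete, here is a toy comparison. It is a simplified model rather than Flink's actual memory calculation: the consumer type names, the weights, the equal split across operators of one type, and the 50/50 per-operator sizes are all assumptions chosen for illustration.

```python
def slot_first(slot_memory_mb, consumer_weights, operators):
    """Split slot memory by consumer type, then equally across its operators."""
    used_types = {t for types in operators.values() for t in types}
    total_weight = sum(consumer_weights[t] for t in used_types)
    result = {op: {} for op in operators}
    for consumer_type in used_types:
        type_share = slot_memory_mb * consumer_weights[consumer_type] / total_weight
        users = [op for op, types in operators.items() if consumer_type in types]
        for op in users:
            result[op][consumer_type] = type_share / len(users)
    return result


def operator_first(operator_memory_mb, consumer_weights, operators):
    """Start from per-operator memory, then split it by that operator's types."""
    result = {}
    for op, types in operators.items():
        total_weight = sum(consumer_weights[t] for t in types)
        result[op] = {
            t: operator_memory_mb[op] * consumer_weights[t] / total_weight
            for t in types
        }
    return result


def total_by_type(allocation):
    """Sum the memory each consumer type receives over all operators."""
    totals = {}
    for shares in allocation.values():
        for consumer_type, amount in shares.items():
            totals[consumer_type] = totals.get(consumer_type, 0.0) + amount
    return totals


# Hypothetical setup: op_1 uses both consumer types, op_2 uses only type_a.
weights = {"type_a": 70, "type_b": 30}
operators = {"op_1": ["type_a", "type_b"], "op_2": ["type_a"]}

by_slot = slot_first(100, weights, operators)
by_operator = operator_first({"op_1": 50, "op_2": 50}, weights, operators)

print(total_by_type(by_slot))
print(total_by_type(by_operator))
```

In this toy setup the same 100 MB and the same weights give type_a 70 MB in total when the split starts at the slot, but 85 MB when it starts at the operators, because the weights are applied within a different scope in each case.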
> > > >>> > >>>>> > > > >>> > >>>>> > > > >>> > >>>>> Unfortunately, the different order of the two calculation > > > >>> > steps can > > > >>> > >> lead > > > >>> > >>>> to > > > >>> > >>>>> different results. To be specific, the semantic of the > > > >>> > configuration > > > >>> > >>>> option > > > >>> > >>>>> `consumer-weights` changed (within a slot vs. within an > > > >>> > operator). > > > >>> > >>>>> > > > >>> > >>>>> > > > >>> > >>>>> > > > >>> > >>>>> To sum up things: > > > >>> > >>>>> > > > >>> > >>>>> While (3) might be a bit more implementation related, I > > > >>> > think (1) > > > >>> > >> and (2) > > > >>> > >>>>> somehow suggest that, the price for the proposed approach > > > to > > > >>> > avoid > > > >>> > >>>>> specifying resource for every operator is that it's not > > as > > > >>> > >> independent > > > >>> > >>>> from > > > >>> > >>>>> operator chaining and slot sharing as the operator-based > > > >>> > approach > > > >>> > >>>> discussed > > > >>> > >>>>> in the FLIP. > > > >>> > >>>>> > > > >>> > >>>>> > > > >>> > >>>>> Thank you~ > > > >>> > >>>>> > > > >>> > >>>>> Xintong Song > > > >>> > >>>>> > > > >>> > >>>>> > > > >>> > >>>>> > > > >>> > >>>>> On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen > > > >>> > <se...@apache.org <mailto:se...@apache.org>> > > > >>> > >> wrote: > > > >>> > >>>>>> Thanks a lot, Yangze and Xintong for this FLIP. > > > >>> > >>>>>> > > > >>> > >>>>>> I want to say, first of all, that this is super well > > > >>> > written. And > > > >>> > >> the > > > >>> > >>>>>> points that the FLIP makes about how to expose the > > > >>> > configuration to > > > >>> > >>>> users > > > >>> > >>>>>> is exactly the right thing to figure out first. > > > >>> > >>>>>> So good job here! > > > >>> > >>>>>> > > > >>> > >>>>>> About how to let users specify the resource profiles. 
> > If I > > > >>> > can sum > > > >>> > >> the > > > >>> > >>>>> FLIP > > > >>> > >>>>>> and previous discussion up in my own words, the problem > > > is the > > > >>> > >>>> following: > > > >>> > >>>>>> Operator-level specification is the simplest and > > cleanest > > > >>> > approach, > > > >>> > >>>>> because > > > >>> > >>>>>>> it avoids mixing operator configuration (resource) and > > > >>> > >> scheduling. No > > > >>> > >>>>>>> matter what other parameters change (chaining, slot > > > sharing, > > > >>> > >>>> switching > > > >>> > >>>>>>> pipelined and blocking shuffles), the resource profiles > > > >>> > stay the > > > >>> > >>>> same. > > > >>> > >>>>>>> But it would require that a user specifies resources on > > > all > > > >>> > >>>> operators, > > > >>> > >>>>>>> which makes it hard to use. That's why the FLIP > > suggests > > > going > > > >>> > >> with > > > >>> > >>>>>>> specifying resources on a Sharing-Group. > > > >>> > >>>>>> > > > >>> > >>>>>> I think both thoughts are important, so can we find a > > > solution > > > >>> > >> where > > > >>> > >>>> the > > > >>> > >>>>>> Resource Profiles are specified on an Operator, but we > > > >>> > still avoid > > > >>> > >> that > > > >>> > >>>>> we > > > >>> > >>>>>> need to specify a resource profile on every operator? > > > >>> > >>>>>> > > > >>> > >>>>>> What do you think about something like the following: > > > >>> > >>>>>> - Resource Profiles are specified on an operator > > level. > > > >>> > >>>>>> - Not all operators need profiles > > > >>> > >>>>>> - All Operators without a Resource Profile ended up > > in > > > the > > > >>> > >> default > > > >>> > >>>> slot > > > >>> > >>>>>> sharing group with a default profile (will get a default > > > slot). > > > >>> > >>>>>> - All Operators with a Resource Profile will go into > > > >>> > another slot > > > >>> > >>>>> sharing > > > >>> > >>>>>> group (the resource-specified-group). 
> > > >>> > >>>>>> - Users can define different slot sharing groups for > > > >>> > operators > > > >>> > >> like > > > >>> > >>>>> they > > > >>> > >>>>>> do now, with the exception that you cannot mix operators > > > >>> > that have > > > >>> > >> a > > > >>> > >>>>>> resource profile and operators that have no resource > > > profile. > > > >>> > >>>>>> - The default case where no operator has a resource > > > >>> > profile is > > > >>> > >> just a > > > >>> > >>>>>> special case of this model > > > >>> > >>>>>> - The chaining logic sums up the profiles per > > operator, > > > >>> > like it > > > >>> > >> does > > > >>> > >>>>> now, > > > >>> > >>>>>> and the scheduler sums up the profiles of the tasks that > > > it > > > >>> > >> schedules > > > >>> > >>>>>> together. > > > >>> > >>>>>> > > > >>> > >>>>>> > > > >>> > >>>>>> There is another question about reactive scaling raised > > > in the > > > >>> > >> FLIP. I > > > >>> > >>>>> need > > > >>> > >>>>>> to think a bit about that. That is indeed a bit more > > > tricky > > > >>> > once we > > > >>> > >>>> have > > > >>> > >>>>>> slots of different sizes. > > > >>> > >>>>>> It is not clear then which of the different slot > > requests > > > the > > > >>> > >>>>>> ResourceManager should fulfill when new resources (TMs) > > > >>> > show up, > > > >>> > >> or how > > > >>> > >>>>> the > > > >>> > >>>>>> JobManager redistributes the slots resources when > > > resources > > > >>> > (TMs) > > > >>> > >>>>> disappear > > > >>> > >>>>>> This question is pretty orthogonal, though, to the "how > > to > > > >>> > specify > > > >>> > >> the > > > >>> > >>>>>> resources". 
> > > >>> > >>>>>> > > > >>> > >>>>>> > > > >>> > >>>>>> Best, > > > >>> > >>>>>> Stephan > > > >>> > >>>>>> > > > >>> > >>>>>> On Fri, Jan 8, 2021 at 5:14 AM Xintong Song > > > >>> > <tonysong...@gmail.com <mailto:tonysong...@gmail.com> > > > >>> > >>>>> wrote: > > > >>> > >>>>>>> Thanks for drafting the FLIP and driving the > > discussion, > > > >>> > Yangze. > > > >>> > >>>>>>> And Thanks for the feedback, Till and Chesnay. > > > >>> > >>>>>>> > > > >>> > >>>>>>> @Till, > > > >>> > >>>>>>> > > > >>> > >>>>>>> I agree that specifying requirements for SSGs means > > that > > > SSGs > > > >>> > >> need to > > > >>> > >>>>> be > > > >>> > >>>>>>> supported in fine-grained resource management, > > otherwise > > > each > > > >>> > >>>> operator > > > >>> > >>>>>>> might use as many resources as the whole group. > > However, > > > I > > > >>> > cannot > > > >>> > >>>> think > > > >>> > >>>>>> of > > > >>> > >>>>>>> a strong reason for not supporting SSGs in fine-grained > > > >>> > resource > > > >>> > >>>>>>> management. > > > >>> > >>>>>>> > > > >>> > >>>>>>> > > > >>> > >>>>>>>> Interestingly, if all operators have their resources > > > properly > > > >>> > >>>>>> specified, > > > >>> > >>>>>>>> then slot sharing is no longer needed because Flink > > > could > > > >>> > >> slice off > > > >>> > >>>>> the > > > >>> > >>>>>>>> appropriately sized slots for every Task individually. > > > >>> > >>>>>>>> > > > >>> > >>>>>>> So for example, if we have a job consisting of two > > > >>> > operator op_1 > > > >>> > >> and > > > >>> > >>>>> op_2 > > > >>> > >>>>>>>> where each op needs 100 MB of memory, we would then > > say > > > that > > > >>> > >> the > > > >>> > >>>> slot > > > >>> > >>>>>>>> sharing group needs 200 MB of memory to run. If we > > have > > > a > > > >>> > >> cluster > > > >>> > >>>>> with > > > >>> > >>>>>> 2 > > > >>> > >>>>>>>> TMs with one slot of 100 MB each, then the system > > > cannot run > > > >>> > >> this > > > >>> > >>>>> job. 
> > > >>> > >>>>>> If > > > >>> > >>>>>>>> the resources were specified on an operator level, > > then > > > the > > > >>> > >> system > > > >>> > >>>>>> could > > > >>> > >>>>>>>> still make the decision to deploy op_1 to TM_1 and > > op_2 > > > to > > > >>> > >> TM_2. > > > >>> > >>>>>>> > > > >>> > >>>>>>> Couldn't agree more that if all operators' requirements > > > are > > > >>> > >> properly > > > >>> > >>>>>>> specified, slot sharing should be no longer needed. I > > > >>> > think this > > > >>> > >>>>> exactly > > > >>> > >>>>>>> disproves the example. If we already know op_1 and op_2 > > > each > > > >>> > >> needs > > > >>> > >>>> 100 > > > >>> > >>>>> MB > > > >>> > >>>>>>> of memory, why would we put them in the same group? If > > > >>> > they are > > > >>> > >> in > > > >>> > >>>>>> separate > > > >>> > >>>>>>> groups, with the proposed approach the system can > > freely > > > >>> > deploy > > > >>> > >> them > > > >>> > >>>> to > > > >>> > >>>>>>> either a 200 MB TM or two 100 MB TMs. > > > >>> > >>>>>>> > > > >>> > >>>>>>> Moreover, the precondition for not needing slot sharing > > > is > > > >>> > having > > > >>> > >>>>>> resource > > > >>> > >>>>>>> requirements properly specified for all operators. This > > > is not > > > >>> > >> always > > > >>> > >>>>>>> possible, and usually requires tremendous efforts. One > > > of the > > > >>> > >>>> benefits > > > >>> > >>>>>> for > > > >>> > >>>>>>> SSG-based requirements is that it allows the user to > > > freely > > > >>> > >> decide > > > >>> > >>>> the > > > >>> > >>>>>>> granularity, thus efforts they want to pay. I would > > > >>> > consider SSG > > > >>> > >> in > > > >>> > >>>>>>> fine-grained resource management as a group of > > operators > > > >>> > that the > > > >>> > >>>> user > > > >>> > >>>>>>> would like to specify the total resource for. 
>>>>>>> There can be only one group in the job, 2~3 groups dividing the job into a few major parts, or as many groups as the number of tasks/operators, depending on how fine-grained the user is able to specify the resources.
>>>>>>>
>>>>>>> Having to support SSGs might be a constraint. But given that all the current scheduler implementations already support SSGs, I tend to think of that as an acceptable price for the above-discussed usability and flexibility.
>>>>>>>
>>>>>>> @Chesnay
>>>>>>>
>>>>>>>> Will declaring them on slot sharing groups not also waste resources if the parallelism of operators within that group differs?
>>>>>>>>
>>>>>>> Yes. It's a trade-off between usability and resource utilization. To avoid such waste, the user can define more groups, so that each group contains fewer operators and the chance of having operators with different parallelism is reduced. The price is having more resource requirements to specify.
>>>>>>>
>>>>>>>> It also seems like quite a hassle for users having to recalculate the resource requirements if they change the slot sharing.
>>>>>>>> I'd think that it's not really workable for users that create a set of re-usable operators which are mixed and matched in their applications; managing the resource requirements in such a setting would be a nightmare, and in the end would require operator-level requirements anyway.
>>>>>>>> In that sense, I'm not even sure whether it really increases usability.
>>>>>>>
>>>>>>> - As mentioned in my reply to Till's comment, there's no reason to put multiple operators whose individual resource requirements are already known into the same group in fine-grained resource management.
>>>>>>> - Even if an operator implementation is reused for multiple applications, that does not guarantee the same resource requirements. During our years of practice at Alibaba, with per-operator requirements specified for Blink's fine-grained resource management, very few users (including our specialists who are dedicated to supporting Blink users) are experienced enough to accurately predict/estimate the operator resource requirements. Most people rely on the execution-time metrics (throughput, delay, CPU load, memory usage, GC pressure, etc.)
>>>>>>>   to improve the specification.
>>>>>>>
>>>>>>> To sum up:
>>>>>>> If the user is capable of providing proper resource requirements for every operator, that's definitely a good thing and we would not need to rely on the SSGs. However, that shouldn't be a *must* for the fine-grained resource management to work. For those users who are capable and do not like having to set each operator to a separate SSG, I would be OK with having both SSG-based and operator-based runtime interfaces and only falling back to the SSG requirements when the operator requirements are not specified. However, as the first step, I think we should prioritise the use cases where users are not that experienced.
>>>>>>>
>>>>>>> Thank you~
>>>>>>>
>>>>>>> Xintong Song
>>>>>>>
>>>>>>> On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler <ches...@apache.org> wrote:
>>>>>>>
>>>>>>>> Will declaring them on slot sharing groups not also waste resources if the parallelism of operators within that group differs?
>>>>>>>>
>>>>>>>> It also seems like quite a hassle for users having to recalculate the resource requirements if they change the slot sharing.
>>>>>>>> I'd think that it's not really workable for users that create a set of re-usable operators which are mixed and matched in their applications; managing the resource requirements in such a setting would be a nightmare, and in the end would require operator-level requirements anyway.
>>>>>>>> In that sense, I'm not even sure whether it really increases usability.
>>>>>>>> My main worry is that if we wire the runtime to work on SSGs it's gonna be difficult to implement more fine-grained approaches, which would not be the case if, for the runtime, they are always defined on an operator level.
>>>>>>>>
>>>>>>>> On 1/7/2021 2:42 PM, Till Rohrmann wrote:
>>>>>>>>> Thanks for drafting this FLIP and starting this discussion, Yangze.
>>>>>>>>> I like that defining resource requirements on a slot sharing group makes the overall setup easier and improves usability of resource requirements.
>>>>>>>>> What I do not like about it is that it changes slot sharing groups from being a scheduling hint to something which needs to be supported in order to support fine-grained resource requirements.
>>>>>>>>> So far, the idea of slot sharing groups was that it tells the system that a set of operators can be deployed in the same slot. But the system still had the freedom to say that it would rather place these tasks in different slots if it wanted. If we now specify resource requirements per slot sharing group, then the only option for a scheduler which does not support slot sharing groups is to say that every operator in this slot sharing group needs a slot with the same resources as the whole group.
>>>>>>>>>
>>>>>>>>> So for example, if we have a job consisting of two operators, op_1 and op_2, where each op needs 100 MB of memory, we would then say that the slot sharing group needs 200 MB of memory to run. If we have a cluster with 2 TMs with one slot of 100 MB each, then the system cannot run this job. If the resources were specified on an operator level, then the system could still make the decision to deploy op_1 to TM_1 and op_2 to TM_2.
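[Editor's sketch] The op_1/op_2 example above can be checked with a few lines of Python. This is only a toy model under stated assumptions: slots are fixed-size, memory is the only resource, and `can_place` is a hypothetical helper written for this illustration, not a Flink API.

```python
from itertools import permutations


def can_place(requests_mb, slots_mb):
    """Return True if every request can get its own, large-enough slot.

    Brute-force matching; fine for the two-slot example in the thread.
    """
    if len(requests_mb) > len(slots_mb):
        return False
    return any(
        all(req <= slot for req, slot in zip(requests_mb, chosen))
        for chosen in permutations(slots_mb, len(requests_mb))
    )


# Numbers taken from the example: two operators at 100 MB each,
# two TMs offering one 100 MB slot each.
OP_MEMORY_MB = {"op_1": 100, "op_2": 100}
TM_SLOTS_MB = [100, 100]

# SSG-level requirement: a single request covering the whole group.
ssg_request = [sum(OP_MEMORY_MB.values())]

# Operator-level requirement: one request per operator.
operator_requests = list(OP_MEMORY_MB.values())

ssg_fits = can_place(ssg_request, TM_SLOTS_MB)
per_operator_fits = can_place(operator_requests, TM_SLOTS_MB)

print("one 200 MB group fits:", ssg_fits)                # False
print("two 100 MB requests fit:", per_operator_fits)     # True
```

In this model, putting op_1 and op_2 into two separate single-operator groups produces the same two 100 MB requests as the operator-level case, which is the counterpoint raised in the reply to this example.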
>>>>>>>>> Originally, one of the primary goals of slot sharing groups was to make it easier for the user to reason about how many slots a job needs independent of the actual number of operators in the job. Interestingly, if all operators have their resources properly specified, then slot sharing is no longer needed because Flink could slice off the appropriately sized slots for every Task individually. What matters is whether the whole cluster has enough resources to run all tasks or not.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Till
>>>>>>>>>
>>>>>>>>> On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo <karma...@gmail.com> wrote:
>>>>>>>>>> Hi, there,
>>>>>>>>>>
>>>>>>>>>> We would like to start a discussion thread on "FLIP-156: Runtime Interfaces for Fine-Grained Resource Requirements"[1], where we propose Slot Sharing Group (SSG) based runtime interfaces for specifying fine-grained resource requirements.
>>>>>>>>>>
>>>>>>>>>> In this FLIP:
>>>>>>>>>> - Expound the user story of fine-grained resource management.
>>>>>>>>>> - Propose runtime interfaces for specifying SSG-based resource requirements.
>>>>>>>>>> - Discuss the pros and cons of the three potential granularities for specifying the resource requirements (op, task and slot sharing group) and explain why we choose the slot sharing group.
>>>>>>>>>>
>>>>>>>>>> Please find more details in the FLIP wiki document [1]. Looking forward to your feedback.
>>>>>>>>>>
>>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Yangze Guo
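[Editor's sketch] The trade-off raised earlier in the thread (a group whose operators have different parallelism may reserve more than it uses, and splitting it into more groups reduces that) can be made concrete with a small calculation. The model is a simplifying assumption: a group occupies as many slots as its highest operator parallelism, and every slot is sized for one subtask of each operator in the group. The operator numbers and the helper names `reserved_mb` and `used_mb` are hypothetical, chosen only for illustration.

```python
def reserved_mb(groups):
    """Total memory reserved when each group gets max-parallelism slots,
    each slot sized for the sum of its operators' per-subtask needs."""
    total = 0
    for ops in groups:
        slots = max(parallelism for parallelism, _ in ops)
        per_slot = sum(mb for _, mb in ops)
        total += slots * per_slot
    return total


def used_mb(groups):
    """Total memory the subtasks actually need, ignoring grouping."""
    return sum(parallelism * mb for ops in groups for parallelism, mb in ops)


# Each operator is (parallelism, memory per subtask in MB); made-up values.
op_a = (4, 100)
op_b = (2, 300)

one_group = [[op_a, op_b]]       # both operators share one SSG
two_groups = [[op_a], [op_b]]    # one SSG per operator

print(reserved_mb(one_group), used_mb(one_group))    # 1600 1000
print(reserved_mb(two_groups), used_mb(two_groups))  # 1000 1000
```

With a single group, 600 MB is reserved but never used, because two of the four slots host no op_b subtask. Splitting into two groups removes that gap at the cost of specifying two requirements instead of one.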