With this 4th sorter approach, how does quota work for scarce resources?

—
*Joris Van Remoortere*
Mesosphere
On Thu, Jun 16, 2016 at 11:26 AM, Guangya Liu <gyliu...@gmail.com> wrote:

> Hi Ben,
>
> The pre-condition for the four-stage allocation is that we need to put
> different resources into different sorters:
>
> 1) roleSorter includes only non-scarce resources.
> 2) quotaRoleSorter includes only non-revocable & non-scarce resources.
> 3) revocableSorter includes only revocable & non-scarce resources. This
> will be handled in MESOS-4923
> <https://issues.apache.org/jira/browse/MESOS-4923>
> 4) scarceSorter includes only scarce resources.
>
> Take your case above:
> 999 agents with (cpus:4,mem:1024,disk:1024)
> 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
>
> The four sorters would be:
> 1) roleSorter includes 1000 agents with (cpus:4,mem:1024,disk:1024)
> 2) quotaRoleSorter includes 1000 agents with (cpus:4,mem:1024,disk:1024)
> 3) revocableSorter includes nothing, as there are no revocable resources
> here.
> 4) scarceSorter includes 1 agent with (gpus:1)
>
> When allocating resources, even if a role gets the agent with GPU
> resources, its share will only be counted by the scarceSorter, not by the
> other sorters, so it will not impact the other sorters.
>
> The above solution is actually a kind of enhancement to "exclude scarce
> resources", as the scarce resources also obey the DRF algorithm with this.
>
> This solution can also be treated as dividing the whole resource pool
> logically into a scarce and a non-scarce resource pool: 1), 2) and 3)
> handle non-scarce resources while 4) focuses on scarce resources.
>
> Thanks,
>
> Guangya
>
> On Thu, Jun 16, 2016 at 2:10 AM, Benjamin Mahler <bmah...@apache.org>
> wrote:
>
> > Hm.. can you expand on how adding another allocation stage for only
> > scarce resources would behave well? It seems to have a number of
> > problems when I think through it.
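[Editor's note: a minimal sketch of the resource partitioning Guangya describes above. This is not Mesos source code; the `SCARCE` set and the dictionary representation of an agent's resources are illustrative assumptions.]

```python
# Illustrative sketch (not Mesos code): splitting each agent's resources
# between the non-scarce sorters (roleSorter, quotaRoleSorter,
# revocableSorter) and the scarceSorter, per the proposal above.

SCARCE = {"gpus"}  # assumed operator-designated scarce resource types


def split_resources(agent):
    """Split one agent's resources into (non_scarce, scarce) views."""
    non_scarce = {k: v for k, v in agent.items() if k not in SCARCE}
    scarce = {k: v for k, v in agent.items() if k in SCARCE}
    return non_scarce, scarce


# The GPU agent from the example cluster in the thread:
gpu_agent = {"gpus": 1, "cpus": 4, "mem": 1024, "disk": 1024}

non_scarce, scarce = split_resources(gpu_agent)
print(non_scarce)  # {'cpus': 4, 'mem': 1024, 'disk': 1024} -> non-scarce sorters
print(scarce)      # {'gpus': 1} -> scarceSorter
```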
> >
> > On Sat, Jun 11, 2016 at 7:59 AM, Guangya Liu <gyliu...@gmail.com> wrote:
> >
> >> Hi Ben,
> >>
> >> For the long-term goal, instead of creating sub-pools, what about
> >> adding a new sorter to handle *scarce* resources? The current logic
> >> in the allocator is divided into two stages: allocation for quota,
> >> and allocation for non-quota resources.
> >>
> >> I think that the future logic in the allocator would be divided into
> >> four stages:
> >> 1) allocation for quota
> >> 2) allocation for reserved resources
> >> 3) allocation for revocable resources
> >> 4) allocation for scarce resources
> >>
> >> Thanks,
> >>
> >> Guangya
> >>
> >> On Sat, Jun 11, 2016 at 10:50 AM, Benjamin Mahler <bmah...@apache.org>
> >> wrote:
> >>
> >>> I wanted to start a discussion about the allocation of "scarce"
> >>> resources. "Scarce" in this context means resources that are not
> >>> present on every machine. GPUs are the first example of a scarce
> >>> resource that we support as a known resource type.
> >>>
> >>> Consider the behavior when there are the following agents in a
> >>> cluster:
> >>>
> >>> 999 agents with (cpus:4,mem:1024,disk:1024)
> >>> 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
> >>>
> >>> Here there are 1000 machines but only 1 has GPUs. We call GPUs a
> >>> "scarce" resource here because they are only present on a small
> >>> percentage of the machines.
> >>>
> >>> We end up with some problematic behavior here with our current
> >>> allocation model:
> >>>
> >>> (1) If a role wishes to use both GPU and non-GPU resources for
> >>> tasks, consuming 1 GPU will lead DRF to consider the role to have a
> >>> 100% share of the cluster, since it consumes 100% of the GPUs in the
> >>> cluster. This framework will then not receive any other offers.
> >>>
> >>> (2) Because we do not have revocation yet, if a framework decides
> >>> to consume the non-GPU resources on a GPU machine, it will prevent
> >>> the GPU workloads from running!
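[Editor's note: a minimal sketch of the DRF dominant-share arithmetic behind problem (1) above. Not the Mesos implementation; the totals come directly from Ben's 1000-agent example.]

```python
# Illustrative sketch of DRF: a role's dominant share is the maximum of
# its per-resource fractions of the cluster total.

def dominant_share(allocation, totals):
    """Return the role's dominant share across all resource types."""
    return max(allocation.get(r, 0) / totals[r] for r in totals)


# Cluster totals: 999 non-GPU agents + 1 GPU agent, each with
# (cpus:4, mem:1024, disk:1024), plus 1 GPU on the GPU agent.
totals = {"cpus": 4000, "mem": 1024000, "disk": 1024000, "gpus": 1}

# A role running one small task that uses the single GPU:
allocation = {"cpus": 4, "mem": 1024, "disk": 1024, "gpus": 1}

# cpus: 4/4000 = 0.001, mem: 0.001, disk: 0.001, gpus: 1/1 = 1.0
print(dominant_share(allocation, totals))  # 1.0 -- a 100% share from 1 GPU
```

The role holds 0.1% of every other resource, yet DRF treats it as owning the entire cluster, so it stops receiving offers.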
> >>>
> >>> --------
> >>>
> >>> I filed an epic [1] to track this. The plan for the short term is to
> >>> introduce two mechanisms to mitigate these issues:
> >>>
> >>> -Introduce a resource fairness exclusion list. This allows the
> >>> shares of resources like "gpus" to be excluded from the dominant
> >>> share.
> >>>
> >>> -Introduce a GPU_AWARE framework capability. This indicates that
> >>> the scheduler is aware of GPUs and will schedule tasks accordingly.
> >>> Old schedulers will not have the capability and will not receive any
> >>> offers for GPU machines. If a scheduler has the capability, we'll
> >>> advise that they avoid placing their additional non-GPU workloads on
> >>> the GPU machines.
> >>>
> >>> --------
> >>>
> >>> Longer term, we'll want a more robust way to manage scarce
> >>> resources. The first thought we had was to have sub-pools of
> >>> resources based on machine profile and perform fair sharing / quota
> >>> within each pool. This addresses (1) cleanly, and for (2) the
> >>> operator needs to explicitly disallow non-GPU frameworks from
> >>> participating in the GPU pool.
> >>>
> >>> Unfortunately, by excluding non-GPU frameworks from the GPU pool we
> >>> may have a lower level of utilization. In the even longer term, as
> >>> we add revocation it will be possible to allow a scheduler desiring
> >>> GPUs to revoke the resources allocated to the non-GPU workloads
> >>> running on the GPU machines. There are a number of things we need to
> >>> put in place to support revocation ([2], [3], [4], etc.), so I'm
> >>> glossing over the details here.
> >>>
> >>> If anyone has any thoughts or insight in this area, please share!
> >>>
> >>> Ben
> >>>
> >>> [1] https://issues.apache.org/jira/browse/MESOS-5377
> >>> [2] https://issues.apache.org/jira/browse/MESOS-5524
> >>> [3] https://issues.apache.org/jira/browse/MESOS-5527
> >>> [4] https://issues.apache.org/jira/browse/MESOS-4392
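[Editor's note: a minimal sketch of the proposed resource fairness exclusion list from the short-term plan above: excluded resource types are simply dropped before computing the dominant share. Function names and the exclusion mechanism's shape are illustrative assumptions, not the Mesos implementation.]

```python
# Illustrative sketch: dominant share with an exclusion list applied.
# Excluded resource types (e.g. "gpus") do not count toward the share.

def dominant_share(allocation, totals, excluded=frozenset()):
    """DRF dominant share, ignoring any resource type in `excluded`."""
    return max(
        allocation.get(r, 0) / totals[r]
        for r in totals if r not in excluded
    )


totals = {"cpus": 4000, "mem": 1024000, "disk": 1024000, "gpus": 1}
allocation = {"cpus": 4, "mem": 1024, "disk": 1024, "gpus": 1}

print(dominant_share(allocation, totals))            # 1.0   (gpus dominate)
print(dominant_share(allocation, totals, {"gpus"}))  # 0.001 (gpus excluded)
```

With "gpus" excluded, the role's share reflects its tiny cpu/mem/disk footprint, so it keeps receiving offers, which is exactly the mitigation for problem (1).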