Hi Ben,

As a long-term goal, instead of creating sub-pools, what about adding a new sorter to handle **scarce** resources? The current logic in the allocator is divided into two stages: allocation for quota, then allocation for non-quota resources.
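As a rough illustration of what "stages" means here, the sketch below runs an ordered list of stages over a shared pool, so each stage only sees what the earlier ones left unallocated. It is a simplified model and not the Mesos allocator code: the function name, the demand-based interface, and the role names are all hypothetical.

```python
def allocate_in_stages(available, stages):
    """Run stages in order; each stage draws from what earlier stages left over.

    available: dict of resource name -> amount (hypothetical layout).
    stages: ordered list of (stage name, {role: {resource: amount wanted}}).
    """
    remaining = dict(available)
    grants = []  # (stage name, role, resources granted)
    for stage_name, demands in stages:
        for role, demand in demands.items():
            granted = {}
            for resource, amount in demand.items():
                take = min(amount, remaining.get(resource, 0))
                if take > 0:
                    granted[resource] = take
                    remaining[resource] -= take
            if granted:
                grants.append((stage_name, role, granted))
    return grants, remaining


# The two stages described above: quota first, then everything else.
stages = [
    ("quota", {"role-a": {"cpus": 4}}),
    ("non-quota", {"role-b": {"cpus": 8}}),
]
grants, remaining = allocate_in_stages({"cpus": 10}, stages)
```

In this model the order of the `stages` list is the priority order, so a further stage would simply be one more entry in that list.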
I think the future logic in the allocator would be divided into four stages:

1) allocation for quota
2) allocation for reserved resources
3) allocation for revocable resources
4) allocation for scarce resources

Thanks,

Guangy

On Sat, Jun 11, 2016 at 10:50 AM, Benjamin Mahler <bmah...@apache.org> wrote:
> I wanted to start a discussion about the allocation of "scarce" resources.
> "Scarce" in this context means resources that are not present on every
> machine. GPUs are the first example of a scarce resource that we support as
> a known resource type.
>
> Consider the behavior when there are the following agents in a cluster:
>
> 999 agents with (cpus:4,mem:1024,disk:1024)
> 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
>
> Here there are 1000 machines but only 1 has GPUs. We call GPUs a "scarce"
> resource here because they are only present on a small percentage of the
> machines.
>
> We end up with some problematic behavior here with our current allocation
> model:
>
> (1) If a role wishes to use both GPU and non-GPU resources for tasks,
> consuming 1 GPU will lead DRF to consider the role to have a 100% share of
> the cluster, since it consumes 100% of the GPUs in the cluster. This
> framework will then not receive any other offers.
>
> (2) Because we do not have revocation yet, if a framework decides to
> consume the non-GPU resources on a GPU machine, it will prevent the GPU
> workloads from running!
>
> --------
>
> I filed an epic [1] to track this. The plan for the short-term is to
> introduce two mechanisms to mitigate these issues:
>
> -Introduce a resource fairness exclusion list. This allows the shares
> of resources like "gpus" to be excluded from the dominant share.
>
> -Introduce a GPU_AWARE framework capability. This indicates that the
> scheduler is aware of GPUs and will schedule tasks accordingly. Old
> schedulers will not have the capability and will not receive any offers for
> GPU machines. If a scheduler has the capability, we'll advise that they
> avoid placing their additional non-GPU workloads on the GPU machines.
>
> --------
>
> Longer term, we'll want a more robust way to manage scarce resources. The
> first thought we had was to have sub-pools of resources based on machine
> profile and perform fair sharing / quota within each pool. This addresses
> (1) cleanly, and for (2) the operator needs to explicitly disallow non-GPU
> frameworks from participating in the GPU pool.
>
> Unfortunately, by excluding non-GPU frameworks from the GPU pool we may
> have a lower level of utilization. In the even longer term, as we add
> revocation it will be possible to allow a scheduler desiring GPUs to revoke
> the resources allocated to the non-GPU workloads running on the GPU
> machines. There are a number of things we need to put in place to support
> revocation ([2], [3], [4], etc), so I'm glossing over the details here.
>
> If anyone has any thoughts or insight in this area, please share!
>
> Ben
>
> [1] https://issues.apache.org/jira/browse/MESOS-5377
> [2] https://issues.apache.org/jira/browse/MESOS-5524
> [3] https://issues.apache.org/jira/browse/MESOS-5527
> [4] https://issues.apache.org/jira/browse/MESOS-4392
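
To make problem (1) in the quoted message concrete, here is a minimal sketch of the dominant-share arithmetic for the 1000-agent example, with an optional exclusion list. It is only an illustration under stated assumptions, not how Mesos implements DRF: the function name, the dict layout, and the task size (1 cpu, 256 mem alongside the GPU) are hypothetical.

```python
def dominant_share(allocated, total, excluded=()):
    """Largest fraction of any single resource held by a role.

    Resources named in `excluded` are left out of the comparison.
    """
    shares = [
        allocated.get(name, 0) / total[name]
        for name in total
        if name not in excluded and total[name] > 0
    ]
    return max(shares, default=0.0)


# Cluster totals from the example: 1000 agents, each with
# cpus:4, mem:1024, disk:1024, and exactly one agent with gpus:1.
total = {"cpus": 4000, "mem": 1024000, "disk": 1024000, "gpus": 1}

# A role that has consumed the single GPU plus a small (hypothetical) task.
allocated = {"gpus": 1, "cpus": 1, "mem": 256}

without_exclusion = dominant_share(allocated, total)
with_exclusion = dominant_share(allocated, total, excluded={"gpus"})
```

Without the exclusion the role's share is 1.0, because it holds the only GPU. With "gpus" excluded its share drops to 0.00025 (1 of 4000 cpus), so it would still look like a very small consumer to the sorter.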