Excerpts from Chris Friesen's message of 2013-11-19 12:18:16 -0800:
> On 11/19/2013 01:51 PM, Clint Byrum wrote:
> > Excerpts from Chris Friesen's message of 2013-11-19 11:37:02 -0800:
> >> On 11/19/2013 12:35 PM, Clint Byrum wrote:
> >>
> >>> Each scheduler process can own a different set of resources. If they
> >>> each grab instance requests in a round-robin fashion, then they will
> >>> fill their resources up in a relatively well balanced way until one
> >>> scheduler's resources are exhausted. At that time it should bow out of
> >>> taking new instances. If it can't fit a request in, it should kick the
> >>> request out for retry on another scheduler.
> >>>
> >>> In this way, they only need to be in sync in that they need a way to
> >>> agree on who owns which resources. A distributed hash table that gets
> >>> refreshed whenever schedulers come and go would be fine for that.
> >>
> >> That has some potential, but at high occupancy you could end up refusing
> >> to schedule something because no one scheduler has sufficient resources
> >> even if the cluster as a whole does.
> >
> > I'm not sure what you mean here. What resource spans multiple compute
> > hosts?
>
> Imagine the cluster is running close to full occupancy, each scheduler
> has room for 40 more instances. Now I come along and issue a single
> request to boot 50 instances. The cluster has room for that, but none
> of the schedulers do.
You're assuming that all 50 come in at once. That is only one use case,
and not at all the most common one.

> >> This gets worse once you start factoring in things like heat and
> >> instance groups that will want to schedule whole sets of resources
> >> (instances, IP addresses, network links, cinder volumes, etc.) at once
> >> with constraints on where they can be placed relative to each other.
> >
> > Actually that is rather simple. Such requests have to be serialized
> > into a work-flow. So if you say "give me 2 instances in 2 different
> > locations" then you allocate 1 instance, and then another one with
> > 'not_in_location(1)' as a condition.
>
> Actually, you don't want to serialize it, you want to hand the whole set
> of resource requests and constraints to the scheduler all at once.
>
> If you do them one at a time, then early decisions made with
> less-than-complete knowledge can result in later scheduling requests
> failing due to being unable to meet constraints, even if there are
> actually sufficient resources in the cluster.
>
> The "VM ensembles" document at
> https://docs.google.com/document/d/1bAMtkaIFn4ZSMqqsXjs_riXofuRvApa--qo4UTwsmhw/edit?pli=1
> has a good example of how one-at-a-time scheduling can cause spurious
> failures.
>
> And if you're handing the whole set of requests to a scheduler all at
> once, then you want the scheduler to have access to as many resources as
> possible so that it has the highest likelihood of being able to satisfy
> the request given the constraints.

This use case is real and valid, which is why I think there is room for
multiple approaches. For instance, the situation you describe can also be
dealt with by just keeping the cloud somewhat under-utilized and accepting
that once you get above a certain utilization, spurious failures will
happen. We have a similar solution in the ext3 filesystem on Linux: don't
fill it up, or suffer a huge performance penalty.
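
To make the ownership idea above a bit more concrete, here is a rough
sketch in plain Python. It is not nova code; the names (HashRing,
Scheduler, Retry) and the in-memory capacity tracking are made up for
illustration, and the ring membership would need to be refreshed whenever
schedulers come and go:

    import hashlib


    class Retry(Exception):
        """This scheduler can't fit the request; kick it out so another
        scheduler can retry it."""


    class HashRing(object):
        """Toy stand-in for a distributed hash table mapping compute
        hosts to the scheduler that owns them."""

        def __init__(self, scheduler_names):
            # Refresh this list whenever schedulers come and go.
            self.scheduler_names = sorted(scheduler_names)

        def owner(self, host):
            digest = hashlib.md5(host.encode('utf-8')).hexdigest()
            return self.scheduler_names[int(digest, 16) %
                                        len(self.scheduler_names)]


    class Scheduler(object):
        def __init__(self, name, ring, host_free_ram):
            self.name = name
            # Only track the hosts this scheduler owns.
            self.free_ram = dict((h, r) for h, r in host_free_ram.items()
                                 if ring.owner(h) == name)

        def claim(self, ram_needed):
            """Place one instance on a host we own, or bow out."""
            for host, free in self.free_ram.items():
                if free >= ram_needed:
                    self.free_ram[host] = free - ram_needed
                    return host
            # Our resources are exhausted for this request: kick it out
            # for retry on another scheduler.
            raise Retry("%s has no room for %d MB" % (self.name, ram_needed))

    # e.g.:
    # ring = HashRing(['sched-1', 'sched-2'])
    # s1 = Scheduler('sched-1', ring, {'host-a': 4096, 'host-b': 8192})
    # s1.claim(2048)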
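
And here is roughly what I meant by serializing a grouped request into a
work-flow. Again just a sketch with made-up helpers (place_one() and
not_in_location() are not nova APIs): each step carries forward a
constraint derived from the steps already placed, so no two steps need
shared global state.

    def not_in_location(already_placed):
        """Constraint: refuse any host in a location we've already used."""
        used = set(host['location'] for host in already_placed)
        return lambda host: host['location'] not in used


    def place_one(hosts, constraint=lambda host: True):
        """Allocate one instance on the first host with room that also
        satisfies the constraint, or kick the step out for retry."""
        for host in hosts:
            if host['free_instances'] > 0 and constraint(host):
                host['free_instances'] -= 1
                return host
        raise Exception("no host satisfies the constraint; retry elsewhere")


    def boot_two_in_different_locations(hosts):
        placed = []
        placed.append(place_one(hosts))                           # instance 1
        placed.append(place_one(hosts, not_in_location(placed)))  # instance 2
        return placed

    # e.g.:
    # hosts = [{'location': 'rack1', 'free_instances': 2},
    #          {'location': 'rack2', 'free_instances': 1}]
    # boot_two_in_different_locations(hosts)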