Clint Byrum <cl...@fewbar.com> wrote on 09/27/2013 11:58:16 AM:

> From: Clint Byrum <cl...@fewbar.com>
> To: openstack-dev <openstack-dev@lists.openstack.org>,
> Date: 09/27/2013 12:01 PM
> Subject: Re: [openstack-dev] [scheduler] [heat] Policy specifics
> ...
> > Mike,
> >
> > These are not the kinds of specifics that are of any help at all in
> > figuring out how (or, indeed, whether) to incorporate holistic
> > scheduling into OpenStack.
>
> I agree that the things in that page are a wet dream of logical deployment
> fun. However, I think one can target just a few of the basic ones,
> and see a real achievable case forming. I think I grasp Mike's ideas,
> so I'll respond to your concerns with what I think. Note that it is
> highly likely I've gotten some of this wrong.
It remains to be seen whether those things can be anything more than a wet
dream for OpenStack, but they are running code elsewhere, so I have hope.
What I wrote is pretty much a dump of what we have. The exception is the
network bandwidth stuff, which our holistic infrastructure scheduler
currently ignores because we do not have a way to get the relevant capacity
information from the physical infrastructure. Part of the agenda here is to
nudge Neutron to improve in that way.

> > - What would a holistic scheduling service look like? A standalone
> > service? Part of heat-engine?
>
> I see it as a preprocessor of sorts for the current infrastructure engine.
> It would take the logical expression of the cluster and either turn
> it into actual deployment instructions or respond to the user that it
> cannot succeed. Ideally it would just extend the same Heat API.

My own expectation is that it would be its own service, preceding
infrastructure orchestration in the flow. Alternatively, we could bundle
holistic infrastructure scheduling, infrastructure orchestration, and
software orchestration preparation together under one API but still
maintained as fairly separate modules of functionality. Or various ideas in
between. I do not yet have a strong reason for one choice over another; I
have been looking to gain cluefulness from discussion with you folks.

> > - How will the scheduling service reserve slots for resources in advance
> > of them being created? How will those reservations be accounted for and
> > billed?
> > - In the event that slots are reserved but those reservations are not
> > taken up, what will happen?
>
> I don't see the word "reserve" in Mike's proposal, and I don't think this
> is necessary for the more basic models like Collocation and
> Anti-Collocation.
>
> Reservations would of course make the scheduling decisions more likely to
> succeed, but it isn't necessary if we do things optimistically.
> If the stack create or update fails, we can retry with better parameters.

The raw truth of the matter is that even Nova has this problem already. The
real ground truth of resource usage is in the hypervisor, not Nova. When
Nova makes a decision, it really is provisional until confirmed by the
hypervisor. I have heard of cases, in different cloud software, where the
thing making the placement decisions does not have a truly accurate picture
of the resource usage. These are typically caused by corner cases in
failure scenarios, where the decision maker thinks something did not happen
or was successfully deleted, but in reality there is a zombie left over
consuming some resources in the hypervisor. I am guessing there are
probably cases where this can happen in OpenStack too. Also, OpenStack does
not prevent someone from going around Nova and directly asking a hypervisor
to do something.

> > - Once scheduled, how will resources be created in their proper slots as
> > part of a Heat template?
>
> In goes a Heat template (sorry for not using HOT.. still learning it. ;)
>
> Resources:
>   ServerTemplate:
>     Type: Some::Defined::ProviderType
>   HAThing1:
>     Type: OS::Heat::HACluster
>     Properties:
>       ClusterSize: 3
>       MaxPerAZ: 1
>       PlacementStrategy: anti-collocation
>       Resources: [ ServerTemplate ]
>
> And if we have at least 2 AZ's available, it feeds to the heat engine:
>
> Resources:
>   HAThing1-0:
>     Type: Some::Defined::ProviderType
>     Parameters:
>       availability-zone: zone-A
>   HAThing1-1:
>     Type: Some::Defined::ProviderType
>     Parameters:
>       availability-zone: zone-B
>   HAThing1-2:
>     Type: Some::Defined::ProviderType
>     Parameters:
>       availability-zone: zone-A
>
> If not, holistic scheduler says back "I don't have enough AZ's to
> satisfy MaxPerAZ".

Actually, I was thinking something even simpler (in the simple cases :-).
By simple cases I mean where the holistic infrastructure scheduler makes
all the placement decisions.
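As an aside, the expansion step Clint sketches could look something like the following. This is only a hypothetical sketch: `expand_ha_cluster` is an invented function, and HACluster, ClusterSize, and MaxPerAZ come from Clint's example template, not from any real Heat resource type.

```python
# Hypothetical sketch of the preprocessor described above: expand a logical
# HA-cluster resource into concrete per-AZ resources, or report that the
# MaxPerAZ constraint cannot be satisfied. All names are illustrative.

def expand_ha_cluster(name, cluster_size, max_per_az, zones, provider_type):
    """Spread cluster_size members round-robin across zones, refusing if
    any zone would end up holding more than max_per_az members."""
    zones_needed = -(-cluster_size // max_per_az)  # ceiling division
    if len(zones) < zones_needed:
        raise ValueError("I don't have enough AZ's to satisfy MaxPerAZ "
                         "(need %d, have %d)" % (zones_needed, len(zones)))
    resources = {}
    for i in range(cluster_size):
        # Round-robin keeps every zone at or below max_per_az members,
        # given the zones_needed check above.
        resources["%s-%d" % (name, i)] = {
            "Type": provider_type,
            "Parameters": {"availability-zone": zones[i % len(zones)]},
        }
    return resources
```

With ClusterSize 3 and MaxPerAZ 1, this version would demand three zones rather than two, since two zones cannot hold three members at one per zone; relaxing MaxPerAZ to 2 makes two zones sufficient.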
In that case, it only needs to get Nova to implement the decisions already
made. So the API call or template fragment for a VM instance would include
an AZ parameter that specifies the particular host already chosen for that
VM instance. Similarly for Cinder, except that its handling of AZ has been
broken. But I hear that is or will be fixed; in the meantime it is possible
to abuse volume types to get this job done.

> Now, if Nova grows anti-affinity under the covers that it can manage
> directly, a later version can just spit out:
>
> Resources:
>   HAThing1-0:
>     Type: Some::Defined::ProviderType
>     Parameters:
>       instance-group: 0
>       affinity-type: anti
>   HAThing1-1:
>     Type: Some::Defined::ProviderType
>     Parameters:
>       instance-group: 1
>       affinity-type: anti
>   HAThing1-2:
>     Type: Some::Defined::ProviderType
>     Parameters:
>       instance-group: 0
>       affinity-type: anti

Yes, if there are no strong interactions between the placement of certain
VMs and other non-Nova placement decisions, then the placement
decision-making for those VMs can be deferred to Nova (provided Nova has a
rich enough interface). I do not follow exactly your thinking in your
example, but I think we agree on the principle.

> The point is that the user cares about their servers not being in the
> same failure domain, not how that happens.

Right. As you point out later, that is kind of the big picture here.

> > - What about when the user calls the APIs directly? (i.e. does their own
> > orchestration - either hand-rolled or using their own standalone Heat.)
>
> This has come up with autoscaling too. "Undefined" - that's not your stack.

This may be another face of the concern behind "reservation". There is a
significant issue to discuss around how a holistic infrastructure scheduler
(indeed, any scheduler really) interacts with something that goes around it
to the underlying resources.
It is tempting to suggest that multiple independent managers can somehow
cooperate in managing a common pool of resources, but this rarely works out
well. I think the practical solution is to focus on one manager for any
given resource. But that manager must cope somewhat gracefully with
surprises, because they can happen (as I mentioned above).

> > - How and from where will the scheduling service obtain the utilisation
> > data needed to perform the scheduling? What mechanism will segregate
> > this information from the end user?
>
> I do think this is a big missing piece. Right now it is spread out
> all over the place. Keystone at least has regions, so that could be
> incorporated now. I briefly dug through the other API's and don't see
> a way to enumerate AZ's or cells. Perhaps it is hiding in extensions?
>
> I don't think this must be segregated from end users. An API for "show
> me the placement decisions I can make" seems useful for anybody trying
> to automate deployments. Anyway, probably best to keep it decentralized
> and just make it so that each service can respond with lists of arguments
> to their API that are likely to succeed.

First, I think "utilization" is not the best word for what matters here.
CPU utilization, for example, is something that fluctuates fairly quickly.
VM placement should be based on long-term allocation decisions. Those might
be informed by utilization information, but they are a distinct thing.
Isn't it true today that Nova packs VMs onto a hypervisor by comparing
virtual CPUs with real CPUs (multiplied by some configured overcommitment
factor)? That is an example of the sort of allocation-based decision making
I am talking about. It does not require new utilization information from
anyone; it requires the scheduler to keep track of the allocations it has
made --- and the allocations it has discovered someone else has made too.
For the latter, I think two mechanisms are good.
First, the underlying resource service should be able to report all the
allocations (e.g., the nova compute agents should be able to report what VM
instances are already running, regardless of who started them). Second,
recognize that any software can get confused, including the resource
service; the scheduler should be able to formulate and keep track of an
adjustment to what the resource service is saying. This second point is a
fine point; no need to worry about it at first if Nova is not already doing
such a thing.

> > - How will users communicate their scheduling constraints to OpenStack?
> > (Through which API and in what form?)
>
> See above. Via the Heat API, a Heat-ish template that is turned into
> concrete Heat instructions.

Yes, this seems like a repeat of the above discussion.

> > - What value does this provide (over and above non-holistic scheduler
> > hints passed in individual API calls) to end users? Public cloud
> > operators? Private cloud operators? How might the value be shared
> > between users and operators, and how would that be accounted for?
>
> See above, logically expressing what you actually want means the tool
> can improve its response to that. Always expressing things concretely
> means that improvements on the backend are harder to realize.
>
> Ultimately, it is an end-user tool, but the benefit to a cloud operator
> could be significant. If one AZ is getting flooded, one can stop
> responding to it, or add hints in the API ranking the AZ lower than
> the others in preference. Users using the holistic scheduler will begin
> using the new AZ without having to be educated about it.

Yes. Even though a public cloud would not expose all the information that a
holistic infrastructure scheduler needs to do its job (rather, the holistic
infrastructure scheduler would be part of the public cloud's service), it
can accept templates that involve holistic infrastructure scheduling.
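Going back to the allocation accounting point above (as distinct from fluctuating utilization), a minimal sketch with invented names of what allocation-based admission looks like: admit a VM onto a host only when its virtual CPUs fit within real CPUs times an overcommitment factor, counting both the allocations this scheduler made and the ones it has discovered someone else made.

```python
# Hypothetical sketch of allocation-based packing: the numbers tracked here
# are long-term allocations, not instantaneous CPU utilization. The class
# and method names are illustrative, not any real Nova API.

class HostAllocations:
    def __init__(self, real_cpus, overcommit=2.0):
        self.capacity = real_cpus * overcommit  # vCPUs this host may carry
        self.allocated = 0                      # vCPUs known to be in use

    def discover(self, vcpus):
        """Record an allocation made outside this scheduler, e.g. a VM the
        compute agent reports that someone else started, or a zombie."""
        self.allocated += vcpus

    def try_place(self, vcpus):
        """Provisionally allocate; as noted above, the decision is only
        really confirmed once the hypervisor carries it out."""
        if self.allocated + vcpus > self.capacity:
            return False
        self.allocated += vcpus
        return True
```

The point of `discover` is exactly the first mechanism above: the scheduler's picture of a host must incorporate allocations it did not itself make.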
It is a separation of concerns play: the template author says what he wants
without getting overly specific about how it gets done.

> > - Does this fit within the scope of an existing OpenStack program? Which
> > one? Why?
>
> Heat. You want users to use holistic scheduling when it can work for them,
> so having it just be a tweak to their templates is a win.

I think this is actually a pretty interesting question. If we recognize
that the heat program has a bigger view (all kinds of orchestration) than
today's heat engine (infrastructure orchestration), this can partly help
untangle the heat/not-heat debate. Holistic infrastructure scheduling is a
form of scheduling, and the nova scheduler group has some interest in it,
but it is inherently not limited to nova. I think it fits best between
software orchestration preparation and infrastructure orchestration --- in
the middle of the interests of the heat program. I think we may want to
recognize that the best flow of processing does not necessarily intersect
the interests of a given program in only one contiguous region.

> > - What changes are required to existing services to accommodate this
> > functionality?
>
> More exposure of what details can be exploited.

Yes, and the level of control needed to do that exploitation. The
meta-model in that policy document identifies the level of details
involved.

Regards,
Mike
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev