On 6/19/2017 9:17 AM, Jay Pipes wrote:
On 06/19/2017 09:04 AM, Edward Leafe wrote:
Current flow:
As noted in the nova-scheduler meeting this morning, this should have
been called "original plan" rather than "current flow", as Jay pointed
out inline.
* Scheduler gets a req spec from conductor, containing resource
requirements
* Scheduler sends those requirements to placement
* Placement runs a query to determine the root RPs that can satisfy
those requirements
Not root RPs. Non-sharing resource providers, which currently
effectively means compute node providers. The nested resource providers
work isn't yet merged, so there is currently no concept of a hierarchy
of providers.
* Placement returns a list of the UUIDs for those root providers to
scheduler
It returns the provider names and UUIDs, yes.
* Scheduler uses those UUIDs to create HostState objects for each host
Kind of. The scheduler calls ComputeNodeList.get_all_by_uuid(), passing
in a list of the provider UUIDs it got back from the placement service.
The scheduler then builds a set of HostState objects from the results of
ComputeNodeList.get_all_by_uuid().
The scheduler also keeps a set of AggregateMetadata objects in memory,
including the association of aggregate to host (note: this is the
compute node's *service*, not the compute node object itself, thus the
reason aggregates don't work properly for Ironic nodes).
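To make that concrete, here's a rough sketch of that step using
simplified stand-ins for the objects involved (the real ComputeNodeList
/ HostState interfaces carry much more than this):

# Illustrative sketch only: simplified stand-ins for the Nova objects
# discussed above (ComputeNodeList, HostState, aggregate metadata).
from dataclasses import dataclass, field


@dataclass
class ComputeNode:
    uuid: str
    host: str
    vcpus: int
    memory_mb: int


@dataclass
class HostState:
    uuid: str
    host: str
    vcpus: int
    memory_mb: int
    aggregates: list = field(default_factory=list)


def build_host_states(provider_uuids, compute_node_lookup, host_aggregates):
    """Mimic the scheduler step described above.

    provider_uuids: UUIDs returned by the placement service.
    compute_node_lookup: stand-in for ComputeNodeList.get_all_by_uuid(),
        i.e. a callable returning ComputeNode records for those UUIDs.
    host_aggregates: in-memory map of host (service) name -> aggregates,
        standing in for the cached AggregateMetadata associations.
    """
    host_states = []
    for node in compute_node_lookup(provider_uuids):
        host_states.append(
            HostState(
                uuid=node.uuid,
                host=node.host,
                vcpus=node.vcpus,
                memory_mb=node.memory_mb,
                # Aggregates are keyed by the compute node's *service*
                # host, which is why Ironic nodes behave oddly here.
                aggregates=host_aggregates.get(node.host, []),
            )
        )
    return host_states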
* Scheduler runs those HostState objects through filters to remove
those that don't meet requirements that placement doesn't select for
Yep.
* Scheduler runs the remaining HostState objects through weighers to
order them in terms of best fit.
Yep.
* Scheduler takes the host at the top of that ranked list, and tries
to claim the resources in placement. If that fails, there is a race,
so that HostState is discarded, and the next is selected. This is
repeated until the claim succeeds.
No, this is not how things work currently. The scheduler does not claim
resources. It selects the top host (or a random host, depending on the
selection strategy) and sends the launch request to the target compute node. The
target compute node then attempts to claim the resources and in doing so
writes records to the compute_nodes table in the Nova cell database as
well as the Placement API for the compute node resource provider.
Not to nitpick, but today the scheduler sends the selected destinations
to the conductor. Conductor looks up the cell that a selected host is
in, creates the instance record and friends (BDMs) in that cell, and then
sends the build request to the compute host in that cell.
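In toy form, that conductor step is roughly the following (in-memory
stand-ins for the cell databases and the message queue; none of these
names are real Nova APIs):

# Illustrative sketch only: a toy, in-memory model of the conductor step
# described above; none of these names are real Nova APIs.
def dispatch_build(selected_host, host_to_cell, cell_databases, compute_queues):
    """Conductor-side dispatch: map host -> cell, create records, send build.

    host_to_cell: mapping of compute host name to cell name.
    cell_databases: per-cell lists standing in for each cell database,
        where the instance record and friends (BDMs, etc.) get created.
    compute_queues: per-host lists standing in for the message queue.
    """
    cell = host_to_cell[selected_host]

    # Create the instance record (and related records) in that cell's DB.
    instance = {"host": selected_host, "cell": cell, "vm_state": "building"}
    cell_databases[cell].append(instance)

    # Send the build request to the compute host in that cell; today it
    # is the compute host that then attempts the resource claim.
    compute_queues[selected_host].append({"op": "build", "instance": instance})
    return instance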
* Scheduler then creates a list of N UUIDs, with the first being the
selected host, and the rest being alternates consisting of the
next hosts in the ranked list that are in the same cell as the
selected host.
This isn't currently how things work, no. This has been discussed, however.
* Scheduler returns that list to conductor.
* Conductor determines the cell of the selected host, and sends that
list to the target cell.
* Target cell tries to build the instance on the selected host. If it
fails, it unclaims the resources for the selected host, and tries to
claim the resources for the next host in the list. It then tries to
build the instance on the next host in the list of alternates. Only
when all alternates fail does the build request fail.
This isn't currently how things work, no. There has been discussion of
having the compute node retry alternatives locally, but nothing more
than discussion.
Correct that this isn't how things currently work, but it was/is the
original plan. And the retry happens within the cell conductor, not on
the compute node itself. The top-level conductor is what's getting
selected hosts from the scheduler. The cell-level conductor is what's
getting a retry request from the compute. The cell-level conductor would
deallocate from placement for the currently claimed providers, and then
pick one of the alternatives passed down from the top, make allocations
(a claim) against it, and then send to that alternative compute host for
another build attempt.
So with this plan, there are two places to make allocations - the
scheduler first, and then the cell conductors for retries. This
duplication is why some people were originally pushing to have all
allocation-related work happen in the conductor service.
Proposed flow:
* Scheduler gets a req spec from conductor, containing resource
requirements
* Scheduler sends those requirements to placement
* Placement runs a query to determine the root RPs that can satisfy
those requirements
Yes.
* Placement then constructs a data structure for each root provider as
documented in the spec. [0]
Yes.
* Placement returns a number of these data structures as JSON blobs.
Due to the size of the data, a page size will have to be determined,
and placement will have to either maintain that list of structured
data for subsequent requests, or re-run the query and only calculate
the data structures for the hosts that fit in the requested page.
"of these data structures as JSON blobs" is kind of redundant... all our
REST APIs return data structures as JSON blobs.
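For reference, the rough shape being proposed in [0] is something like
this; the keys and values below are purely illustrative, and the exact
format is whatever lands in that review:

# Illustrative only: rough shape of a GET /allocation_candidates response
# as being discussed in [0]; keys and values here are placeholders.
allocation_candidates_response = {
    "allocation_requests": [
        {
            "allocations": [
                {
                    "resource_provider": {"uuid": "COMPUTE_NODE_PROVIDER_UUID"},
                    "resources": {"VCPU": 2, "MEMORY_MB": 4096, "DISK_GB": 20},
                },
            ],
        },
        # ... one entry per distinct way the request could be satisfied
    ],
    "provider_summaries": {
        "COMPUTE_NODE_PROVIDER_UUID": {
            "resources": {
                "VCPU": {"capacity": 16, "used": 4},
                "MEMORY_MB": {"capacity": 32768, "used": 8192},
            },
        },
        # ... one entry per provider involved in any allocation request
    },
}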
While we discussed the fact that there may be a lot of entries, we did
not say we'd immediately support a paging mechanism.
I believe we said in the initial version we'd have the configurable
limit in the DB API queries, like we have today - the default limit is
1000. There was agreement to eventually build paging support into the API.
This does make me wonder though what happens when you have 100K or more
compute nodes reporting into placement and we limit on the first 1000.
Aren't we going to be imposing a packing strategy then just because of
how we pull things out of the database for Placement? Although I don't
see how that would be any different from before we had Placement and the
nova-scheduler service just did a ComputeNode.get_all() to the nova DB
and then filtered/weighed those objects.
* Scheduler continues to request the paged results until it has them all.
See above. Was discussed briefly as a concern but not work to do for
first patches.
* Scheduler then runs this data through the filters and weighers. No
HostState objects are required, as the data structures will contain
all the information that scheduler will need.
No, this isn't correct. The scheduler will have *some* of the
information it requires for weighing from the returned data from the GET
/allocation_candidates call, but not all of it.
Again, operators have insisted on keeping the flexibility currently in
the Nova scheduler to weigh/sort compute nodes by things like thermal
metrics and kinds of data that the Placement API will never be
responsible for.
The scheduler will need to merge information from the
"provider_summaries" part of the HTTP response with information it has
already in its HostState objects (gotten from
ComputeNodeList.get_all_by_uuid() and AggregateMetadataList).
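A loose sketch of that merge step, again with simplified stand-ins for
HostState and the provider_summaries payload:

# Illustrative sketch only: simplified stand-ins for HostState and the
# "provider_summaries" portion of the GET /allocation_candidates response.
from dataclasses import dataclass, field


@dataclass
class HostState:
    uuid: str
    host: str
    # Data placement will never track, e.g. thermal metrics, which
    # operators still want to weigh on.
    metrics: dict = field(default_factory=dict)
    aggregates: list = field(default_factory=list)
    # Filled in from provider_summaries below.
    placement_resources: dict = field(default_factory=dict)


def merge_provider_summaries(host_states, provider_summaries):
    """Attach placement's usage/capacity view to the existing HostStates.

    host_states: objects built from ComputeNodeList.get_all_by_uuid()
        and the cached aggregate metadata, as described above.
    provider_summaries: the per-provider summaries from placement.
    """
    by_uuid = {hs.uuid: hs for hs in host_states}
    for provider_uuid, summary in provider_summaries.items():
        hs = by_uuid.get(provider_uuid)
        if hs is not None:
            hs.placement_resources = summary.get("resources", {})
    # Filters/weighers then operate on HostState objects that carry both
    # the placement data and the scheduler-only data (metrics, aggregates).
    return list(by_uuid.values())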
* Scheduler then selects the data structure at the top of the ranked
list. Inside that structure is a dict of the allocation data that
scheduler will need to claim the resources on the selected host. If
the claim fails, the next data structure in the list is chosen, and
the process is repeated until a claim succeeds.
Kind of, yes. The scheduler will select a *host* that meets its needs.
There may be more than one allocation request that includes that host
resource provider, because of shared providers and (soon) nested
providers. The scheduler will choose one of these allocation requests
and attempt a claim of resources by simply doing a PUT to
/allocations/{instance_uuid} with the serialized body of that allocation
request. If a 202 is returned, cool. If not, repeat for the next
allocation request.
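Roughly something like the sketch below; it talks to placement directly
for illustration, whereas real code would go through Nova's placement
client and deal with microversions, project/user info, and error
handling:

# Illustrative sketch only: a bare-bones version of the claim loop
# described above.
def claim_first_fit(placement_url, session, instance_uuid, allocation_requests):
    """Try each allocation request in ranked order until one sticks.

    session: e.g. a requests.Session pointed at the placement service.
    allocation_requests: the "allocation_requests" list from the
        GET /allocation_candidates response, already ranked by the
        scheduler's weighers.
    """
    for alloc_req in allocation_requests:
        resp = session.put(
            "%s/allocations/%s" % (placement_url, instance_uuid),
            json=alloc_req,
        )
        if resp.status_code == 202:
            # Claim succeeded; this is what gets handed to conductor.
            return alloc_req
        # Claim failed (e.g. another scheduler won a race for the same
        # resources), so fall through and try the next candidate.
    return None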
* Scheduler then creates a list of N of these data structures, with
the first being the data for the selected host, and the rest being
data structures representing alternates consisting of the next hosts
in the ranked list that are in the same cell as the selected host.
Yes, this is the proposed solution for allowing retries within a cell.
* Scheduler returns that list to conductor.
* Conductor determines the cell of the selected host, and sends that
list to the target cell.
* Target cell tries to build the instance on the selected host. If it
fails, it uses the allocation data in the data structure to unclaim
the resources for the selected host, and tries to claim the resources
for the next host in the list using its allocation data. It then tries
to build the instance on the next host in the list of alternates. Only
when all alternates fail does the build request fail.
I'll let Dan discuss this last part.
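For what it's worth, the cell-level retry those last bullets describe
would look roughly like the sketch below; the helper and key names
(try_build, "allocation_request", "host") are hypothetical, not actual
conductor code:

# Illustrative sketch only: cell-side retry over the alternates passed
# down from the scheduler, as described in the bullets above.
def build_with_alternates(placement_url, session, instance_uuid,
                          candidates, try_build):
    """Try the selected host first, then each alternate in order.

    candidates: the data structures passed down from the scheduler,
        selected host first; each is assumed to carry the host name and
        the allocation data needed to claim it (hypothetical keys).
    try_build: stand-in for sending the build request to a compute host
        and waiting for success/failure.
    session: e.g. a requests.Session pointed at the placement service.
    """
    claimed = True  # the scheduler already claimed the selected host
    for candidate in candidates:
        if not claimed:
            # Claim this alternate using its allocation data.
            resp = session.put(
                "%s/allocations/%s" % (placement_url, instance_uuid),
                json=candidate["allocation_request"],
            )
            if resp.status_code != 202:
                # Lost a race for this alternate; try the next one.
                continue
        if try_build(candidate["host"]):
            return candidate["host"]
        # Build failed on this host: unclaim its resources before retrying.
        session.delete("%s/allocations/%s" % (placement_url, instance_uuid))
        claimed = False
    # Only when the selected host and every alternate fail does the
    # request fail.
    raise RuntimeError("build failed on the selected host and all alternates")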
Best,
-jay
[0] https://review.openstack.org/#/c/471927/
--
Thanks,
Matt