On Mon, 2017-06-19 at 09:36 -0500, Matt Riedemann wrote:
> On 6/19/2017 9:17 AM, Jay Pipes wrote:
> > On 06/19/2017 09:04 AM, Edward Leafe wrote:
> > > Current flow:
>
> As noted in the nova-scheduler meeting this morning, this should have been called "original plan" rather than "current flow", as Jay pointed out inline.
>
> > > * Scheduler gets a req spec from conductor, containing resource requirements
> > > * Scheduler sends those requirements to placement
> > > * Placement runs a query to determine the root RPs that can satisfy those requirements
> >
> > Not root RPs. Non-sharing resource providers, which currently effectively means compute node providers. Nested resource providers isn't yet merged, so there is currently no concept of a hierarchy of providers.
> >
> > > * Placement returns a list of the UUIDs for those root providers to scheduler
> >
> > It returns the provider names and UUIDs, yes.
> >
> > > * Scheduler uses those UUIDs to create HostState objects for each
> >
> > Kind of. The scheduler calls ComputeNodeList.get_all_by_uuid(), passing in a list of the provider UUIDs it got back from the placement service. The scheduler then builds a set of HostState objects from the results of ComputeNodeList.get_all_by_uuid().
> >
> > The scheduler also keeps a set of AggregateMetadata objects in memory, including the association of aggregate to host (note: this is the compute node's *service*, not the compute node object itself, thus the reason aggregates don't work properly for Ironic nodes).
> >
> > > * Scheduler runs those HostState objects through filters to remove those that don't meet requirements not selected for by placement
> >
> > Yep.
> >
> > > * Scheduler runs the remaining HostState objects through weighers to order them in terms of best fit.
> >
> > Yep.
> >
> > > * Scheduler takes the host at the top of that ranked list, and tries to claim the resources in placement. If that fails, there is a race, so that HostState is discarded, and the next is selected. This is repeated until the claim succeeds.
> >
> > No, this is not how things work currently. The scheduler does not claim resources. It selects the top host (or a random host, depending on the selection strategy) and sends the launch request to the target compute node. The target compute node then attempts to claim the resources, and in doing so writes records to the compute_nodes table in the Nova cell database as well as to the Placement API for the compute node resource provider.
>
> Not to nit pick, but today the scheduler sends the selected destinations to the conductor. Conductor looks up the cell that a selected host is in, creates the instance record and friends (BDMs) in that cell, and then sends the build request to the compute host in that cell.
>
> > > * Scheduler then creates a list of N UUIDs, with the first being the selected host, and the rest being alternates consisting of the next hosts in the ranked list that are in the same cell as the selected host.
> >
> > This isn't currently how things work, no. This has been discussed, however.
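Since that "list of N UUIDs" step comes up again in the proposed flow further down, here is a minimal sketch of how a scheduler might build such a list from its ranked hosts. This is illustrative only; the names (RankedHost, pick_destinations, num_alternates) are invented for the example and this is not actual Nova code.

    # Illustrative only: the "selected host plus same-cell alternates" idea
    # discussed above. Names are hypothetical, not real Nova code.
    from collections import namedtuple

    # A weighed/ranked host as the scheduler might see it after filtering.
    RankedHost = namedtuple('RankedHost', ['uuid', 'cell_uuid'])


    def pick_destinations(ranked_hosts, num_alternates=2):
        """Return the selected host plus alternates from the same cell.

        ranked_hosts is assumed to be ordered best-first by the weighers.
        """
        if not ranked_hosts:
            return []
        selected = ranked_hosts[0]
        destinations = [selected]
        for candidate in ranked_hosts[1:]:
            if len(destinations) > num_alternates:
                break
            # Alternates must live in the same cell as the selected host so
            # the cell can retry locally without calling back up.
            if candidate.cell_uuid == selected.cell_uuid:
                destinations.append(candidate)
        return destinations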
> > > * Scheduler returns that list to conductor.
> > > * Conductor determines the cell of the selected host, and sends that list to the target cell.
> > > * Target cell tries to build the instance on the selected host. If it fails, it unclaims the resources for the selected host, and tries to claim the resources for the next host in the list. It then tries to build the instance on the next host in the list of alternates. Only when all alternates fail does the build request fail.
> >
> > This isn't currently how things work, no. There has been discussion of having the compute node retry alternatives locally, but nothing more than discussion.
>
> Correct that this isn't how things currently work, but it was/is the original plan. And the retry happens within the cell conductor, not on the compute node itself. The top-level conductor is what's getting selected hosts from the scheduler. The cell-level conductor is what's getting a retry request from the compute. The cell-level conductor would deallocate from placement for the currently claimed providers, pick one of the alternatives passed down from the top, make allocations (a claim) against those, and then send to an alternative compute host for another build attempt.
>
> So with this plan, there are two places to make allocations - the scheduler first, and then the cell conductors for retries. This duplication is why some people were originally pushing to have all allocation-related work happen in the conductor service.
>
> > > Proposed flow:
> > > * Scheduler gets a req spec from conductor, containing resource requirements
> > > * Scheduler sends those requirements to placement
> > > * Placement runs a query to determine the root RPs that can satisfy those requirements
> >
> > Yes.
> >
> > > * Placement then constructs a data structure for each root provider as documented in the spec. [0]
> >
> > Yes.
> >
> > > * Placement returns a number of these data structures as JSON blobs. Due to the size of the data, a page size will have to be determined, and placement will have to either maintain that list of structured data for subsequent requests, or re-run the query and only calculate the data structures for the hosts that fit in the requested page.
> >
> > "of these data structures as JSON blobs" is kind of redundant... all our REST APIs return data structures as JSON blobs.
> >
> > While we discussed the fact that there may be a lot of entries, we did not say we'd immediately support a paging mechanism.
>
> I believe we said in the initial version we'd have the configurable limit in the DB API queries, like we have today - the default limit is 1000. There was agreement to eventually build paging support into the API.
>
> This does make me wonder, though, what happens when you have 100K or more compute nodes reporting into placement and we limit to the first 1000. Aren't we going to be imposing a packing strategy then, just because of how we pull things out of the database for Placement? Although I don't see how that would be any different from before we had Placement, when the nova-scheduler service just did a ComputeNode.get_all() to the nova DB and then filtered/weighed those objects.
>
> > > * Scheduler continues to request the paged results until it has them all.
> >
> > See above. Was discussed briefly as a concern but not work to do for first patches.
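For anyone trying to picture the data structures being discussed here, a GET /allocation_candidates response as sketched in the spec (linked as [0] at the end of the quoted mail) looks roughly like the Python dict below. This is an illustration only: the exact field names are whatever the spec settles on, and every value here is invented.

    # Rough shape of a GET /allocation_candidates response (illustrative
    # only; field names follow the spec draft, all values invented).
    candidates = {
        "allocation_requests": [
            # One entry per way the requested resources could be satisfied.
            {
                "allocations": [
                    {
                        "resource_provider": {"uuid": "<compute-node-uuid>"},
                        "resources": {"VCPU": 1, "MEMORY_MB": 2048, "DISK_GB": 100},
                    },
                ],
            },
        ],
        "provider_summaries": {
            # Capacity/usage totals the scheduler can merge into its
            # HostState objects for weighing.
            "<compute-node-uuid>": {
                "resources": {
                    "VCPU": {"capacity": 64, "used": 10},
                    "MEMORY_MB": {"capacity": 131072, "used": 24576},
                    "DISK_GB": {"capacity": 2000, "used": 400},
                },
            },
        },
    }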
> > > * Scheduler then runs this data through the filters and weighers. No HostState objects are required, as the data structures will contain all the information that scheduler will need.
> >
> > No, this isn't correct. The scheduler will have *some* of the information it requires for weighing from the returned data from the GET /allocation_candidates call, but not all of it.
> >
> > Again, operators have insisted on keeping the flexibility currently in the Nova scheduler to weigh/sort compute nodes by things like thermal metrics and kinds of data that the Placement API will never be responsible for.
> >
> > The scheduler will need to merge information from the "provider_summaries" part of the HTTP response with information it already has in its HostState objects (gotten from ComputeNodeList.get_all_by_uuid() and AggregateMetadataList).
> >
> > > * Scheduler then selects the data structure at the top of the ranked list. Inside that structure is a dict of the allocation data that scheduler will need to claim the resources on the selected host. If the claim fails, the next data structure in the list is chosen, and this is repeated until a claim succeeds.
> >
> > Kind of, yes. The scheduler will select a *host* that meets its needs.
> >
> > There may be more than one allocation request that includes that host resource provider, because of shared providers and (soon) nested providers. The scheduler will choose one of these allocation requests and attempt a claim of resources by simply PUT /allocations/{instance_uuid} with the serialized body of that allocation request. If a 202 is returned, cool. If not, repeat for the next allocation request.
> >
> > > * Scheduler then creates a list of N of these data structures, with the first being the data for the selected host, and the rest being data structures representing alternates consisting of the next hosts in the ranked list that are in the same cell as the selected host.
> >
> > Yes, this is the proposed solution for allowing retries within a cell.
> >
> > > * Scheduler returns that list to conductor.
> > > * Conductor determines the cell of the selected host, and sends that list to the target cell.
> > > * Target cell tries to build the instance on the selected host. If it fails, it uses the allocation data in the data structure to unclaim the resources for the selected host, and tries to claim the resources for the next host in the list using its allocation data. It then tries to build the instance on the next host in the list of alternates. Only when all alternates fail does the build request fail.
> >
> > I'll let Dan discuss this last part.
> >
> > Best,
> > -jay
> >
> > > [0] https://review.openstack.org/#/c/471927/
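To put the claim/unclaim steps quoted above into a concrete (if very simplified) form, a claim loop along the lines Jay describes might look like the sketch below. This is not real Nova code: the endpoint URL is made up, and actual code would go through Nova's placement client with keystone auth and microversion headers rather than bare HTTP calls.

    # A minimal sketch of the claim-and-retry behavior described above,
    # using the plain `requests` library. Endpoint and helper names are
    # invented; auth and microversion handling are omitted.
    import requests

    PLACEMENT = 'http://placement.example.com'  # hypothetical endpoint


    def try_claim(instance_uuid, allocation_requests):
        """Attempt each allocation request in turn until one claim succeeds."""
        for alloc_req in allocation_requests:
            # "PUT /allocations/{instance_uuid} with the serialized body of
            # that allocation request" -- any 2xx means the claim succeeded.
            resp = requests.put(
                '%s/allocations/%s' % (PLACEMENT, instance_uuid),
                json=alloc_req)
            if resp.ok:
                return alloc_req
            # Claim failed (e.g. a concurrent claim consumed the inventory);
            # move on to the next candidate allocation request.
        return None


    def unclaim(instance_uuid):
        """Drop the instance's allocations, e.g. before retrying an alternate."""
        requests.delete('%s/allocations/%s' % (PLACEMENT, instance_uuid))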
I have a document (with a nifty activity diagram in tow) for all the above available here:

https://review.openstack.org/475810

Should be more Google'able than mailing list posts for future us :)

Stephen