There is now a blueprint [1] and draft spec [2]. Reviews welcomed.

[1] https://blueprints.launchpad.net/nova/+spec/reshape-provider-tree
[2] https://review.openstack.org/#/c/572583/
On 06/04/2018 06:00 PM, Eric Fried wrote:
> There has been much discussion. We've gotten to a point of an initial
> proposal and are ready for more (hopefully smaller, hopefully
> conclusive) discussion.
>
> To that end, there will be a HANGOUT tomorrow (TUESDAY, JUNE 5TH) at
> 1500 UTC. Be in #openstack-placement to get the link to join.
>
> The strawpeople outlined below and discussed in the referenced
> etherpad have been consolidated/distilled into a new etherpad [1]
> around which the hangout discussion will be centered.
>
> [1] https://etherpad.openstack.org/p/placement-making-the-(up)grade
>
> Thanks,
> efried
>
> On 06/01/2018 01:12 PM, Jay Pipes wrote:
>> On 05/31/2018 02:26 PM, Eric Fried wrote:
>>>> 1. Make everything perform the pivot on compute node start (which
>>>>    can be re-used by a CLI tool for the offline case)
>>>> 2. Make everything default to non-nested inventory at first, and
>>>>    provide a way to migrate a compute node and its instances one at
>>>>    a time (in place) to roll through.
>>>
>>> I agree that it sure would be nice to do ^ rather than requiring the
>>> "slide puzzle" thing.
>>>
>>> But how would this be accomplished, in light of the current
>>> "separation of responsibilities" drawn at the virt driver interface,
>>> whereby the virt driver isn't supposed to talk to placement directly,
>>> or know anything about allocations?
>>
>> FWIW, I don't have a problem with the virt driver "knowing about
>> allocations". What I have a problem with is the virt driver *claiming
>> resources for an instance*.
>>
>> That's what the whole placement-claims-resources thing was all about,
>> and I'm not interested in stepping back to the days of long racy claim
>> operations by having the compute nodes be responsible for claiming
>> resources.
>>
>> That said, once the consumer generation microversion lands [1], it
>> should be possible to *safely* modify an allocation set for a consumer
>> (instance) and move allocation records for an instance from one
>> provider to another.
>>
>> [1] https://review.openstack.org/#/c/565604/
>>
>>> Here's a first pass:
>>>
>>> The virt driver, via the return value from update_provider_tree,
>>> tells the resource tracker that "inventory of resource class A on
>>> provider B has moved to provider C" for all applicable AxBxC. E.g.
>>>
>>>   [ { 'from_resource_provider': <cn_rp_uuid>,
>>>       'moved_resources': [VGPU: 4],
>>>       'to_resource_provider': <gpu_rp1_uuid>
>>>     },
>>>     { 'from_resource_provider': <cn_rp_uuid>,
>>>       'moved_resources': [VGPU: 4],
>>>       'to_resource_provider': <gpu_rp2_uuid>
>>>     },
>>>     { 'from_resource_provider': <cn_rp_uuid>,
>>>       'moved_resources': [
>>>           SRIOV_NET_VF: 2,
>>>           NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND: 1000,
>>>           NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND: 1000,
>>>       ],
>>>       'to_resource_provider': <gpu_rp2_uuid>
>>>     }
>>>   ]
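For concreteness, that return value might look like the following as
actual Python data: resource class names as strings mapped to integer
amounts, provider UUIDs as strings. The UUIDs below are invented and
the exact shape is still just a strawman.

    # Illustrative only: a concrete rendering of the proposed
    # update_provider_tree return value.  UUIDs are made up.
    CN_RP_UUID = '6f69ab97-1c24-46ff-a6a3-0f0276713001'    # compute node RP
    GPU_RP1_UUID = '6f69ab97-1c24-46ff-a6a3-0f0276713002'  # first VGPU child
    GPU_RP2_UUID = '6f69ab97-1c24-46ff-a6a3-0f0276713003'  # second VGPU child

    moves = [
        {'from_resource_provider': CN_RP_UUID,
         'moved_resources': {'VGPU': 4},
         'to_resource_provider': GPU_RP1_UUID},
        {'from_resource_provider': CN_RP_UUID,
         'moved_resources': {'VGPU': 4},
         'to_resource_provider': GPU_RP2_UUID},
        {'from_resource_provider': CN_RP_UUID,
         'moved_resources': {
             'SRIOV_NET_VF': 2,
             'NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND': 1000,
             'NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND': 1000,
         },
         'to_resource_provider': GPU_RP2_UUID},
    ]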
>>> As today, the resource tracker takes the updated provider tree and
>>> invokes [1] the report client method update_from_provider_tree [2]
>>> to flush the changes to placement. But now update_from_provider_tree
>>> also accepts the return value from update_provider_tree and, for
>>> each "move":
>>>
>>> - Creates provider C (as described in the provider_tree) if it
>>>   doesn't already exist.
>>> - Creates/updates provider C's inventory as described in the
>>>   provider_tree (without yet updating provider B's inventory). This
>>>   ought to create the inventory of resource class A on provider C.
>>
>> Unfortunately, right here you'll introduce a race condition. As soon
>> as this operation completes, the scheduler will have the ability to
>> throw new instances on provider C and consume the inventory from it
>> that you intend to give to the existing instance that is consuming
>> from provider B.
>>
>>> - Discovers allocations of rc A on rp B and POSTs to move them to
>>>   rp C*.
>>
>> For each consumer of resources on rp B, right?
>>
>>> - Updates provider B's inventory.
>>
>> Again, this is problematic because the scheduler will have already
>> begun to place new instances on B's inventory, which could very well
>> result in incorrect resource accounting on the node.
>>
>> We basically need to have one giant new REST API call that accepts
>> the list of "move instructions" and performs all of the instructions
>> in a single transaction. :(
>>
>>> (*There's a hole here: if we're splitting a glommed-together
>>> inventory across multiple new child providers, as with the VGPUs in
>>> the example, we don't know which allocations to put where. The virt
>>> driver should know which instances own which specific inventory
>>> units, and would be able to report that info within the data
>>> structure. That's getting kinda close to the virt driver mucking
>>> with allocations, but maybe it fits well enough into this model to
>>> be acceptable?)
>>
>> Well, it's not really the virt driver *itself* mucking with the
>> allocations. It's more that the virt driver is telling something
>> *else* the move instructions that it feels are needed...
>>
>>> Note that the return value from update_provider_tree is optional,
>>> and only used when the virt driver is indicating a "move" of this
>>> ilk. If it's None/[] then the RT/update_from_provider_tree flow is
>>> the same as it is today.
>>>
>>> If we can do it this way, we don't need a migration tool. In fact,
>>> we don't even need to restrict provider tree "reshaping" to release
>>> boundaries. As long as the virt driver understands its own data
>>> model migrations and reports them properly via update_provider_tree,
>>> it can shuffle its tree around whenever it wants.
>>
>> Due to the many race conditions we would have in trying to fudge
>> inventory amounts (the reserved/total thing) and allocation movement
>> for >1 consumer at a time, I'm pretty sure the only safe thing to do
>> is have a single new HTTP endpoint that would take this list of move
>> operations and perform them atomically (on the placement server side,
>> of course).
>>
>> Here's a strawman for how that HTTP endpoint might look:
>>
>> https://etherpad.openstack.org/p/placement-migrate-operations
>>
>> Feel free to mark up and destroy.
>>
>> Best,
>> -jay
>>
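To spell out the ordering concern in one place, here's a bare-bones
sketch of the per-move flow from the first pass above, with the window
the scheduler can race into marked. The `placement` object and its
methods below are stand-ins, not real report client calls.

    def apply_move_nonatomic(move, new_inventories, placement):
        """Apply one "move" in the order the first-pass proposal describes.

        new_inventories maps provider UUID -> the inventory that provider
        should end up with, as computed from the updated provider tree.
        placement is a stand-in for the report client; none of these
        method names are real.  The point is only to show the ordering.
        """
        src = move['from_resource_provider']
        dst = move['to_resource_provider']

        # Step 1: create the destination provider if it doesn't exist yet.
        placement.ensure_provider(dst)

        # Step 2: write the destination provider's new inventory, without
        # touching the source provider's inventory yet.
        placement.set_inventory(dst, new_inventories[dst])

        # RACE WINDOW: from here until step 4 finishes, the scheduler can
        # claim the fresh inventory on dst that is really spoken for by
        # instances still allocated against src, and can keep claiming
        # against the not-yet-shrunk inventory on src.

        # Step 3: move existing allocations of the affected resource
        # classes from src to dst, one consumer at a time.
        for consumer_uuid, resources in placement.allocations_for(
                src, move['moved_resources']):
            placement.move_allocation(consumer_uuid, src, dst, resources)

        # Step 4: finally shrink the source provider's inventory.
        placement.set_inventory(src, new_inventories[src])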
>>> Thoughts?
>>>
>>> -efried
>>>
>>> [1] https://github.com/openstack/nova/blob/8753c9a38667f984d385b4783c3c2fc34d7e8e1b/nova/compute/resource_tracker.py#L890
>>> [2] https://github.com/openstack/nova/blob/8753c9a38667f984d385b4783c3c2fc34d7e8e1b/nova/scheduler/client/report.py#L1341
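For concreteness, the body of the single atomic call Jay describes
might carry the complete replacement inventories for every provider
touched by the reshape, plus the complete replacement allocations for
every affected consumer, to be applied (or rejected) as one
transaction. Something roughly like the sketch below; the field names
are invented here, and the real shape is what the etherpad strawman
and the draft reshape spec linked at the top are meant to settle.

    # Illustrative only: a possible request body for an atomic "reshape"
    # operation.  Field names are invented for this sketch.  Generations
    # are included so placement can reject the request if anything
    # changed underneath us (cf. the consumer generation work above).
    CN_RP_UUID = '6f69ab97-1c24-46ff-a6a3-0f0276713001'
    GPU_RP1_UUID = '6f69ab97-1c24-46ff-a6a3-0f0276713002'
    GPU_RP2_UUID = '6f69ab97-1c24-46ff-a6a3-0f0276713003'
    INSTANCE_UUID = '8a1df385-52b2-4fbb-946f-f4d5e37d6e44'

    reshape_body = {
        'inventories': {
            CN_RP_UUID: {
                'resource_provider_generation': 42,
                'inventories': {
                    'VCPU': {'total': 16},
                    'MEMORY_MB': {'total': 65536},
                    # VGPU inventory is gone from the root provider here.
                },
            },
            GPU_RP1_UUID: {
                'resource_provider_generation': None,  # brand-new provider
                'inventories': {'VGPU': {'total': 4}},
            },
            GPU_RP2_UUID: {
                'resource_provider_generation': None,
                'inventories': {'VGPU': {'total': 4}},
            },
        },
        'allocations': {
            INSTANCE_UUID: {
                'consumer_generation': 1,
                'allocations': {
                    CN_RP_UUID: {
                        'resources': {'VCPU': 2, 'MEMORY_MB': 4096},
                    },
                    GPU_RP1_UUID: {'resources': {'VGPU': 1}},
                },
            },
        },
    }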
