Re: [openstack-dev] [nova] Update on scheduler and resource tracker progress

Ryan Rossiter Fri, 12 Feb 2016 08:12:06 -0800

> On Feb 11, 2016, at 2:24 PM, Jay Pipes <[email protected]> wrote:
> 
> Hello all,
> 
> Performance working group, please pay attention to Chapter 2 in the details 
> section.
> 
> tl;dr
> -----
> 
> At the Nova mid-cycle, we finalized decisions on a way forward in redesigning 
> the way that resources are tracked in Nova. This work is a major undertaking 
> and has implications for splitting out the scheduler from Nova, for the 
> ability of the placement engine to scale, and for removing long-standing 
> reporting and race condition bugs that have plagued Nova for years.
> 
> The following blueprint specifications outline the effort, which we are 
> calling the "resource providers framework":
> 
> * resource-classes (bp MERGED, code MERGED)
> * pci-generate-stats (bp MERGED, code IN REVIEW)
> * resource-providers (bp MERGED, code IN REVIEW)
> * generic-resource-pools (bp IN REVIEW, code TODO)
> * compute-node-inventory (bp IN REVIEW, code TODO)
> * resource-providers-allocations (bp IN REVIEW, code TODO)
> * resource-providers-scheduler (bp IN REVIEW, code TODO)
> 
> The group working on this code and doing the reviews is hopeful that the 
> generic-resource-pools work can be completed in Mitaka. We will also aim to 
> get the compute-node-inventory work done in Mitaka, though that will be more 
> of a stretch.
> 
> The remainder of the resource providers framework blueprints will be targeted 
> to Newton. The resource-providers-scheduler blueprint is the final blueprint 
> required before the scheduler can be fully separated from Nova.
> 
> details
> -------
> 
> Chapter 1 - How the blueprints fit together
> ===========================================
> 
> A request to launch an instance in Nova involves requests for two different 
> things: *resources* and *capabilities*. Resources are the quantitative part 
> of the request spec. Capabilities are the qualitative part of the request.
> 
> The *resource providers framework* is a set of 7 blueprints that reorganize 
> the way that Nova handles the quantitative side of the equation. These 7 
> blueprints are described below.
> 
> Compute nodes are a type of *resource provider*, since they allow instances 
> to *consume* some portion of their *inventory* of various types of 
> resources. We call these types of resources *"resource classes"*.
> 
> resource-classes bp: https://review.openstack.org/256297
> 
> The resource-providers blueprint introduces a new set of tables for storing 
> capacity and usage amounts of all resources in the system:
> 
> resource-providers bp: https://review.openstack.org/225546
> 
> While all compute nodes are resource providers [1], not all resource 
> providers are compute nodes. *Generic resource pools* are resource providers 
> that have an inventory of a *single resource class* and that provide that 
> resource class to consumers that are placed on multiple compute nodes.
> 
> The canonical example of a generic resource pool is a shared storage system. 
> Currently, a Nova compute node doesn't really know whether the storage 
> location it uses for storing disk images is a shared drive/cluster (à la NFS 
> or RBD) or if the storage location is a local disk drive [2]. The 
> generic-resource-pools blueprint covers the addition of these generic 
> resource pools, their relation to host aggregates, and the RESTful API [3] 
> added to control this external resource pool information.
> 
> generic-resource-pools bp: https://review.openstack.org/253187
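
To make the shared-storage problem concrete, here is a minimal sketch of the
over-reporting described in footnote [2] and of how counting each pool once
avoids it. The node list, the "storage_id" key and the sizes are made up for
illustration; none of this is Nova's actual data model.

```python
# Hypothetical per-node disk reports. "storage_id" is an assumed identifier
# for the backing storage location, not a real Nova field.
compute_nodes = [
    {"host": "node1", "storage_id": "nfs-share-1", "disk_gb": 10000},
    {"host": "node2", "storage_id": "nfs-share-1", "disk_gb": 10000},
    {"host": "node3", "storage_id": "nfs-share-1", "disk_gb": 10000},
    {"host": "node4", "storage_id": "local:node4", "disk_gb": 500},
]

# Naive view: every compute node reports the shared pool as its own disk,
# so the shared capacity is counted once per node that mounts it.
naive_total = sum(node["disk_gb"] for node in compute_nodes)

# Pool view: treat each distinct storage location as one resource provider
# and count its capacity exactly once.
pools = {node["storage_id"]: node["disk_gb"] for node in compute_nodes}
pooled_total = sum(pools.values())

print(naive_total, pooled_total)
```

With three nodes on the same share, the naive total counts the shared
10000 GB three times, while the pool view counts it once.
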
> 
> Within the Nova database schemas [4], capacity and inventory information is 
> stored in a variety of tables, columns and formats. vCPU, RAM and DISK 
> capacity information is stored in integer fields, PCI capacity information is 
> stored in the pci_devices table, NUMA inventory is stored combined together 
> with usage information in a JSON blob, etc. The compute-node-inventory 
> blueprint migrates all of the disparate capacity information from 
> compute_nodes into the new inventory table.
> 
> compute-node-inventory bp: https://review.openstack.org/260048
> 
> For the PCI resource classes, Nova currently has an entirely different 
> resource tracker (in /nova/pci/*) that stores an aggregate view of the PCI 
> resources (grouped by product, vendor, and numa node) in the 
> compute_nodes.pci_stats field. This information is entirely redundant, since 
> all fine-grained PCI resource information is already stored in the 
> pci_devices table. Storing the summary separately creates a sync problem: 
> the two copies can drift apart. The pci-generate-stats blueprint describes 
> the effort to remove this stored summary of device pool information and 
> instead generate it on the fly for the scheduler. This work is a 
> prerequisite to having all resource classes managed in a unified manner in 
> Nova:
> 
> pci-generate-stats bp: https://review.openstack.org/240852
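
As a rough illustration of "generate the summary on the fly", the sketch
below derives per-pool counts from fine-grained device rows instead of
storing them. The rows and their keys (vendor_id, product_id, numa_node,
status) are assumptions made for this example and are not taken from the
real pci_devices schema.

```python
from collections import Counter

# Hypothetical rows standing in for the pci_devices table; the keys are
# illustrative assumptions, not Nova's actual column names.
pci_devices = [
    {"vendor_id": "8086", "product_id": "10ed", "numa_node": 0, "status": "available"},
    {"vendor_id": "8086", "product_id": "10ed", "numa_node": 0, "status": "allocated"},
    {"vendor_id": "8086", "product_id": "10ed", "numa_node": 1, "status": "available"},
    {"vendor_id": "10de", "product_id": "13f2", "numa_node": 1, "status": "available"},
]


def generate_pci_stats(devices):
    """Count available devices per (vendor, product, NUMA node) pool."""
    return Counter(
        (dev["vendor_id"], dev["product_id"], dev["numa_node"])
        for dev in devices
        if dev["status"] == "available"
    )


stats = generate_pci_stats(pci_devices)
for pool, count in sorted(stats.items()):
    print(pool, count)
```

Because the summary is recomputed from the device rows each time, there is no
second copy that can fall out of sync.
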
> 
> In the same way that capacity fields are scattered among different tables, 
> columns and formats, so too are the fields that store usage information. Some 
> fields are in the instances table, some in the instance_extra table, some 
> information is derived from the pci_devices table, other bits from a JSON 
> blob field. In short, it's an inconsistent mess. This mess means that 
> supporting additional types of resources typically involves adding yet more 
> inconsistency and conditional logic into the scheduler and 
> nova-compute's resource tracker. The resource-providers-allocations blueprint 
> involves work to migrate all usage record information out of the disparate 
> fields in the current schema and into the allocations table introduced in the 
> resource-providers blueprint:
> 
> resource-providers-allocations bp: https://review.openstack.org/271779
> 
> Once all of the inventory (capacity) and allocation (usage) information has 
> been migrated to the database schema described in the resource-providers 
> blueprint, Nova will be treating all types of resources in a generic fashion. 
> The next step is to modify the scheduler to take advantage of this new 
> resource representation. The resource-providers-scheduler blueprint 
> undertakes this important step:
> 
> resource-providers-scheduler bp: https://review.openstack.org/271823
> 
> Chapter 2 - Addressing performance and scale
> ============================================
> 
> One of the significant performance problems with the Nova scheduler is the 
> fact that for every call to the select_destinations() RPC API method -- which 
> itself is called at least once every time a launch or migration request is 
> made -- the scheduler grabs all records for all compute nodes in the 
> deployment. After retrieving all these compute node records, the scheduler 
> runs each through a set of filters to determine which compute nodes have the 
> required capacity to service the instance's requested resources. Having the 
> scheduler continually retrieve every compute node record on each request to 
> select_destinations() is extremely inefficient. The greater the number of 
> compute nodes, the bigger the performance and scale problem this becomes.
> 
> On a loaded cloud deployment -- say there are 1000 compute nodes and 900 of 
> them are fully loaded with active virtual machines -- the scheduler is still 
> going to retrieve all 1000 compute node records on every request to 
> select_destinations() and process each one of those records through all 
> scheduler filters. Clearly, if we could reduce the number of compute node 
> records that are returned by excluding those nodes that do not have available 
> capacity, we could dramatically reduce the amount of work that each call to 
> select_destinations() would need to perform.
> 
> The resource-providers-scheduler blueprint attempts to address the above 
> problem. A number of the scheduler filters currently run *after* the 
> database has returned all compute node records; the blueprint replaces them 
> with a series of WHERE clauses and join conditions on the database query. The idea 
> here is to winnow the number of returned compute node results as much as 
> possible. The fewer records the scheduler must post-process, the faster the 
> performance of each individual call to select_destinations().
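
A minimal sketch of what pushing the capacity check into the query could look
like, using an in-memory SQLite database. The table layout (inventories,
allocations) and all column names are simplified assumptions for this
example, not the schema from the resource-providers spec, and
providers_with_capacity() is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inventories (provider TEXT, resource_class TEXT, total INTEGER);
CREATE TABLE allocations (provider TEXT, resource_class TEXT,
                          consumer TEXT, used INTEGER);
""")
conn.executemany("INSERT INTO inventories VALUES (?, ?, ?)", [
    ("node1", "VCPU", 8), ("node1", "MEMORY_MB", 16384),
    ("node2", "VCPU", 8), ("node2", "MEMORY_MB", 16384),
    ("node3", "VCPU", 8), ("node3", "MEMORY_MB", 16384),
])
conn.executemany("INSERT INTO allocations VALUES (?, ?, ?, ?)", [
    ("node1", "VCPU", "vm-a", 8), ("node1", "MEMORY_MB", "vm-a", 16384),
    ("node2", "VCPU", "vm-b", 2), ("node2", "MEMORY_MB", "vm-b", 4096),
])


def providers_with_capacity(conn, requested):
    """Return providers with room for every requested resource class.

    Assumes one inventory row per (provider, resource_class).
    """
    if not requested:
        raise ValueError("requested must name at least one resource class")
    # One predicate per requested class; a provider qualifies only if all
    # of its matching inventory rows pass (checked by the HAVING clause).
    predicate = " OR ".join(
        ["(i.resource_class = ? AND i.total - COALESCE(a.used, 0) >= ?)"]
        * len(requested)
    )
    params = [value for item in requested.items() for value in item]
    sql = f"""
        SELECT i.provider
        FROM inventories AS i
        LEFT JOIN (SELECT provider, resource_class, SUM(used) AS used
                   FROM allocations
                   GROUP BY provider, resource_class) AS a
          ON a.provider = i.provider AND a.resource_class = i.resource_class
        WHERE {predicate}
        GROUP BY i.provider
        HAVING COUNT(*) = ?
        ORDER BY i.provider
    """
    return [row[0] for row in conn.execute(sql, params + [len(requested)])]


hosts = providers_with_capacity(conn, {"VCPU": 4, "MEMORY_MB": 8192})
print(hosts)
```

Here node1 is fully allocated, so it never leaves the database; only the
providers that can actually fit the request are handed to any remaining
filters.
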
> 
> The second major scale problem with the current Nova scheduler design has to 
> do with the fact that the scheduler does *not* actually claim resources on a 
> provider. Instead, the scheduler selects a destination host to place the 
> instance on and the Nova conductor then sends a message to that target host 
> which attempts to spawn the instance on its hypervisor. If the spawn 
> succeeds, the target compute host updates the Nova database and decrements 
> its count of available resources. These steps (from nova-scheduler to 
> nova-conductor to nova-compute to database) all take some not insignificant 
> amount of time. During this time window, a different scheduler process may 
> pick the exact same target host for a like-sized launch request. If there is 
> only room on the target host for one of those size requests [5], one of those 
> spawn requests will fail and trigger a retry operation. This retry operation 
> will attempt to repeat the scheduler placement decisions (by calling 
> select_destinations()).
> 
> This retry operation is relatively expensive and needlessly so: if the 
> scheduler claimed the resources on the target host before sending its pick 
> back to the conductor, then the chances of producing a retry would be almost 
> eliminated [6]. The resource-providers-scheduler blueprint attempts to remedy 
> this second scaling design problem by having the scheduler write records to 
> the allocations table before sending the selected target host back to the 
> Nova conductor.
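
One way to picture "claim before returning the pick" is a conditional insert
into the allocations table that only succeeds while the provider still has
room. The sketch below uses SQLite and the same simplified, assumed table
layout as above; claim() is a hypothetical helper, not the spec's
implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inventories (provider TEXT, resource_class TEXT, total INTEGER);
CREATE TABLE allocations (provider TEXT, resource_class TEXT,
                          consumer TEXT, used INTEGER);
INSERT INTO inventories VALUES ('node1', 'VCPU', 4);
""")


def claim(conn, provider, consumer, resource_class, amount):
    """Write an allocation only if enough capacity remains; return success."""
    with conn:  # single transaction: commits on success, rolls back on error
        cur = conn.execute(
            """
            INSERT INTO allocations (provider, resource_class, consumer, used)
            SELECT ?, ?, ?, ?
            WHERE (SELECT total FROM inventories
                   WHERE provider = ? AND resource_class = ?)
                - (SELECT COALESCE(SUM(used), 0) FROM allocations
                   WHERE provider = ? AND resource_class = ?)
                >= ?
            """,
            (provider, resource_class, consumer, amount,
             provider, resource_class,
             provider, resource_class,
             amount),
        )
        # The capacity check and the insert are one statement, so a second
        # like-sized request sees the first claim and inserts nothing.
        return cur.rowcount == 1


first = claim(conn, "node1", "vm-a", "VCPU", 3)
second = claim(conn, "node1", "vm-b", "VCPU", 3)
print(first, second)
```

The second request loses at claim time, inside the scheduler, rather than
after a failed spawn on the compute host, so it can move on to another host
without the full retry path.
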
> 
> Conclusions
> ===========
> 
> Thanks if you've made it this far in this epic email. :) If you have 
> questions about the plans, please do feel free to respond here or come find 
> us on Freenode #openstack-nova IRC. Your reviews and comments are also very 
> welcome on the specs and patches.
> 
> Best,
> -jay
> 
> [1] One might argue that nova-compute daemons that proxy for some other 
> resource manager like vCenter or Ironic are not actually resource providers, 
> but just go with me on this one...
> 
> [2] This results in a number of resource reporting bugs, including Nova 
> reporting that the deployment has X times as much disk capacity as it really 
> does (where X is the number of compute nodes sharing the same storage 
> location).
> 
> [3] The RESTful API in the generic-resource-pools blueprint actually will be 
> a completely new REST endpoint and service (/placement) that will be the 
> start of the new extracted schd
> 
> [4] Nova has two database schemas. The first is what is known as the Child 
> Cell database and contains the majority of database tables. The second is 
> known as the API database and contains global and top-level routing tables.
> 
> [5] This situation is more common than you might originally think. Any cloud 
> that runs a pack-first placement strategy with multiple scheduler daemon 
> processes will suffer from this problem.
> 
> [6] Technically, it cannot be eliminated because an out-of-band operation 
> could theoretically occur (for example, an administrator could manually -- 
> not through Nova -- launch a virtual machine on the target host) and 
> therefore introduce some unaccounted-for amount of used resources for a small 
> window of time between the periodic runs of nova-compute's audit task.
> 
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: [email protected]?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>


Seeing object changes come in for this has made me feel like I should be 
helping out with the review load. But without being at the midcycle, it felt 
like I didn’t know what was going on with these because “you had to be there”. 
This summary helps me follow the “why” behind these changes, and the 
well-structured explanation helped me figure out the ordering and purpose of 
the blob of specs that went in. Though I’m guessing I’ll still have a bunch of 
questions on this stuff when I’m reviewing it, I at least know more than I did 
before. Thanks a bunch for this, Jay!

-----
Thanks,

Ryan Rossiter (rlrossit)

