All-

Based on a (long) discussion yesterday [1] I have put up a patch [2] whereby you can set [compute]resource_provider_association_refresh to zero and the resource tracker will never* refresh the report client's provider cache. Philosophically, we're removing the "healing" aspect of the resource tracker's periodic and trusting that placement won't diverge from whatever's in our cache. (If it does, it's because the op hit the CLI, in which case they should SIGHUP - see below.)
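For concreteness, here is roughly what that would look like in nova.conf once the patch lands (the option itself already exists; treating 0 as "never refresh" is what [2] adds):

    [compute]
    # 0 = never refresh the report client's provider cache
    # (the default refresh interval is 300 seconds)
    resource_provider_association_refresh = 0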
*except:

- When we initially create the compute node record and bootstrap its resource provider.
- When the virt driver's update_provider_tree makes changes, update_from_provider_tree reflects them in the cache as well as pushing them back to placement.
- If update_from_provider_tree fails, the cache is cleared and gets rebuilt on the next periodic.
- If you send SIGHUP to the compute process, the cache is cleared. (A sample command is at the end of this mail.)

This should dramatically reduce the number of calls to placement from the compute service - to nearly zero, unless something is actually changing.

Can I get some initial feedback as to whether this is worth polishing up into something real? (It will probably need a bp/spec if so.)

[1] http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-11-01.log.html#t2018-11-01T17:32:03
[2] https://review.openstack.org/#/c/614886/

==========
Background
==========

In the Queens release, our friends at CERN noticed a serious spike in the number of requests to placement from compute nodes, even in a stable-state cloud. Given that we were in the process of adding a ton of infrastructure to support sharing and nested providers, this was not unexpected. Roughly, what was previously:

    @periodic_task:
        GET /resource_providers/$compute_uuid
        GET /resource_providers/$compute_uuid/inventories

became more like:

    @periodic_task:
        # In Queens/Rocky, this would still just return the compute RP
        GET /resource_providers?in_tree=$compute_uuid
        # In Queens/Rocky, this would return nothing
        GET /resource_providers?member_of=...&required=MISC_SHARES...
        for each provider returned above:  # i.e. just one in Q/R
            GET /resource_providers/$compute_uuid/inventories
            GET /resource_providers/$compute_uuid/traits
            GET /resource_providers/$compute_uuid/aggregates

That's two GETs per periodic growing to five (and more once nested and sharing providers actually show up), multiplied across every compute node. In a cloud the size of CERN's, the load wasn't acceptable. But at the time, CERN worked around the problem by disabling refreshing entirely. (The fact that this seems to have worked for them is an encouraging sign for the proposed code change.)

We're not actually making use of most of that information yet, but it sets the stage for things we're working on in Stein and beyond (multiple VGPU types, bandwidth resource providers, accelerators, NUMA, etc.), so removing or reducing the amount of information we look at isn't really an option strategically.
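As an aside, for the "op hit the CLI" case mentioned up top: clearing the cache on a node is just a HUP to the compute process. A sketch, assuming a typical deployment where the process name contains nova-compute; if your init system manages the service, use its equivalent instead (e.g. systemctl kill -s HUP on the nova-compute unit):

    # send SIGHUP to the nova-compute process to clear the provider cache
    pkill -HUP -f nova-compute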