Re: [openstack-dev] [nova] [placement] placement api request analysis
On Thu, 26 Jan 2017, Chris Dent wrote:

> On Wed, 25 Jan 2017, Chris Dent wrote:
>
> > #B3 The new GET to /placement/allocations is happening when the
> > resource tracker calls _update_usage_from_instance, which is always
> > being called because is_new_instance is always true in that method,
> > even when the instance is not "new". This is happening because the
> > tracked_instances dict is _always_ getting cleared before
> > _update_usage_from_instance is being called. Which is weird, because
> > it appears that it is that method's job to update tracked_instances.
> >
> > If I remove the clear() the GET on /placement/allocations goes away,
> > but I'm not sure what else this will break. That line was added a
> > long time ago, in this change (I think):
> >
> >     https://review.openstack.org/#/c/13182/
>
> I made a bug about this:
>
>     https://bugs.launchpad.net/nova/+bug/1659647
>
> and have the gate looking at what breaks if the clear goes away:
>
>     https://review.openstack.org/#/c/425885/

Nothing broke, but discussion in IRC[1] suggests that the clearing of tracked_instances is effectively a safety valve for those cases where events which are supposed to change the state of an instance somehow get lost or are incorrectly recorded. By flushing tracked_instances a more complete accounting is performed. This is something that ought to be fixed, but it will require more focused testing, so presumably it is a "later". We should figure it out, though, because it is responsible for much of the traffic related to checking allocations.

Meanwhile, the fix comparing old and new compute node objects[2] has merged. This removes 3 repeated requests (assuming no other changes) per periodic job.
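To make the "safety valve" behaviour concrete, here is a minimal sketch (not Nova's actual code; the class, method signatures, and the injected `placement_get` callable are illustrative) of why clearing tracked_instances before _update_usage_from_instance makes every instance look "new" and triggers a GET per instance every cycle:

```python
# Illustrative sketch of the tracked_instances clearing behaviour.
# Names are hypothetical, not Nova's real resource tracker.

class ResourceTracker:
    def __init__(self, placement_get):
        self.tracked_instances = {}          # uuid -> instance state
        self._placement_get = placement_get  # stand-in for GET /allocations/{uuid}

    def _update_usage_from_instances(self, instances, clear_first=True):
        requests = 0
        if clear_first:
            # The "safety valve": dropping all tracked state forces a full
            # re-audit, at the cost of one GET per instance every cycle.
            self.tracked_instances.clear()
        for inst in instances:
            is_new_instance = inst['uuid'] not in self.tracked_instances
            if is_new_instance:
                self._placement_get(inst['uuid'])  # GET allocations for consumer
                requests += 1
            self.tracked_instances[inst['uuid']] = inst
        return requests
```

With clear_first=True every periodic pass re-fetches allocations for every instance; without the clear, only genuinely new instances cost a request, which matches the observed traffic going away when the clear() is removed.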
That means the current calculation for number of requests per periodic job is:

The requests done via _init_compute_node:

    GET aggregates to update local aggregates map           1
    GET inventories to compare with current inventory       1

Calls from _update_usage_from_instances:

    remove_deleted_instances
        GET all the allocations for this resource provider  1
    _update_usage_from_instance
        GET allocations for consumer uuid                   1 per instance

That is, 3 + 1 per instance. We can change this by:

* adding more smarts in _init_compute_node, but this impacts both our
  concept of "self-healing" inventory and the ability to dynamically
  manage aggregate associations
* adding more smarts with how tracked_instances is cleared, or at least
  how the instances being tracked impact when or how often a get
  allocations for consumer uuid is called

[1] Conversation between melwitt, cfriesen, superdan, me: http://p.anticdent.org/3bbY
[2] https://review.openstack.org/#/c/424305/

-- 
Chris Dent ¯\_(ツ)_/¯ https://anticdent.org/ freenode: cdent tw: @anticdent
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
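The per-cycle arithmetic in the message above (3 fixed GETs plus one GET per tracked instance) can be sketched as a back-of-the-envelope helper; the function names and the 60s period default are illustrative, not anything in Nova:

```python
# Hypothetical helper modelling the request count described in the thread:
# 3 fixed GETs per periodic job, plus one GET per tracked instance.

def placement_requests_per_cycle(num_instances):
    """Placement requests per periodic job on one compute node."""
    init_compute_node = 2         # GET aggregates + GET inventories
    remove_deleted = 1            # GET allocations for the resource provider
    per_instance = num_instances  # GET allocations per consumer uuid
    return init_compute_node + remove_deleted + per_instance

def cluster_requests_per_minute(nodes, instances_per_node, period_s=60):
    """Aggregate rate across a cluster, assuming a uniform periodic job."""
    return nodes * placement_requests_per_cycle(instances_per_node) * (60 / period_s)
```

For example, 1000 empty compute nodes at a 60s period come to 3000 requests per minute after the #B2 fix, versus the 5000 per minute observed before it.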
Re: [openstack-dev] [nova] [placement] placement api request analysis
On Wed, 25 Jan 2017, Chris Dent wrote:

> #B3 The new GET to /placement/allocations is happening when the
> resource tracker calls _update_usage_from_instance, which is always
> being called because is_new_instance is always true in that method,
> even when the instance is not "new". This is happening because the
> tracked_instances dict is _always_ getting cleared before
> _update_usage_from_instance is being called. Which is weird, because
> it appears that it is that method's job to update tracked_instances.
>
> If I remove the clear() the GET on /placement/allocations goes away,
> but I'm not sure what else this will break. That line was added a
> long time ago, in this change (I think):
>
>     https://review.openstack.org/#/c/13182/

I made a bug about this:

    https://bugs.launchpad.net/nova/+bug/1659647

and have the gate looking at what breaks if the clear goes away:

    https://review.openstack.org/#/c/425885/
[openstack-dev] [nova] [placement] placement api request analysis
I've started looking into what kind of request load the placement API can expect when both the scheduler and the resource tracker are talking to it. I think this is important to do now, before we have things widely relying on this stuff, so we can give some reasonable advice on deployment options and expected traffic. I'm working with a single node devstack, which should make the math nice and easy.

Unfortunately, doing this really ended up being more of an audit of where the resource tracker is doing more than it ought to be. What follows ends up being a rambling exploration of areas that _may_ be wrong. I've marked paragraphs that have things that maybe ought to change with #B.

It appears that the resource tracker is doing a lot of extra work that it doesn't need to do (even before the advent of the placement API). There's already one fix in progress (for B2) but the others need some discussion, as I'm not sure of the ramifications. I'd like some help deciding what's going on before I make random bug reports.

Before Servers
==============

When the compute node starts it makes two requests to create the resource provider that represents that compute, at which point it also requests the aggregates for that resource provider, to update its local map of aggregate associations.

#B0 It then updates inventory for the resource provider, twice; the first one is a conflict (probably because the generation is out of whack[1]).

After that, every 60s or so, five requests are made:

    GET /placement/resource_providers/0e33c6f5-62f3-4522-8f95-39b364aa02b4/aggregates
    GET /placement/resource_providers/0e33c6f5-62f3-4522-8f95-39b364aa02b4/inventories
    GET /placement/resource_providers/0e33c6f5-62f3-4522-8f95-39b364aa02b4/allocations
    GET /placement/resource_providers/0e33c6f5-62f3-4522-8f95-39b364aa02b4/aggregates
    GET /placement/resource_providers/0e33c6f5-62f3-4522-8f95-39b364aa02b4/inventories

These requests are returning the same data each time (so far).
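The duplicate inventory update at #B0 is consistent with generation-based optimistic concurrency: placement rejects a PUT whose resource provider generation is stale, so the client refreshes and retries, producing two requests. A minimal sketch of that pattern, assuming hypothetical names throughout (this is not Nova's client code, and FakePlacement is a toy stand-in for the server):

```python
# Sketch of generation-based conflict-and-retry; all names are illustrative.

class ConflictError(Exception):
    """Stand-in for a 409 Conflict from PUT .../inventories."""

class FakePlacement:
    """Toy server whose resource provider generation has already advanced."""
    def __init__(self):
        self.generation = 1

    def put_inventories(self, inventory, generation):
        if generation != self.generation:
            raise ConflictError('resource provider generation out of date')
        self.inventory = inventory
        self.generation += 1  # every successful write bumps the generation

def update_inventory(client, inventory, cached_generation):
    """PUT the inventory, re-reading the generation once on conflict.

    Returns the number of PUTs issued (2 when the cached generation was
    stale, matching the double update observed at startup).
    """
    try:
        client.put_inventories(inventory, cached_generation)
        return 1
    except ConflictError:
        # Refresh the generation (here: read it directly) and retry once.
        client.put_inventories(inventory, client.generation)
        return 2
```

If the compute node starts with a stale cached generation, the first PUT conflicts and the retry succeeds, which would explain the pair of inventory updates seen at startup.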
The request to get aggregates happens twice on every cycle because it happens each time we ensure the resource provider is present in our local map of resource providers. Aggregates are checked each time because if we don't, there's no other clean way for an operator to associate aggregates and have them quickly picked up.

The request to inventories is checking if inventory has changed. This is happening as a result of the regular call to 'update_available_resource' passing through the _update method.

#B1 That same method is also calling _init_compute_node, which will _also_ think about updating the inventory and thus do the aggregates check from _ensure_resource_provider. That seems redundant. Perhaps we should only call update_resource_stats from _update and not from _init_compute_node, as they are both called from the same method in the resource tracker.

That same method also regularly calls '_update_usage_from_instances', which calls 'remove_deleted_instances' with a potentially empty list of instances[2]. That method gets the allocations for this compute node.

So before we've added any VMs we're at 5000 requests per minute in a 1000 node cluster.

#B2 Adding in the fix at https://review.openstack.org/#/c/424305/ reduces a lot of that churn by avoiding an update from _update when not necessary, reducing to three requests every 60s when there are no servers. The remaining requests are from the call to _init_compute_node at #B1 above.

Creating a Server
=================

When we create a server there are seven total requests, with these involved with the actual instance:

    GET /placement/resource_providers?resources=VCPU%3A1%2CMEMORY_MB%3A512%2CDISK_GB%3A1
    GET /placement/allocations/717b8dcc-110c-4914-b9c1-c04433267577
    PUT /placement/allocations/717b8dcc-110c-4914-b9c1-c04433267577

(Allocations are done by comparing with what's there, if anything.) The others are what _update does.
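The "comparing with what's there" step before the PUT can be sketched as follows; the function and its injected getters/putters are hypothetical, standing in for the GET and PUT against /placement/allocations/{consumer_uuid}:

```python
# Illustrative "compare before write" step for allocations; not Nova's code.

def ensure_allocations(get_allocs, put_allocs, consumer_uuid, desired):
    """PUT the desired allocations only if they differ from what placement has.

    Returns True when a PUT was issued, False when the existing allocations
    already matched and the write was skipped.
    """
    current = get_allocs(consumer_uuid)   # GET /allocations/{consumer_uuid}
    if current == desired:
        return False                      # nothing to do, no PUT issued
    put_allocs(consumer_uuid, desired)    # PUT /allocations/{consumer_uuid}
    return True
```

This matches the observed pair of requests at server create: one GET to read current allocations, then a PUT only when something actually needs to change.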
After that the three requests grow to four per 60s:

    GET /placement/resource_providers/8635a519-eac8-43b2-9bf0-aba848b328a7/aggregates
    GET /placement/resource_providers/8635a519-eac8-43b2-9bf0-aba848b328a7/inventories
    GET /placement/allocations/c4b73292-3731-4f25-b102-1bd176f4bd9b
    GET /placement/resource_providers/8635a519-eac8-43b2-9bf0-aba848b328a7/allocations

#B3 The new GET to /placement/allocations is happening when the resource tracker calls _update_usage_from_instance, which is always being called because is_new_instance is always true in that method, even when the instance is not "new". This is happening because the tracked_instances dict is _always_ getting cleared before _update_usage_from_instance is being called. Which is weird, because it appears that it is that method's job to update tracked_instances.

If I remove the clear() the GET on /placement/allocations goes away, but I'm not sure what else this will break. That line was added a long time ago, in this change (I think):

    https://review.openstack.org/#/c/13182/

With the clear() gone the calls in