Re: [openstack-dev] [nova] [placement] placement api request analysis

2017-01-30 Thread Chris Dent

On Thu, 26 Jan 2017, Chris Dent wrote:

On Wed, 25 Jan 2017, Chris Dent wrote:

#B3
The new GET to /placement/allocations is happening when the
resource tracker calls _update_usage_from_instance, which is always
being called because is_new_instance is always true in that method,
even when the instance is not "new". This is happening because the
tracked_instances dict is _always_ getting cleared before
_update_usage_from_instance is called. Which is weird, because
it appears that it is that method's job to update tracked_instances.
If I remove the clear(), the GET on /placement/allocations goes away.
But I'm not sure what else this will break. The addition of that line
was a long time ago, in this change (I think):
https://review.openstack.org/#/c/13182/


I made a bug about this:

   https://bugs.launchpad.net/nova/+bug/1659647

and have the gate looking at what breaks if the clear goes away:

   https://review.openstack.org/#/c/425885/


Nothing broke, but discussion in IRC[1] suggests that the clearing
of tracked_instances is effectively a safety valve for those cases
where events which are supposed to change the state of an instance
somehow get lost or incorrectly recorded. By flushing
tracked_instances, a more complete accounting is performed.

This is something that ought to be fixed, but it will require more
focused testing, so it is presumably a "later". We should figure it
out, though, because it is responsible for much of the traffic
related to checking allocations.
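
To make the mechanism concrete, here's a stripped-down sketch (not
Nova's actual code; the placement call is a made-up stand-in) of why
the clear() makes every instance look "new" on every periodic run:

    class MiniTracker:
        def __init__(self, placement_client):
            self.placement = placement_client
            self.tracked_instances = {}  # uuid -> instance, built as we go

        def update_available_resource(self, instances):
            # The suspect line: flushing the dict means every instance
            # below looks "new" again, even if nothing changed.
            self.tracked_instances.clear()
            for inst in instances:
                self._update_usage_from_instance(inst)

        def _update_usage_from_instance(self, inst):
            is_new_instance = inst['uuid'] not in self.tracked_instances
            self.tracked_instances[inst['uuid']] = inst
            if is_new_instance:
                # With the clear() above, this branch (and its
                # GET /allocations/{consumer_uuid}) runs every time.
                self.placement.get_allocations_for_consumer(inst['uuid'])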

Meanwhile, the fix to comparing old and new compute node objects[2]
has merged. This removes 3 repeated requests (assuming no other
changes) per periodic job.

That means the current calculation for number of requests per
periodic job is:

  The requests done via _init_compute_node:
      GET aggregates to update local aggregates map           1
      GET inventories to compare with current inventory       1
  Calls from _update_usage_from_instances:
      remove_deleted_instances
          GET all the allocations for this resource provider  1
      _update_usage_from_instance
          GET allocations for consumer uuid                    1 per instance

3 + 1 per instance.
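
Rough arithmetic for the above, as a sketch (the node and instance
counts are made-up numbers, not measurements):

    def requests_per_periodic(num_instances):
        # 2 GETs from _init_compute_node + 1 GET from
        # remove_deleted_instances + 1 GET per instance
        return 3 + num_instances

    # e.g. 1000 nodes averaging 10 instances each, one periodic per
    # node per minute:
    print(1000 * requests_per_periodic(10))  # 13000 requests/minute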

We can change this by:

* adding more smarts in _init_compute_node, but this impacts both
  our concept of "self-healing" inventory and the ability to
  dynamically manage aggregate associations

* adding more smarts to how tracked_instances is cleared, or at
  least to how the set of tracked instances affects when or how often
  a GET of allocations for a consumer uuid is made

[1] Conversation between melwitt, cfriesen, superdan, me:
http://p.anticdent.org/3bbY

[2] https://review.openstack.org/#/c/424305/

--
Chris Dent ¯\_(ツ)_/¯   https://anticdent.org/
freenode: cdent tw: @anticdent


Re: [openstack-dev] [nova] [placement] placement api request analysis

2017-01-26 Thread Chris Dent

On Wed, 25 Jan 2017, Chris Dent wrote:


#B3
The new GET to /placement/allocations is happening when the
resource tracker calls _update_usage_from_instance, which is always
being called because is_new_instance is always true in that method,
even when the instance is not "new". This is happening because the
tracked_instances dict is _always_ getting cleared before
_update_usage_from_instance is called. Which is weird, because
it appears that it is that method's job to update tracked_instances.
If I remove the clear(), the GET on /placement/allocations goes away.
But I'm not sure what else this will break. The addition of that line
was a long time ago, in this change (I think):
https://review.openstack.org/#/c/13182/


I made a bug about this:

https://bugs.launchpad.net/nova/+bug/1659647

and have the gate looking at what breaks if the clear goes away:

https://review.openstack.org/#/c/425885/

--
Chris Dent ¯\_(ツ)_/¯   https://anticdent.org/
freenode: cdent tw: @anticdent


[openstack-dev] [nova] [placement] placement api request analysis

2017-01-25 Thread Chris Dent


I've started looking into what kind of request load the placement
API can expect when both the scheduler and the resource tracker are
talking to it. I think this is important to do now before we have
things widely relying on this stuff so we can give some reasonable
advice on deployment options and expected traffic.

I'm working with a single node devstack, which should make the math
nice and easy.

Unfortunately, doing this really ended up being more of an audit of
where the resource tracker is doing more than it ought to. What
follows ends up being a rambling exploration of areas that _may_ be
wrong.

I've marked paragraphs that have things that maybe ought to change
with #B markers (#B0 through #B3). It appears that the resource
tracker is doing a lot of extra work that it doesn't need to do (even
before the advent of the placement API). There's already one fix in
progress (for B2), but the others need some discussion as I'm not
sure of the ramifications. I'd like some help deciding what's going
on before I make random bug reports.

Before Servers
==============

When the compute node starts it makes two requests to create the
resource provider that represents that compute, at which point it
also requests the aggregates for that resource provider, to update
its local map of aggregate associations.

#B0
It then updates inventory for the resource provider, twice; the
first attempt is a conflict (probably because the generation is out
of whack[1]).
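
For reference, this is roughly the optimistic-concurrency dance that
conflict points at, sketched with a plain HTTP client; the endpoint,
session handling and retry count are illustrative, not Nova's report
client:

    import requests

    PLACEMENT = 'http://placement.example.com/placement'  # assumed endpoint
    RP = '0e33c6f5-62f3-4522-8f95-39b364aa02b4'

    def put_inventories(session, inventories):
        resp = None
        for _ in range(2):  # allow one retry on a generation conflict
            current = session.get(
                f'{PLACEMENT}/resource_providers/{RP}/inventories').json()
            body = {
                'resource_provider_generation':
                    current['resource_provider_generation'],
                'inventories': inventories,
            }
            resp = session.put(
                f'{PLACEMENT}/resource_providers/{RP}/inventories', json=body)
            if resp.status_code != 409:  # 409 == our generation was stale
                return resp
        return resp

    # usage: put_inventories(requests.Session(),
    #                        {'VCPU': {'total': 8},
    #                         'MEMORY_MB': {'total': 8192}})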

After that every 60s or so, five requests are made:

GET /placement/resource_providers/0e33c6f5-62f3-4522-8f95-39b364aa02b4/aggregates
GET /placement/resource_providers/0e33c6f5-62f3-4522-8f95-39b364aa02b4/inventories
GET /placement/resource_providers/0e33c6f5-62f3-4522-8f95-39b364aa02b4/allocations
GET /placement/resource_providers/0e33c6f5-62f3-4522-8f95-39b364aa02b4/aggregates
GET /placement/resource_providers/0e33c6f5-62f3-4522-8f95-39b364aa02b4/inventories

These requests are returning the same data each time (so far).

The request to get aggregates happens twice on every cycle, because
it happens each time we ensure the resource provider is present in
our local map of resource providers. Aggregates are checked each time
because otherwise there's no clean way for an operator to associate
aggregates and have the association picked up quickly.

The request to inventories is checking whether inventory has
changed. This is happening as a result of the regular call to
'update_available_resource' passing through the _update method.

#B1
That same method is also calling _init_compute_node, which will
_also_ think about updating the inventory and thus do the aggregates
check from _ensure_resource_provider. That seems redundant. Perhaps
we should only call update_resource_stats from _update and not from
_init_compute_node as they are both called from the same method in
the resource tracker.

That same method also regularly calls '_update_usage_from_instances',
which calls 'remove_deleted_instances' with a potentially empty list
of instances[2]. That method gets the allocations for this compute
node.
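
To make the shape of the redundancy easier to see, here's a condensed
sketch of the call flow as described above (method bodies are
paraphrased placeholders, not the real resource tracker):

    class TrackerFlow:
        def update_available_resource(self):
            self._init_compute_node()              # -> update_resource_stats()
            self._update_usage_from_instances([])  # -> remove_deleted_instances()
            self._update()                         # -> update_resource_stats() again

        def _init_compute_node(self):
            # ensures the provider exists, refreshes aggregates, and
            # (redundantly, per #B1) pushes inventory
            self.update_resource_stats()

        def _update(self):
            # also pushes inventory, re-checking aggregates via
            # _ensure_resource_provider
            self.update_resource_stats()

        def _update_usage_from_instances(self, instances):
            # GETs the allocations for this compute node, even when the
            # instance list is empty
            self.remove_deleted_instances(instances)

        def update_resource_stats(self): ...
        def remove_deleted_instances(self, instances): ...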

So before we've added any VMs we're already at 5000 requests per
minute in a 1000-node cluster (five requests per node every ~60s).

#B2
Adding in the fix at https://review.openstack.org/#/c/424305/
removes a lot of that churn by avoiding an unnecessary update from
_update, reducing the load to three requests every 60s when there are
no servers. The remaining requests are from the call to
_init_compute_node at #B1 above.
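
Very roughly, the shape of that fix is something like this (the field
list and helper names are illustrative, not the merged patch itself):

    INTERESTING_FIELDS = ('vcpus', 'memory_mb', 'local_gb')  # assumed subset

    def _resource_change(old_node, new_node):
        return any(getattr(old_node, f) != getattr(new_node, f)
                   for f in INTERESTING_FIELDS)

    def _update(reportclient, old_node, new_node):
        if not _resource_change(old_node, new_node):
            return  # skip the inventory/aggregates round trips
        reportclient.update_resource_stats(new_node)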

Creating a Server
=================

When we create a server there are seven requests in total, with these
being the ones involved with the actual instance:

GET /placement/resource_providers?resources=VCPU%3A1%2CMEMORY_MB%3A512%2CDISK_GB%3A1
GET /placement/allocations/717b8dcc-110c-4914-b9c1-c04433267577
PUT /placement/allocations/717b8dcc-110c-4914-b9c1-c04433267577

(allocations are done by comparing with what's there, if anything)

The others are what _update does.
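
A sketch of that compare-then-PUT flow (the payload shape follows my
reading of the allocations API; the comparison helper is a naive
placeholder):

    def _already_matches(current, desired):
        # naive placeholder; the real code compares per-provider
        # resource amounts
        return current == desired

    def ensure_allocations(session, placement_url, consumer_uuid,
                           rp_uuid, resources):
        url = f'{placement_url}/allocations/{consumer_uuid}'
        current = session.get(url).json()   # what placement already has
        desired = {'allocations': [
            {'resource_provider': {'uuid': rp_uuid}, 'resources': resources},
        ]}
        if _already_matches(current, desired):
            return                           # skip the PUT entirely
        session.put(url, json=desired)

    # e.g. resources={'VCPU': 1, 'MEMORY_MB': 512, 'DISK_GB': 1}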

After that the three requests grow to four per 60s:

GET /placement/resource_providers/8635a519-eac8-43b2-9bf0-aba848b328a7/aggregates
GET /placement/resource_providers/8635a519-eac8-43b2-9bf0-aba848b328a7/inventories
GET /placement/allocations/c4b73292-3731-4f25-b102-1bd176f4bd9b
GET /placement/resource_providers/8635a519-eac8-43b2-9bf0-aba848b328a7/allocations

#B3
The new GET to /placement/allocations is happening when the
resource tracker calls _update_usage_from_instance, which is always
being called because is_new_instance is always true in that method,
even when the instance is not "new". This is happening because the
tracked_instances dict is _always_ getting cleared before
_update_usage_from_instance is called. Which is weird, because
it appears that it is that method's job to update tracked_instances.
If I remove the clear(), the GET on /placement/allocations goes away.
But I'm not sure what else this will break. The addition of that line
was a long time ago, in this change (I think):
https://review.openstack.org/#/c/13182/

With the clear() gone the calls in