On 01/18/2018 03:54 PM, Mathieu Gagné wrote:
Hi,

On Tue, Jan 16, 2018 at 4:24 PM, melanie witt <melwi...@gmail.com> wrote:
Hello Stackers,

This is a heads up to any of you using the AggregateCoreFilter,
AggregateRamFilter, and/or AggregateDiskFilter in the filter scheduler.
These filters have effectively allowed operators to set overcommit ratios
per aggregate rather than per compute node in <= Newton.

Beginning in Ocata, there is a behavior change where aggregate-based
overcommit ratios will no longer be honored during scheduling. Instead,
overcommit values must be set on a per compute node basis in nova.conf.
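
For concreteness, a minimal sketch of the per compute node settings this
refers to, assuming crudini(1) is available on the node (the ratio values
below are illustrative examples, not recommendations):

 # sketch: set per-node overcommit ratios in each compute node's nova.conf
 # (assumes crudini(1); adjust the values for your deployment)
 crudini --set /etc/nova/nova.conf DEFAULT cpu_allocation_ratio 16.0
 crudini --set /etc/nova/nova.conf DEFAULT ram_allocation_ratio 1.5
 crudini --set /etc/nova/nova.conf DEFAULT disk_allocation_ratio 1.0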

Details: as of Ocata, instead of considering all compute nodes at the start
of scheduler filtering, an optimization has been added to query resource
capacity from placement and prune the compute node list with the result
*before* any filters are applied. Placement tracks resource capacity and
usage and does *not* track aggregate metadata [1]. Because of this,
placement cannot consider aggregate-based overcommit and will exclude
compute nodes that do not have capacity based on per compute node
overcommit.
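
To make the new flow concrete, here is a hedged sketch of the kind of
pre-filtering query the Ocata scheduler makes against placement (the
resources query parameter exists as of placement microversion 1.4;
$PLACEMENT and $AUTH_TOKEN are placeholders):

 # sketch: ask placement which providers have capacity for the request
 curl -s -H "X-Auth-Token: $AUTH_TOKEN" \
    -H "OpenStack-API-Version: placement 1.4" \
    "https://$PLACEMENT/resource_providers?resources=VCPU:2,MEMORY_MB:4096,DISK_GB:40"
 # only the providers returned here (capacity checked against each node's
 # own allocation ratios) are handed to the filter scheduler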

How to prepare: if you have been relying on per aggregate overcommit, during
your upgrade to Ocata, you must change to using per compute node overcommit
ratios in order for your scheduling behavior to stay consistent. Otherwise,
you may notice increased NoValidHost scheduling failures as the
aggregate-based overcommit is no longer being considered. You can safely
remove the AggregateCoreFilter, AggregateRamFilter, and AggregateDiskFilter
from your enabled_filters and you do not need to replace them with any other
core/ram/disk filters. The placement query takes care of the core/ram/disk
filtering instead, so CoreFilter, RamFilter, and DiskFilter are redundant.
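
For example, assuming crudini(1) and a filter list close to the defaults,
the trimmed option might look like this (illustrative only; adapt the
remaining names to your own deployment's list):

 # illustrative: an enabled_filters value with the redundant
 # core/ram/disk filters dropped
 crudini --set /etc/nova/nova.conf filter_scheduler enabled_filters \
    RetryFilter,AvailabilityZoneFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter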

Thanks,
-melanie

[1] Placement is a clean slate for resource management, and prior to
placement, there were conflicts between the different methods for setting
overcommit ratios that were never resolved, such as, "which value should be
taken if a compute node has overcommit set AND its aggregate has it set?
Which takes precedence?" And, "if a compute node is in more than one
aggregate, which overcommit value should be taken?" So, those ambiguities
were not something that was desirable to bring forward into placement.

So we are a user of this feature and I do have some questions/concerns.

We use this feature to segregate capacity/hosts by CPU allocation
ratio using aggregates.
This is because we have different offerings/flavors based on those
allocation ratios; this is part of our business model.
Flavor extra_specs are used to schedule instances onto the appropriate
hosts using the AggregateInstanceExtraSpecsFilter.

The AggregateInstanceExtraSpecsFilter will continue to work, but this filter is run *after* the placement service would have already eliminated compute node records due to placement considering the allocation ratio set for the compute node provider's inventory records.
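
(For readers following along, the wiring Mathieu describes looks roughly
like this; the property name, aggregate, and flavor below are invented for
illustration:)

 # illustrative: tag an aggregate, then tie a flavor to it via a scoped
 # extra spec that AggregateInstanceExtraSpecsFilter matches on
 openstack aggregate set --property cpu_ratio_tier=4x ratio-4x-aggregate
 openstack flavor set \
    --property aggregate_instance_extra_specs:cpu_ratio_tier=4x flavor-4x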

Our setup has a configuration management system and we use aggregates
exclusively when it comes to allocation ratio.

Yes, that's going to be a problem. You will need to use your configuration management system to write the nova.conf *_allocation_ratio configuration option values (cpu_allocation_ratio, ram_allocation_ratio, disk_allocation_ratio) appropriately for each compute node.

We do not rely on the cpu_allocation_ratio config in nova-scheduler or
nova-compute.
One of the reasons is that we do not wish to have to
update/package/redeploy our configuration management system just to
add one or more compute nodes to an aggregate/capacity pool.

Yes, I understand.

This means anyone (likely an operator or other provisioning
technician) can perform this action without having to touch or even
know about our configuration management system.
We can also transfer capacity from one aggregate to another if there
is a need, again, using aggregate memberships.
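
(For concreteness, the aggregate membership operations referred to here are
ordinary API calls; the aggregate and host names below are illustrative:)

 # illustrative: moving a (previously evacuated) host between aggregates
 openstack aggregate remove host ratio-4x-aggregate compute-042
 openstack aggregate add host ratio-16x-aggregate compute-042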

Aggregates don't have "capacity". Aggregates are not capacity pools. Only compute nodes provide resources for guests to consume.

> (we do "evacuate" the node if there are instances on it)
Our capacity monitoring is based on aggregate memberships and this
offers an easy overview of the current capacity.

By "based on aggregate membership", I believe you are referring to a system where you have all compute nodes in a particular aggregate only schedule instances with a particular flavor "A" and so you manage "capacity" by saying things like "aggregate X can fit 10 more instances of flavor A in it"?

Do I understand you correctly?

> Note that a host can be in one and only one aggregate in our setup.

In *your* setup. And that's the only reason this works for you. You'd get totally unpredictable behaviour if your compute nodes were in multiple aggregates.

What's the migration path for us?

My understanding is that we will now be forced to have people rely on
our configuration management system (which they don't have access to)
to perform simple tasks we used to be able to do through the API.
I find this unfortunate and I would like to be offered an alternative
solution, as the currently proposed solution is not acceptable for us.
We are losing "agility" in our operational tasks.

I see a possible path forward:

We add a new CONF option called "disable_allocation_ratio_autoset". This new CONF option would disable the behaviour of the nova-compute service in automatically setting the allocation ratio of its inventory records for VCPU, MEMORY_MB and DISK_GB resources.
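
To illustrate, if such an option existed, turning it on would be a one-line
change per compute node (hypothetical: this option is Jay's proposal and
does not exist in nova today; crudini(1) assumed as before):

 # hypothetical: enable the proposed option on a compute node
 crudini --set /etc/nova/nova.conf DEFAULT disable_allocation_ratio_autoset true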

This would allow you to set compute node allocation ratios in batches.

At first, it might be manual... executing something like this against the API database:

 -- sketch: set the allocation ratio for every provider in an aggregate,
 -- directly in the nova_api database (back up before trying this)
 UPDATE inventories
 INNER JOIN resource_providers
 ON inventories.resource_provider_id = resource_providers.id
 AND inventories.resource_class_id = $RESOURCE_CLASS_ID
 INNER JOIN resource_provider_aggregates
 ON resource_providers.id = resource_provider_aggregates.resource_provider_id
 INNER JOIN placement_aggregates
 ON resource_provider_aggregates.aggregate_id = placement_aggregates.id
 AND placement_aggregates.uuid = $AGGREGATE_UUID
 SET inventories.allocation_ratio = $NEW_VALUE;
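
(One caveat if you try this today: the nova-compute resource tracker
periodically rewrites its inventory records from the nova.conf ratios, so a
direct database update like this will be overwritten unless something like
the option proposed above prevents it.)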

We could follow up with a little CLI tool that would do the above for you on the command line... something like this:

nova-manage db set_aggregate_placement_allocation_ratio --aggregate_uuid=$AGG_UUID --resource_class=VCPU --ratio=16.0

Of course, you could always call the Placement REST API to override the allocation ratio for particular providers:

 # note: the PUT body must also carry the inventory's total and the current
 # resource provider generation, or placement will reject the update
 DATA="{\"resource_provider_generation\": $GENERATION, \"total\": $TOTAL, \"allocation_ratio\": $RATIO}"
 curl -X PUT -H "Content-Type: application/json" \
    -H "X-Auth-Token: $AUTH_TOKEN" -d "$DATA" \
    https://$PLACEMENT/resource_providers/$RP_UUID/inventories/VCPU

and you could loop through all the resource providers listed under a particular aggregate, which you can find using something like this:

 # member_of requires placement microversion 1.3 or later
 curl -H "X-Auth-Token: $AUTH_TOKEN" \
    -H "OpenStack-API-Version: placement 1.3" \
    "https://$PLACEMENT/resource_providers?member_of=$AGG_UUID"
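
Putting those two calls together, a hedged sketch of that loop, assuming
bash and jq(1), with the same $PLACEMENT, $AUTH_TOKEN, $AGG_UUID and $RATIO
placeholders as above:

 # sketch: set the VCPU allocation ratio for every provider in an aggregate
 for RP_UUID in $(curl -s -H "X-Auth-Token: $AUTH_TOKEN" \
       -H "OpenStack-API-Version: placement 1.3" \
       "https://$PLACEMENT/resource_providers?member_of=$AGG_UUID" \
       | jq -r '.resource_providers[].uuid'); do
   # read the full inventory to learn the provider generation and VCPU total
   INV=$(curl -s -H "X-Auth-Token: $AUTH_TOKEN" \
       "https://$PLACEMENT/resource_providers/$RP_UUID/inventories")
   GEN=$(echo "$INV" | jq '.resource_provider_generation')
   TOTAL=$(echo "$INV" | jq '.inventories.VCPU.total')
   # write the inventory back with the new allocation_ratio
   curl -s -X PUT -H "Content-Type: application/json" \
       -H "X-Auth-Token: $AUTH_TOKEN" \
       -d "{\"resource_provider_generation\": $GEN, \"total\": $TOTAL, \"allocation_ratio\": $RATIO}" \
       "https://$PLACEMENT/resource_providers/$RP_UUID/inventories/VCPU"
 done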

Anyway, as you can tell, there are multiple ways to set the allocation ratios in batches.

I think the key is somehow disabling the nova-compute service's behaviour of overriding the allocation ratio of compute node inventories with the values of the nova.conf options.

Thoughts?
-jay
