On 01/18/2018 03:54 PM, Mathieu Gagné wrote:
Hi,

On Tue, Jan 16, 2018 at 4:24 PM, melanie witt <melwi...@gmail.com> wrote:
Hello Stackers,

This is a heads up to any of you using the AggregateCoreFilter,
AggregateRamFilter, and/or AggregateDiskFilter in the filter scheduler.
These filters have effectively allowed operators to set overcommit ratios
per aggregate rather than per compute node in <= Newton.

Beginning in Ocata, there is a behavior change where aggregate-based
overcommit ratios will no longer be honored during scheduling. Instead,
overcommit values must be set on a per compute node basis in nova.conf.
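
For concreteness, a minimal sketch of the per compute node settings this
refers to, assuming crudini(1) is available on the node (the ratio values
below are illustrative examples, not recommendations):

 # sketch: set per-node overcommit ratios in each compute node's nova.conf
 # (assumes crudini(1); adjust the values for your deployment)
 crudini --set /etc/nova/nova.conf DEFAULT cpu_allocation_ratio 16.0
 crudini --set /etc/nova/nova.conf DEFAULT ram_allocation_ratio 1.5
 crudini --set /etc/nova/nova.conf DEFAULT disk_allocation_ratio 1.0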

Details: as of Ocata, instead of considering all compute nodes at the start
of scheduler filtering, an optimization has been added to query resource
capacity from placement and prune the compute node list with the result
*before* any filters are applied. Placement tracks resource capacity and
usage and does *not* track aggregate metadata [1]. Because of this,
placement cannot consider aggregate-based overcommit and will exclude
compute nodes that do not have capacity based on per compute node
overcommit.
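
To make the new flow concrete, here is a hedged sketch of the kind of
pre-filtering query the Ocata scheduler makes against placement (the
resources query parameter exists as of placement microversion 1.4;
$PLACEMENT and $AUTH_TOKEN are placeholders):

 # sketch: ask placement which providers have capacity for the request
 curl -s -H "X-Auth-Token: $AUTH_TOKEN" \
    -H "OpenStack-API-Version: placement 1.4" \
    "https://$PLACEMENT/resource_providers?resources=VCPU:2,MEMORY_MB:4096,DISK_GB:40"
 # only the providers returned here (capacity checked against each node's
 # own allocation ratios) are handed to the filter scheduler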

How to prepare: if you have been relying on per aggregate overcommit, during
your upgrade to Ocata, you must change to using per compute node overcommit
ratios in order for your scheduling behavior to stay consistent. Otherwise,
you may notice increased NoValidHost scheduling failures as the
aggregate-based overcommit is no longer being considered. You can safely
remove the AggregateCoreFilter, AggregateRamFilter, and AggregateDiskFilter
from your enabled_filters and you do not need to replace them with any other
core/ram/disk filters. The placement query takes care of the core/ram/disk
filtering instead, so CoreFilter, RamFilter, and DiskFilter are redundant.
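
For example, assuming crudini(1) and a filter list close to the defaults,
the trimmed option might look like this (illustrative only; adapt the
remaining names to your own deployment's list):

 # illustrative: an enabled_filters value with the redundant
 # core/ram/disk filters dropped
 crudini --set /etc/nova/nova.conf filter_scheduler enabled_filters \
    RetryFilter,AvailabilityZoneFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter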

Thanks,
-melanie

[1] Placement is a clean slate for resource management, and prior to
placement, there were conflicts between the different methods for setting
overcommit ratios that were never resolved, such as, "which value should be
taken if a compute node has overcommit set AND its aggregate has it set?
Which takes precedence?" And, "if a compute node is in more than one
aggregate, which overcommit value should be taken?" So, those ambiguities
were not something that was desirable to bring forward into placement.

So we are a user of this feature and I do have some questions/concerns.

We use this feature to segregate capacity/hosts by CPU allocation
ratio using aggregates.
This is because we have different offerings/flavors based on those
allocation ratios; this is part of our business model.
Flavor extra_specs are used to schedule instances onto the appropriate
hosts using the AggregateInstanceExtraSpecsFilter.

The AggregateInstanceExtraSpecsFilter will continue to work, but this filter is run *after* the placement service would have already eliminated compute node records due to placement considering the allocation ratio set for the compute node provider's inventory records.
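
(For readers following along, the wiring Mathieu describes looks roughly
like this; the property name, aggregate, and flavor below are invented for
illustration:)

 # illustrative: tag an aggregate, then tie a flavor to it via a scoped
 # extra spec that AggregateInstanceExtraSpecsFilter matches on
 openstack aggregate set --property cpu_ratio_tier=4x ratio-4x-aggregate
 openstack flavor set \
    --property aggregate_instance_extra_specs:cpu_ratio_tier=4x flavor-4x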

Our setup has a configuration management system and we use aggregates
exclusively when it comes to allocation ratio.

Yes, that's going to be a problem. You will need to use your configuration management system to write the nova.conf *_allocation_ratio configuration option values (cpu_allocation_ratio, ram_allocation_ratio, disk_allocation_ratio) appropriately for each compute node.

We do not rely on the cpu_allocation_ratio config in nova-scheduler or
nova-compute.
One of the reasons is that we do not wish to have to
update/package/redeploy our configuration management system just to
add one or more compute nodes to an aggregate/capacity pool.

Yes, I understand.

This means anyone (likely an operator or other provisioning
technician) can perform this action without having to touch or even
know about our configuration management system.
We can also transfer capacity from one aggregate to another if there
is a need, again, using aggregate memberships.
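
(For concreteness, the aggregate membership operations referred to here are
ordinary API calls; the aggregate and host names below are illustrative:)

 # illustrative: moving a (previously evacuated) host between aggregates
 openstack aggregate remove host ratio-4x-aggregate compute-042
 openstack aggregate add host ratio-16x-aggregate compute-042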

Aggregates don't have "capacity". Aggregates are not capacity pools. Only compute nodes provide resources for guests to consume.

> (we do "evacuate" the node if there are instances on it)
Our capacity monitoring is based on aggregate memberships and this
offers an easy overview of the current capacity.

By "based on aggregate membership", I believe you are referring to a system where you have all compute nodes in a particular aggregate only schedule instances with a particular flavor "A" and so you manage "capacity" by saying things like "aggregate X can fit 10 more instances of flavor A in it"?

Do I understand you correctly?

> Note that a host can be in one and only one aggregate in our setup.

In *your* setup. And that's the only reason this works for you. You'd get totally unpredictable behaviour if your compute nodes were in multiple aggregates.

What's the migration path for us?

My understanding is that we will now be forced to have people rely on
our configuration management system (which they don't have access to)
to perform simple tasks we used to be able to do through the API.
I find this unfortunate and I would like to be offered an alternative
solution, as the currently proposed solution is not acceptable for us.
We are losing "agility" in our operational tasks.

I see a possible path forward:

We add a new CONF option called "disable_allocation_ratio_autoset". This new CONF option would disable the behaviour of the nova-compute service in automatically setting the allocation ratio of its inventory records for VCPU, MEMORY_MB and DISK_GB resources.
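
To illustrate, if such an option existed, turning it on would be a one-line
change per compute node (hypothetical: this option is Jay's proposal and
does not exist in nova today; crudini(1) assumed as before):

 # hypothetical: enable the proposed option on a compute node
 crudini --set /etc/nova/nova.conf DEFAULT disable_allocation_ratio_autoset true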

This would allow you to set compute node allocation ratios in batches.

At first, it might be manual... executing something like this against the API database:

 -- sketch: set the allocation ratio for every provider in an aggregate,
 -- directly in the nova_api database (back up before trying this)
 UPDATE inventories
 INNER JOIN resource_providers
 ON inventories.resource_provider_id = resource_providers.id
 AND inventories.resource_class_id = $RESOURCE_CLASS_ID
 INNER JOIN resource_provider_aggregates
 ON resource_providers.id = resource_provider_aggregates.resource_provider_id
 INNER JOIN placement_aggregates
 ON resource_provider_aggregates.aggregate_id = placement_aggregates.id
 AND placement_aggregates.uuid = $AGGREGATE_UUID
 SET inventories.allocation_ratio = $NEW_VALUE;
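
(One caveat if you try this today: the nova-compute resource tracker
periodically rewrites its inventory records from the nova.conf ratios, so a
direct database update like this will be overwritten unless something like
the option proposed above prevents it.)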

We could follow up with a little CLI tool that would do the above for you on the command line... something like this:

nova-manage db set_aggregate_placement_allocation_ratio --aggregate_uuid=$AGG_UUID --resource_class=VCPU --ratio=16.0

Of course, you could always call the Placement REST API to override the allocation ratio for particular providers:

 # note: the PUT body must also carry the inventory's total and the current
 # resource provider generation, or placement will reject the update
 DATA="{\"resource_provider_generation\": $GENERATION, \"total\": $TOTAL, \"allocation_ratio\": $RATIO}"
 curl -X PUT -H "Content-Type: application/json" \
    -H "X-Auth-Token: $AUTH_TOKEN" -d "$DATA" \
    https://$PLACEMENT/resource_providers/$RP_UUID/inventories/VCPU

and you could loop through all the resource providers listed under a particular aggregate, which you can find using something like this:

 # member_of requires placement microversion 1.3 or later
 curl -H "X-Auth-Token: $AUTH_TOKEN" \
    -H "OpenStack-API-Version: placement 1.3" \
    "https://$PLACEMENT/resource_providers?member_of=$AGG_UUID"
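
Putting those two calls together, a hedged sketch of that loop, assuming
bash and jq(1), with the same $PLACEMENT, $AUTH_TOKEN, $AGG_UUID and $RATIO
placeholders as above:

 # sketch: set the VCPU allocation ratio for every provider in an aggregate
 for RP_UUID in $(curl -s -H "X-Auth-Token: $AUTH_TOKEN" \
       -H "OpenStack-API-Version: placement 1.3" \
       "https://$PLACEMENT/resource_providers?member_of=$AGG_UUID" \
       | jq -r '.resource_providers[].uuid'); do
   # read the full inventory to learn the provider generation and VCPU total
   INV=$(curl -s -H "X-Auth-Token: $AUTH_TOKEN" \
       "https://$PLACEMENT/resource_providers/$RP_UUID/inventories")
   GEN=$(echo "$INV" | jq '.resource_provider_generation')
   TOTAL=$(echo "$INV" | jq '.inventories.VCPU.total')
   # write the inventory back with the new allocation_ratio
   curl -s -X PUT -H "Content-Type: application/json" \
       -H "X-Auth-Token: $AUTH_TOKEN" \
       -d "{\"resource_provider_generation\": $GEN, \"total\": $TOTAL, \"allocation_ratio\": $RATIO}" \
       "https://$PLACEMENT/resource_providers/$RP_UUID/inventories/VCPU"
 done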

Anyway, as you can tell, there are multiple ways to set the allocation ratios in batches.

I think the key is somehow disabling the nova-compute service's behaviour of overriding the allocation ratio of compute node inventories with the values of the nova.conf options.

Thoughts?
-jay
