Sorry, been in a three-hour meeting. Comments inline...

On 06/06/2017 10:56 AM, Chris Dent wrote:
On Mon, 5 Jun 2017, Ed Leafe wrote:

One proposal is to essentially use the same logic in placement
that was used to include that host in those matching the
requirements. In other words, when it tries to allocate the amount
of disk, it would determine that that host is in a shared storage
aggregate, and be smart enough to allocate against that provider.
This was referred to in our discussion as "Plan A".

What would help for me is a greater explanation of whether, and if so
how and why, "Plan A" doesn't work for nested resource providers.

We'd have to add all the sorting/weighing logic from the existing scheduler into the Placement API. Otherwise, the Placement API won't understand which child provider to pick out of the many providers that meet the resource/trait requirements.

We can declare that allocating for shared disk is fairly deterministic
if we assume that any given compute node is only associated with one
shared disk provider.

a) We can't assume that.
b) A compute node could very well have both local disk and shared disk. How would the placement API know which one to pick? That is a sorting/weighing decision, and thus something the scheduler is responsible for.

My understanding is this determinism is not the case with nested
resource providers because there's some fairly late in the game
choosing of which pci device or which numa cell is getting used.
The existing resource tracking doesn't have this problem because the
claim of those resources is made very late in the game. <- Is this
correct?

No, it's not about determinism or how late in the game a claim decision is made. It's really just that the scheduler is the thing that does sorting/weighing, not the placement API. We made this decision due to operator feedback: they were not willing to give up the ability to add custom weighers, or the ability to have scheduling policies that rely on transient data like thermal metrics collection.

The problem comes into play when we want to claim from the scheduler
(or conductor). Additional information is required to choose which
child providers to use. <- Is this correct?

Correct.

Plan B overcomes the information deficit by including more
information in the response from placement (as straw-manned in the
etherpad [1]) allowing code in the filter scheduler to make accurate
claims. <- Is this correct?

Partly, yes. But, more than anything it's about the placement API returning resource provider UUIDs for child providers and sharing providers so that the scheduler, when it picks one of those SRIOV physical functions, or NUMA cells, or shared storage pools, has the identifier with which to tell the placement API "ok, claim *this* resource against *this* provider".
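
To make that concrete, here is a rough sketch (the payload shape and names are illustrative only, not the final API) of why the scheduler needs those UUIDs: the claim it eventually sends names resources against *specific* providers, including sharing/child providers.

```python
import uuid

# Hypothetical provider UUIDs the scheduler would have received back
# from the placement API along with the filtered candidates.
compute_node = str(uuid.uuid4())    # the compute node provider itself
shared_storage = str(uuid.uuid4())  # a sharing provider (shared disk pool)

# Sketch of a claim: each resource amount is allocated against a
# specific provider UUID, which is why placement must return them.
allocation = {
    "allocations": [
        {"resource_provider": {"uuid": compute_node},
         "resources": {"VCPU": 2, "MEMORY_MB": 4096}},
        {"resource_provider": {"uuid": shared_storage},
         "resources": {"DISK_GB": 100}},
    ]
}
```

Without the sharing provider's UUID, the scheduler has no way to express "the DISK_GB comes from *that* pool, not local disk".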

* We already have the information the filter scheduler needs now by
  some other means, right?  What are the reasons we don't want to
  use that anymore?

The filter scheduler has most of the information, yes. What it doesn't have is the *identifier* (UUID) for things like SRIOV PFs or NUMA cells that the Placement API will use to distinguish between things. In other words, the filter scheduler currently does things like unpack a NUMATopology object into memory and determine a NUMA cell in which to place an instance. However, it has no concept that that NUMA cell is (or will soon be, once nested-resource-providers is done) a resource provider in the placement API. Same for SRIOV PFs. Same for VGPUs. Same for FPGAs, etc. That's why the placement API needs to return information to the scheduler that will allow the scheduler to understand "hey, this NUMA cell on compute node X is resource provider $UUID".
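
The gap can be sketched like this (the mapping and names are invented for illustration): today the scheduler works with a NUMA cell *index* inside a NUMATopology blob; what the claim needs is the placement *UUID* for that cell.

```python
import uuid

# Hypothetical mapping the placement API would have to hand back:
# NUMA cell index on compute node X -> resource provider UUID.
numa_cell_providers = {
    0: str(uuid.uuid4()),  # NUMA cell 0 on compute node X
    1: str(uuid.uuid4()),  # NUMA cell 1 on compute node X
}

# The scheduler's existing logic picks a cell by index...
chosen_cell = 1

# ...but the allocation it writes must name the provider UUID.
provider_uuid = numa_cell_providers[chosen_cell]
```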

* Part of the reason for having nested resource providers is because
  it can allow affinity/anti-affinity below the compute node (e.g.,
  workloads on the same host but different numa cells).

Mmm, kinda, yeah.

If I
  remember correctly, the modelling and tracking of this kind of
  information in this way comes out of the time when we imagined the
  placement service would be doing considerably more filtering than
  is planned now. Plan B appears to be an acknowledgement of "on
  some of this stuff, we can't actually do anything but provide you
  some info, you need to decide".

Not really. Filtering is still going to be done in the placement API. It's the thing that says "hey, these providers (or trees of providers) meet these resource and trait requirements". The scheduler however is what takes that set of filtered providers and does its sorting/weighing magic and selects one.
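
A minimal sketch of that division of labor (the data shapes and the weigher here are invented, not real Nova code): placement filters on resource/trait requirements; the scheduler takes the filtered set and applies its (possibly custom, deployer-defined) weighing to pick one.

```python
# Placement API side: return only providers that can satisfy the
# requested amounts of each resource class.
def placement_filter(providers, required):
    return [p for p in providers
            if all(p["inventory"].get(rc, 0) >= amt
                   for rc, amt in required.items())]

# Scheduler side: sort/weigh the filtered candidates and select one.
# The weigher is pluggable -- e.g. it might use transient metrics.
def scheduler_pick(candidates, weigher):
    return max(candidates, key=weigher)

providers = [
    {"uuid": "pf-1", "inventory": {"SRIOV_NET_VF": 4}, "io_load": 0.7},
    {"uuid": "pf-2", "inventory": {"SRIOV_NET_VF": 2}, "io_load": 0.1},
    {"uuid": "pf-3", "inventory": {"SRIOV_NET_VF": 0}, "io_load": 0.0},
]

# Placement filters out pf-3 (no capacity); the scheduler then
# prefers the least-loaded PF -- a policy placement knows nothing about.
candidates = placement_filter(providers, {"SRIOV_NET_VF": 1})
best = scheduler_pick(candidates, lambda p: -p["io_load"])
```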

If that's the case, is the
  topological modelling on the placement DB side of things solely a
  convenient place to store information? If there were some other
  way to model that topology could things currently being considered
  for modelling as nested providers be instead simply modelled as
  inventories of a particular class of resource?
  (I'm not suggesting we do this, rather that the answer that says
  why we don't want to do this is useful for understanding the
  picture.)

The modeling of the topologies of providers in the placement API/DB is strictly to ensure consistency and correctness of representation. We're modeling the actual relationship between resource providers in a generic way and not embedding that topology information in a variety of JSON blobs and other structs in the cell database.

* Does a claim made in the scheduler need to be complete? Is there
  value in making a partial claim from the scheduler that consumes a
  vcpu and some ram, and then in the resource tracker is corrected
  to consume a specific pci device, numa cell, gpu and/or fpga?
  Would this be better or worse than what we have now? Why?

Good question. I think the answer to this is probably pretty theoretical at this point. My gut instinct is that we should treat the consumption of resources in an atomic fashion, and that the transactional nature of allocation will result in fewer race conditions and cleaner code. But, admittedly, this is just my gut reaction.
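
To illustrate the gut instinct (this is a toy model, not Nova code): an all-or-nothing claim either consumes every requested resource or consumes nothing, so there is never a half-claimed state to detect and unwind.

```python
import threading

class Inventory:
    """Toy inventory supporting atomic, all-or-nothing claims."""

    def __init__(self, capacity):
        self.capacity = dict(capacity)
        self._lock = threading.Lock()

    def claim(self, request):
        # Check and consume under one lock: the claim either succeeds
        # completely or fails completely -- no partial state to correct
        # later in the resource tracker.
        with self._lock:
            if any(self.capacity.get(rc, 0) < amt
                   for rc, amt in request.items()):
                return False
            for rc, amt in request.items():
                self.capacity[rc] -= amt
            return True

inv = Inventory({"VCPU": 4, "MEMORY_MB": 8192})
ok = inv.claim({"VCPU": 2, "MEMORY_MB": 4096})   # succeeds as a whole
bad = inv.claim({"VCPU": 8, "MEMORY_MB": 1024})  # fails; nothing consumed
```

A two-phase "partial claim now, correct later" scheme would instead need compensating logic for every failure path between the two steps.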

* What is lacking in placement's representation of resource providers
  that makes it difficult or impossible for an allocation against a
  parent provider to be able to determine the correct child
  providers to which to cascade some of the allocation? (And by
  extension make the earlier scheduling decision.)

See above. The sorting/weighing logic, which is very much deployer-defined and reeks of customization, is what would need to be added to the placement API.

best,
-jay

That's a start. With answers to at least some of these questions I
think the straw man in the etherpad can be more effectively
evaluated. As things stand right now it is a proposed solution
without a clear problem statement. I feel like we could do with a
clearer problem statement.

Thanks.

[1] https://etherpad.openstack.org/p/placement-allocations-straw-man



__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
