Sorry, been in a three-hour meeting. Comments inline...

On 06/06/2017 10:56 AM, Chris Dent wrote:
On Mon, 5 Jun 2017, Ed Leafe wrote:

One proposal is to essentially use the same logic in placement
that was used to include that host in those matching the
requirements. In other words, when it tries to allocate the amount
of disk, it would determine that that host is in a shared storage
aggregate, and be smart enough to allocate against that provider.
This was referred to in our discussion as "Plan A".

What would help for me is a greater explanation of whether, and if so
how and why, "Plan A" doesn't work for nested resource providers.

We'd have to add all the sorting/weighing logic from the existing scheduler into the Placement API. Otherwise, the Placement API won't understand which child provider to pick out of the many providers that meet the resource/trait requirements.

We can declare that allocating for shared disk is fairly deterministic
if we assume that any given compute node is only associated with one
shared disk provider.

a) We can't assume that.
b) A compute node could very well have both local disk and shared disk. How would the placement API know which one to pick? That is a sorting/weighing decision, and thus something the scheduler is responsible for.

My understanding is this determinism is not the case with nested
resource providers because there's some fairly late in the game
choosing of which pci device or which numa cell is getting used.
The existing resource tracking doesn't have this problem because the
claim of those resources is made very late in the game. <- Is this
correct?

No, it's not about determinism or how late in the game a claim decision is made. It's really just that the scheduler is the thing that does sorting/weighing, not the placement API. We made this decision due to operator feedback: they were not willing to give up the ability to add custom weighers, or the ability to have scheduling policies that rely on transient data like thermal metrics collection.

The problem comes into play when we want to claim from the scheduler
(or conductor). Additional information is required to choose which
child providers to use. <- Is this correct?

Correct.

Plan B overcomes the information deficit by including more
information in the response from placement (as straw-manned in the
etherpad [1]) allowing code in the filter scheduler to make accurate
claims. <- Is this correct?

Partly, yes. But, more than anything it's about the placement API returning resource provider UUIDs for child providers and sharing providers so that the scheduler, when it picks one of those SRIOV physical functions, or NUMA cells, or shared storage pools, has the identifier with which to tell the placement API "ok, claim *this* resource against *this* provider".
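
To make that concrete, here is a rough sketch (the payload shape and names are illustrative only, not the final API) of why the scheduler needs those UUIDs: the claim it eventually sends names resources against *specific* providers, including sharing/child providers.

```python
import uuid

# Hypothetical provider UUIDs the scheduler would have received back
# from the placement API along with the filtered candidates.
compute_node = str(uuid.uuid4())    # the compute node provider itself
shared_storage = str(uuid.uuid4())  # a sharing provider (shared disk pool)

# Sketch of a claim: each resource amount is allocated against a
# specific provider UUID, which is why placement must return them.
allocation = {
    "allocations": [
        {"resource_provider": {"uuid": compute_node},
         "resources": {"VCPU": 2, "MEMORY_MB": 4096}},
        {"resource_provider": {"uuid": shared_storage},
         "resources": {"DISK_GB": 100}},
    ]
}
```

Without the sharing provider's UUID, the scheduler has no way to express "the DISK_GB comes from *that* pool, not local disk".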

* We already have the information the filter scheduler needs now by
  some other means, right?  What are the reasons we don't want to
  use that anymore?

The filter scheduler has most of the information, yes. What it doesn't have is the *identifier* (UUID) for things like SRIOV PFs or NUMA cells that the Placement API will use to distinguish between things. In other words, the filter scheduler currently does things like unpack a NUMATopology object into memory and determine a NUMA cell in which to place an instance. However, it has no concept that that NUMA cell is (or will soon be, once nested-resource-providers is done) a resource provider in the placement API. Same for SRIOV PFs. Same for VGPUs. Same for FPGAs, etc. That's why the placement API needs to return information to the scheduler that will allow the scheduler to understand "hey, this NUMA cell on compute node X is resource provider $UUID".
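
The gap can be sketched like this (the mapping and names are invented for illustration): today the scheduler works with a NUMA cell *index* inside a NUMATopology blob; what the claim needs is the placement *UUID* for that cell.

```python
import uuid

# Hypothetical mapping the placement API would have to hand back:
# NUMA cell index on compute node X -> resource provider UUID.
numa_cell_providers = {
    0: str(uuid.uuid4()),  # NUMA cell 0 on compute node X
    1: str(uuid.uuid4()),  # NUMA cell 1 on compute node X
}

# The scheduler's existing logic picks a cell by index...
chosen_cell = 1

# ...but the allocation it writes must name the provider UUID.
provider_uuid = numa_cell_providers[chosen_cell]
```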

* Part of the reason for having nested resource providers is because
  it can allow affinity/anti-affinity below the compute node (e.g.,
  workloads on the same host but different numa cells).

Mmm, kinda, yeah.

If I
  remember correctly, the modelling and tracking of this kind of
  information in this way comes out of the time when we imagined the
  placement service would be doing considerably more filtering than
  is planned now. Plan B appears to be an acknowledgement of "on
  some of this stuff, we can't actually do anything but provide you
  some info, you need to decide".

Not really. Filtering is still going to be done in the placement API. It's the thing that says "hey, these providers (or trees of providers) meet these resource and trait requirements". The scheduler however is what takes that set of filtered providers and does its sorting/weighing magic and selects one.
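
A minimal sketch of that division of labor (the data shapes and the weigher here are invented, not real Nova code): placement filters on resource/trait requirements; the scheduler takes the filtered set and applies its (possibly custom, deployer-defined) weighing to pick one.

```python
# Placement API side: return only providers that can satisfy the
# requested amounts of each resource class.
def placement_filter(providers, required):
    return [p for p in providers
            if all(p["inventory"].get(rc, 0) >= amt
                   for rc, amt in required.items())]

# Scheduler side: sort/weigh the filtered candidates and select one.
# The weigher is pluggable -- e.g. it might use transient metrics.
def scheduler_pick(candidates, weigher):
    return max(candidates, key=weigher)

providers = [
    {"uuid": "pf-1", "inventory": {"SRIOV_NET_VF": 4}, "io_load": 0.7},
    {"uuid": "pf-2", "inventory": {"SRIOV_NET_VF": 2}, "io_load": 0.1},
    {"uuid": "pf-3", "inventory": {"SRIOV_NET_VF": 0}, "io_load": 0.0},
]

# Placement filters out pf-3 (no capacity); the scheduler then
# prefers the least-loaded PF -- a policy placement knows nothing about.
candidates = placement_filter(providers, {"SRIOV_NET_VF": 1})
best = scheduler_pick(candidates, lambda p: -p["io_load"])
```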

If that's the case, is the
  topological modelling on the placement DB side of things solely a
  convenient place to store information? If there were some other
  way to model that topology could things currently being considered
  for modelling as nested providers be instead simply modelled as
  inventories of a particular class of resource?
  (I'm not suggesting we do this, rather that the answer that says
  why we don't want to do this is useful for understanding the
  picture.)

The modeling of the topologies of providers in the placement API/DB is strictly to ensure consistency and correctness of representation. We're modeling the actual relationship between resource providers in a generic way and not embedding that topology information in a variety of JSON blobs and other structs in the cell database.

* Does a claim made in the scheduler need to be complete? Is there
  value in making a partial claim from the scheduler that consumes a
  vcpu and some ram, and then in the resource tracker is corrected
  to consume a specific pci device, numa cell, gpu and/or fpga?
  Would this be better or worse than what we have now? Why?

Good question. I think the answer to this is probably pretty theoretical at this point. My gut instinct is that we should treat the consumption of resources in an atomic fashion, and that the transactional nature of allocation will result in fewer race conditions and cleaner code. But, admittedly, this is just my gut reaction.
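
To illustrate the gut instinct (this is a toy model, not Nova code): an all-or-nothing claim either consumes every requested resource or consumes nothing, so there is never a half-claimed state to detect and unwind.

```python
import threading

class Inventory:
    """Toy inventory supporting atomic, all-or-nothing claims."""

    def __init__(self, capacity):
        self.capacity = dict(capacity)
        self._lock = threading.Lock()

    def claim(self, request):
        # Check and consume under one lock: the claim either succeeds
        # completely or fails completely -- no partial state to correct
        # later in the resource tracker.
        with self._lock:
            if any(self.capacity.get(rc, 0) < amt
                   for rc, amt in request.items()):
                return False
            for rc, amt in request.items():
                self.capacity[rc] -= amt
            return True

inv = Inventory({"VCPU": 4, "MEMORY_MB": 8192})
ok = inv.claim({"VCPU": 2, "MEMORY_MB": 4096})   # succeeds as a whole
bad = inv.claim({"VCPU": 8, "MEMORY_MB": 1024})  # fails; nothing consumed
```

A two-phase "partial claim now, correct later" scheme would instead need compensating logic for every failure path between the two steps.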

* What is lacking in placement's representation of resource providers
  that makes it difficult or impossible for an allocation against a
  parent provider to be able to determine the correct child
  providers to which to cascade some of the allocation? (And by
  extension make the earlier scheduling decision.)

See above. The sorting/weighing logic, which is very much deployer-defined and reeks of customization, is what would need to be added to the placement API.

best,
-jay

That's a start. With answers to at least some of these questions I
think the straw man in the etherpad can be more effectively
evaluated. As things stand right now it is a proposed solution
without a clear problem statement. I feel like we could do with a
clearer problem statement.

Thanks.

[1] https://etherpad.openstack.org/p/placement-allocations-straw-man



__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
