Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-12 Thread Jay Pipes

On 06/12/2017 02:17 PM, Edward Leafe wrote:
On Jun 12, 2017, at 10:20 AM, Jay Pipes wrote:


The RP uuid is part of the provider: the compute node's uuid, and 
(after https://review.openstack.org/#/c/469147/ merges) the PCI 
device's uuid. So in the code that passes the PCI device information 
to the scheduler, we could add that new uuid field, and then the 
scheduler would have the information to a) select the best fit and 
then b) claim it with the specific uuid. Same for all the other 
nested/shared devices.


How would the scheduler know that a particular SRIOV PF resource 
provider UUID is on a particular compute node unless the placement API 
returns information indicating that SRIOV PF is a child of a 
particular compute node resource provider?


Because PCI devices are per compute node. The HostState object populates 
itself from the compute node here:


https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L224-L225

If we add the UUID information to the PCI device, as the above-mentioned 
patch proposes, when the scheduler selects a particular compute node 
that has the device, it uses the PCI device’s UUID. I thought that 
having that information in the scheduler was what that patch was all about.
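
Just to make that concrete, here is roughly what I picture; the field name is a
hypothetical placeholder, not necessarily what the patch actually adds:

    # Sketch only: a PCI device entry as the scheduler might see it once a
    # placement resource provider UUID is attached to it.
    pci_device_info = {
        'address': '0000:81:00.1',
        'vendor_id': '8086',
        'product_id': '154d',
        'dev_type': 'type-PF',
        'numa_node': 1,
        'rp_uuid': 'UUID-OF-THE-PF-RESOURCE-PROVIDER',  # the new piece
    }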


I would hope that over time, there'd be little to no need for the 
scheduler to read either the compute_nodes or the pci_devices tables 
(which, btw, are in the cell databases). The information that the 
scheduler currently keeps in the host state objects should eventually be 
constructed primarily from the results returned by the placement API, 
instead of the existing situation where the scheduler must make multiple 
calls to the multiple cell databases in order to fill that information in.


Best,
-jay

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-12 Thread Edward Leafe
On Jun 12, 2017, at 10:20 AM, Jay Pipes  wrote:

>> The RP uuid is part of the provider: the compute node's uuid, and (after 
>> https://review.openstack.org/#/c/469147/ merges) the PCI device's uuid. So 
>> in the code that passes the PCI device information to the scheduler, we 
>> could add that new uuid field, and then the scheduler would have the 
>> information to a) select the best fit and then b) claim it with the specific 
>> uuid. Same for all the other nested/shared devices.
> 
> How would the scheduler know that a particular SRIOV PF resource provider 
> UUID is on a particular compute node unless the placement API returns 
> information indicating that SRIOV PF is a child of a particular compute node 
> resource provider?


Because PCI devices are per compute node. The HostState object populates itself 
from the compute node here:

https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L224-L225
 


If we add the UUID information to the PCI device, as the above-mentioned patch 
proposes, when the scheduler selects a particular compute node that has the 
device, it uses the PCI device’s UUID. I thought that having that information 
in the scheduler was what that patch was all about.

-- Ed Leafe





__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-12 Thread Jay Pipes

On 06/09/2017 06:31 PM, Ed Leafe wrote:

On Jun 9, 2017, at 4:35 PM, Jay Pipes  wrote:


We can declare that allocating for shared disk is fairly deterministic
if we assume that any given compute node is only associated with one
shared disk provider.


a) we can't assume that
b) a compute node could very well have both local disk and shared disk. how 
would the placement API know which one to pick? This is a sorting/weighing 
decision and thus is something the scheduler is responsible for.


I remember having this discussion, and we concluded that a compute node could 
either have local or shared resources, but not both. There would be a trait to 
indicate shared disk. Has this changed?


I'm not sure it's changed per se :) It's just that there's nothing 
preventing this from happening. A compute node can theoretically have 
local disk and also be associated with a shared storage pool.



* We already have the information the filter scheduler needs now by
  some other means, right?  What are the reasons we don't want to
  use that anymore?


The filter scheduler has most of the information, yes. What it doesn't have is the 
*identifier* (UUID) for things like SRIOV PFs or NUMA cells that the Placement API will 
use to distinguish between things. In other words, the filter scheduler currently does 
things like unpack a NUMATopology object into memory and determine a NUMA cell to place 
an instance to. However, it has no concept that that NUMA cell is (or will soon be once 
nested-resource-providers is done) a resource provider in the placement API. Same for 
SRIOV PFs. Same for VGPUs. Same for FPGAs, etc. That's why we need to return information 
to the scheduler from the placement API that will allow the scheduler to understand 
"hey, this NUMA cell on compute node X is resource provider $UUID".


I guess that this was the point that confused me. The RP uuid is part of the 
provider: the compute node's uuid, and (after 
https://review.openstack.org/#/c/469147/ merges) the PCI device's uuid. So in 
the code that passes the PCI device information to the scheduler, we could add 
that new uuid field, and then the scheduler would have the information to a) 
select the best fit and then b) claim it with the specific uuid. Same for all 
the other nested/shared devices.


How would the scheduler know that a particular SRIOV PF resource 
provider UUID is on a particular compute node unless the placement API 
returns information indicating that SRIOV PF is a child of a particular 
compute node resource provider?



I don't mean to belabor this, but to my mind this seems a lot less disruptive 
to the existing code.


Belabor away :) I don't mind talking through the details. It's important 
to do.


Best,
-jay

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-09 Thread Chris Dent

On Fri, 9 Jun 2017, Dan Smith wrote:


In other words, I would expect to be able to explain the purpose of the
scheduler as "applies nova-specific logic to the generic resources that
placement says are _valid_, with the goal of determining which one is
_best_".


This sounds great as an explanation. If we can reach this we done good.

--
Chris Dent  ┬──┬◡ノ(° -°ノ)   https://anticdent.org/
freenode: cdent tw: @anticdent
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-09 Thread Chris Dent

On Fri, 9 Jun 2017, Jay Pipes wrote:


Sorry, been in a three-hour meeting. Comments inline...


Thanks for getting to this, it's very helpful to me.


* Part of the reason for having nested resource providers is because
  it can allow affinity/anti-affinity below the compute node (e.g.,
  workloads on the same host but different numa cells).


Mmm, kinda, yeah.


What I meant by this was that if it didn't matter which of more than
one nested rp was used, then it would be easier to simply consider
the group of them as members of an inventory (that came out a bit
more in one of the later questions).


* Does a claim made in the scheduler need to be complete? Is there
  value in making a partial claim from the scheduler that consumes a
  vcpu and some ram, and then in the resource tracker is corrected
  to consume a specific pci device, numa cell, gpu and/or fpga?
  Would this be better or worse than what we have now? Why?


Good question. I think the answer to this is probably pretty theoretical at 
this point. My gut instinct is that we should treat the consumption of 
resources in an atomic fashion, and that transactional nature of allocation 
will result in fewer race conditions and cleaner code. But, admittedly, this 
is just my gut reaction.


I suppose if we were more spread oriented than pack oriented, an
allocation of vcpu and ram would almost operate as a proxy for a
lock, allowing the later correcting allocation proposed above to be
somewhat safe because other near concurrent emplacements would be
happening on some other machine. But we don't have that reality.
I've always been in favor of making the allocation as early as
possible. I remember those halcyon days when we even thought it
might be possible to make a request and claim of resources in one
HTTP request.


  that makes it difficult or impossible for an allocation against a
  parent provider to be able to determine the correct child
  providers to which to cascade some of the allocation? (And by
  extension make the earlier scheduling decision.)


See above. The sorting/weighing logic, which is very much deployer-defined 
and reeks of customization, is what would need to be added to the placement 
API.


And enough of that sorting/weighing logic is likely to do with child or
shared providers that it's not possible to constrain the weighing
and sorting to solely compute nodes? Not just whether the host is on
fire, but the shared disk farm too?

Okay, thank you, that helps set the stage more clearly and leads
straight to my remaining big question, which is asked on the spec
you've proposed:

https://review.openstack.org/#/c/471927/

What are the broad-strokes mechanisms for connecting the non-allocation
data in the response to GET /allocation_requests to the sorting and
weighing logic? Answering on the spec works fine for me, I'm just
repeating it here in case people following along want the transition
over to the spec.

Thanks again.

--
Chris Dent  ┬──┬◡ノ(° -°ノ)   https://anticdent.org/
freenode: cdent tw: @anticdent
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-09 Thread Dan Smith
>> b) a compute node could very well have both local disk and shared 
>> disk. how would the placement API know which one to pick? This is a
>> sorting/weighing decision and thus is something the scheduler is 
>> responsible for.

> I remember having this discussion, and we concluded that a 
> compute node could either have local or shared resources, but not 
> both. There would be a trait to indicate shared disk. Has this 
> changed?

I've always thought we discussed that one of the benefits of this
approach was that it _could_ have both. Maybe we said "initially we
won't implement stuff so it can have both" but I think the plan has been
that we'd be able to support it.

>>> * We already have the information the filter scheduler needs now
>>>  by some other means, right?  What are the reasons we don't want
>>>  to use that anymore?
>> 
>> The filter scheduler has most of the information, yes. What it 
>> doesn't have is the *identifier* (UUID) for things like SRIOV PFs 
>> or NUMA cells that the Placement API will use to distinguish 
>> between things. In other words, the filter scheduler currently does
>> things like unpack a NUMATopology object into memory and determine
>> a NUMA cell to place an instance to. However, it has no concept
>> that that NUMA cell is (or will soon be once 
>> nested-resource-providers is done) a resource provider in the 
>> placement API. Same for SRIOV PFs. Same for VGPUs. Same for FPGAs,
>>  etc. That's why we need to return information to the scheduler 
>> from the placement API that will allow the scheduler to understand 
>> "hey, this NUMA cell on compute node X is resource provider 
>> $UUID".

Why shouldn't the scheduler know those relationships? You were the one (well,
one of them :P) that specifically wanted to teach the nova scheduler to
be in the business of arranging and making claims (allocations) against
placement before returning. Why should some parts of the scheduler know
about resource providers, but not others? And how would the scheduler be
able to make the proper decisions (which require knowledge of
hierarchical relationships) without that knowledge? I'm sure I'm missing
something obvious, so please correct me.

IMHO, the scheduler should eventually evolve into a thing that mostly
deals in the currency of placement, translating those into nova concepts
where needed to avoid placement having to know anything about them.
In other words, I would expect to be able to explain the purpose of the
scheduler as "applies nova-specific logic to the generic resources that
placement says are _valid_, with the goal of determining which one is
_best_".

--Dan

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-09 Thread Ed Leafe
On Jun 9, 2017, at 4:35 PM, Jay Pipes  wrote:

>> We can declare that allocating for shared disk is fairly deterministic
>> if we assume that any given compute node is only associated with one
>> shared disk provider.
> 
> a) we can't assume that
> b) a compute node could very well have both local disk and shared disk. how 
> would the placement API know which one to pick? This is a sorting/weighing 
> decision and thus is something the scheduler is responsible for.

I remember having this discussion, and we concluded that a compute node could 
either have local or shared resources, but not both. There would be a trait to 
indicate shared disk. Has this changed?

>> * We already have the information the filter scheduler needs now by
>>  some other means, right?  What are the reasons we don't want to
>>  use that anymore?
> 
> The filter scheduler has most of the information, yes. What it doesn't have 
> is the *identifier* (UUID) for things like SRIOV PFs or NUMA cells that the 
> Placement API will use to distinguish between things. In other words, the 
> filter scheduler currently does things like unpack a NUMATopology object into 
> memory and determine a NUMA cell to place an instance to. However, it has no 
> concept that that NUMA cell is (or will soon be once 
> nested-resource-providers is done) a resource provider in the placement API. 
> Same for SRIOV PFs. Same for VGPUs. Same for FPGAs, etc. That's why we need 
> to return information to the scheduler from the placement API that will allow 
> the scheduler to understand "hey, this NUMA cell on compute node X is 
> resource provider $UUID".

I guess that this was the point that confused me. The RP uuid is part of the 
provider: the compute node's uuid, and (after 
https://review.openstack.org/#/c/469147/ merges) the PCI device's uuid. So in 
the code that passes the PCI device information to the scheduler, we could add 
that new uuid field, and then the scheduler would have the information to a) 
select the best fit and then b) claim it with the specific uuid. Same for all 
the other nested/shared devices.

I don't mean to belabor this, but to my mind this seems a lot less disruptive 
to the existing code.


-- Ed Leafe







__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-09 Thread Jay Pipes

Sorry, been in a three-hour meeting. Comments inline...

On 06/06/2017 10:56 AM, Chris Dent wrote:

On Mon, 5 Jun 2017, Ed Leafe wrote:


One proposal is to essentially use the same logic in placement
that was used to include that host in those matching the
requirements. In other words, when it tries to allocate the amount
of disk, it would determine that that host is in a shared storage
aggregate, and be smart enough to allocate against that provider.
This was referred to in our discussion as "Plan A".


What would help for me is greater explanation of if and if so, how and
why, "Plan A" doesn't work for nested resource providers.


We'd have to add all the sorting/weighing logic from the existing 
scheduler into the Placement API. Otherwise, the Placement API won't 
understand which child provider to pick out of many providers that meet 
resource/trait requirements.



We can declare that allocating for shared disk is fairly deterministic
if we assume that any given compute node is only associated with one
shared disk provider.


a) we can't assume that
b) a compute node could very well have both local disk and shared disk. 
how would the placement API know which one to pick? This is a 
sorting/weighing decision and thus is something the scheduler is 
responsible for.



My understanding is this determinism is not the case with nested
resource providers because there's some fairly late in the game
choosing of which pci device or which numa cell is getting used.
The existing resource tracking doesn't have this problem because the
claim of those resources is made very late in the game. <- Is this
correct?


No, it's not about determinism or how late in the game a claim decision 
is made. It's really just that the scheduler is the thing that does 
sorting/weighing, not the placement API. We made this decision due to 
the operator feedback that they were not willing to give up their 
ability to add custom weighers and be able to have scheduling policies 
that rely on transient data like thermal metrics collection.



The problem comes into play when we want to claim from the scheduler
(or conductor). Additional information is required to choose which
child providers to use. <- Is this correct?


Correct.


Plan B overcomes the information deficit by including more
information in the response from placement (as straw-manned in the
etherpad [1]) allowing code in the filter scheduler to make accurate
claims. <- Is this correct?


Partly, yes. But, more than anything it's about the placement API 
returning resource provider UUIDs for child providers and sharing 
providers so that the scheduler, when it picks one of those SRIOV 
physical functions, or NUMA cells, or shared storage pools, has the 
identifier with which to tell the placement API "ok, claim *this* 
resource against *this* provider".
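
Concretely, the claim the scheduler sends back is just an allocations body that
names the chosen providers, something along these lines (keys and UUIDs are
illustrative; the exact shape depends on the microversion):

    # Sketch of a request body for /allocations/{consumer_uuid}: each piece of
    # the claim targets a specific provider UUID, including child and sharing
    # providers.
    claim = {
        'allocations': [
            {'resource_provider': {'uuid': 'COMPUTE-NODE-RP-UUID'},
             'resources': {'VCPU': 4, 'MEMORY_MB': 8192}},
            {'resource_provider': {'uuid': 'SRIOV-PF-RP-UUID'},
             'resources': {'SRIOV_NET_VF': 1}},
            {'resource_provider': {'uuid': 'SHARED-STORAGE-RP-UUID'},
             'resources': {'DISK_GB': 100}},
        ],
    }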



* We already have the information the filter scheduler needs now by
  some other means, right?  What are the reasons we don't want to
  use that anymore?


The filter scheduler has most of the information, yes. What it doesn't 
have is the *identifier* (UUID) for things like SRIOV PFs or NUMA cells 
that the Placement API will use to distinguish between things. In other 
words, the filter scheduler currently does things like unpack a 
NUMATopology object into memory and determine a NUMA cell to place an 
instance to. However, it has no concept that that NUMA cell is (or will 
soon be once nested-resource-providers is done) a resource provider in 
the placement API. Same for SRIOV PFs. Same for VGPUs. Same for FPGAs, 
etc. That's why we need to return information to the scheduler from the 
placement API that will allow the scheduler to understand "hey, this 
NUMA cell on compute node X is resource provider $UUID".
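
Put another way, the mapping the scheduler is missing looks roughly like this
(a sketch; UUIDs and names are placeholders):

    # The nested-resource-providers view of a compute node, where the NUMA
    # cells and SR-IOV PFs the scheduler already reasons about are themselves
    # providers with UUIDs it could claim against.
    provider_tree = {
        'uuid': 'COMPUTE-NODE-RP-UUID', 'name': 'compute-node-x',
        'children': [
            {'uuid': 'NUMA-0-RP-UUID', 'name': 'numa0',
             'inventories': {'VCPU': 12, 'MEMORY_MB': 65536},
             'children': [
                 {'uuid': 'PF-0-RP-UUID', 'name': 'enp129s0f0',
                  'inventories': {'SRIOV_NET_VF': 8}}]},
            {'uuid': 'NUMA-1-RP-UUID', 'name': 'numa1',
             'inventories': {'VCPU': 12, 'MEMORY_MB': 65536}},
        ],
    }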



* Part of the reason for having nested resource providers is because
  it can allow affinity/anti-affinity below the compute node (e.g.,
  workloads on the same host but different numa cells).


Mmm, kinda, yeah.

  If I
  remember correctly, the modelling and tracking of this kind of
  information in this way comes out of the time when we imagined the
  placement service would be doing considerably more filtering than
  is planned now. Plan B appears to be an acknowledgement of "on
  some of this stuff, we can't actually do anything but provide you
  some info, you need to decide".


Not really. Filtering is still going to be done in the placement API. 
It's the thing that says "hey, these providers (or trees of providers) 
meet these resource and trait requirements". The scheduler however is 
what takes that set of filtered providers and does its sorting/weighing 
magic and selects one.


  If that's the case, is the
  topological modelling on the placement DB side of things solely a
  convenient place to store information? If there were some other
  way to model that topology could things currently being considered
  for modelling as nested providers be instead simply modelled as
  inventories of a particular class of resource?

Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-09 Thread Dan Smith
>> My current feeling is that we got ourselves into our existing mess
>> of ugly, convoluted code when we tried to add these complex 
>> relationships into the resource tracker and the scheduler. We set
>> out to create the placement engine to bring some sanity back to how
>> we think about things we need to virtualize.
> 
> Sorry, I completely disagree with your assessment of why the
> placement engine exists. We didn't create it to bring some sanity
> back to how we think about things we need to virtualize. We created
> it to add consistency and structure to the representation of
> resources in the system.
> 
> I don't believe that exposing this structured representation of 
> resources is a bad thing or that it is leaking "implementation
> details" out of the placement API. It's not an implementation detail
> that a resource provider is a child of another or that a different
> resource provider is supplying some resource to a group of other
> providers. That's simply an accurate representation of the underlying
> data structures.

This ^.

With the proposal Jay has up, placement is merely exposing some of its
own data structures to a client that has declared what it wants. The
client has made a request for resources, and placement is returning some
allocations that would be valid. None of them are nova-specific at all
-- they're all data structures that you would pass to and/or retrieve
from placement already.

>> I don't know the answer. I'm hoping that we can have a discussion 
>> that might uncover a clear approach, or, at the very least, one
>> that is less murky than the others.
> 
> I really like Dan's idea of returning a list of HTTP request bodies
> for POST /allocations/{consumer_uuid} calls along with a list of
> provider information that the scheduler can use in its
> sorting/weighing algorithms.
> 
> We've put this straw-man proposal here:
> 
> https://review.openstack.org/#/c/471927/
> 
> I'm hoping to keep the conversation going there.

This is the most clear option that we have, in my opinion. It simplifies
what the scheduler has to do, it simplifies what conductor has to do
during a retry, and it minimizes the amount of work that something else
like cinder would have to do to use placement to schedule resources.
Without this, cinder/neutron/whatever has to know about things like
aggregates and hierarchical relationships between providers in order to
make *any* sane decision about selecting resources. If placement returns
valid options with that stuff figured out, then those services can look
at the bits they care about and make a decision.

I'd really like us to use the existing strawman spec as a place to
iterate on what that API would look like, assuming we're going to go
that route, and work on actual code in both placement and the scheduler
to use it. I'm hoping that doing so will help clarify whether this is
the right approach or not, and whether there are other gotchas that we
don't yet have on our radar. We're rapidly running out of runway for
pike here and I feel like we've got to get moving on this or we're going
to have to punt. Since several other things depend on this work, we need
to consider the impact to a lot of our pike commitments if we're not
able to get something merged.

--Dan

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-09 Thread Jay Pipes

On 06/05/2017 05:22 PM, Ed Leafe wrote:

Another proposal involved a change to how placement responds to the
scheduler. Instead of just returning the UUIDs of the compute nodes
that satisfy the required resources, it would include a whole bunch
of additional information in a structured response. A straw man
example of such a response is here:
https://etherpad.openstack.org/p/placement-allocations-straw-man.
This was referred to as "Plan B".


Actually, this was Plan "C". Plan "B" was to modify the return of the 
GET /resource_providers Placement REST API endpoint.


The main feature of this approach
is that part of that response would be the JSON dict for the
allocation call, containing the specific resource provider UUID for
each resource. This way, when the scheduler selects a host


Important clarification is needed here. The proposal is to have the 
scheduler actually select *more than just the compute host*. The 
scheduler would select the host, any sharing providers and any child 
providers within a host that actually contained the resources/traits 
that the request demanded.


, it would
simply pass that dict back to the /allocations call, and placement
would be able to do the allocations directly against that
information.

There was another issue raised: simply providing the host UUIDs
didn't give the scheduler enough information in order to run its
filters and weighers. Since the scheduler uses those UUIDs to
construct HostState objects, the specific missing information was
never completely clarified, so I'm just including this aspect of the
conversation for completeness. It is orthogonal to the question of
how to allocate when the resource provider is not "simple".


The specific missing information is the following, but not limited to:

* Whether or not a resource can be provided by a sharing provider or a 
"local provider" or either. For example, assume a compute node that is 
associated with a shared storage pool via an aggregate but that also has 
local disk for instances. The Placement API currently returns just the 
compute host UUID but no indication of whether the compute host has 
local disk to consume from, has shared disk to consume from, or both. 
The scheduler is the thing that must weigh these choices and make a 
choice. The placement API gives the scheduler the choices and the 
scheduler makes a decision based on sorting/weighing algorithms.
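
In code terms, for a request that includes DISK_GB the scheduler might be
handed two equally valid options and has to pick one (a sketch only; UUIDs and
amounts are placeholders):

    # Two valid ways to satisfy DISK_GB=100 on the same host; choosing between
    # them is a sorting/weighing decision, so it stays in the scheduler.
    local_disk_option = [
        {'resource_provider': {'uuid': 'COMPUTE-NODE-RP-UUID'},
         'resources': {'VCPU': 2, 'MEMORY_MB': 4096, 'DISK_GB': 100}},
    ]
    shared_disk_option = [
        {'resource_provider': {'uuid': 'COMPUTE-NODE-RP-UUID'},
         'resources': {'VCPU': 2, 'MEMORY_MB': 4096}},
        {'resource_provider': {'uuid': 'SHARED-STORAGE-RP-UUID'},
         'resources': {'DISK_GB': 100}},
    ]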


It is imperative to remember the reason *why* we decided (way back in 
Portland at the Nova mid-cycle last year) to keep sorting/weighing in 
the Nova scheduler. The reason is because operators (and some 
developers) insisted on being able to weigh the possible choices in ways 
that "could not be pre-determined". In other words, folks wanted to keep 
the existing uber-flexibility and customizability that the scheduler 
weighers (and home-grown weigher plugins) currently allow, including 
being able to sort possible compute hosts by such things as the average 
thermal temperature of the power supply the hardware was connected to 
over the last five minutes (I kid you friggin not.)
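
For the curious, a home-grown weigher of that sort looks roughly like the
following sketch against the BaseHostWeigher interface; the temperature lookup
is, of course, made up:

    from nova.scheduler import weights

    def _avg_psu_temp(host):
        # Stand-in for a query against some external metrics store.
        return 42.0

    class PowerSupplyTempWeigher(weights.BaseHostWeigher):
        def weight_multiplier(self):
            return -1.0  # negative: cooler power supplies score higher

        def _weigh_object(self, host_state, weight_properties):
            return _avg_psu_temp(host_state.host)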


* Which SR-IOV physical function should provide an SRIOV_NET_VF 
resource to an instance. Imagine a situation where a compute host has 4 
SR-IOV physical functions, each having some traits representing hardware 
offload support and each having an inventory of 8 SRIOV_NET_VF. 
Currently the scheduler absolutely has the information to pick one of 
these SRIOV physical functions to assign to a workload. What the 
scheduler does *not* have, however, is a way to tell the Placement API 
to consume an SRIOV_NET_VF from that particular physical function. Why? 
Because the scheduler doesn't know that a particular physical function 
even *is* a resource provider in the placement API. *Something* needs to 
inform the scheduler that the physical function is a resource provider 
and has a particular UUID to identify it. This is precisely what the 
proposed GET /allocation_requests HTTP response data provides to the 
scheduler.
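
For the four-PF example, the proposed GET /allocation_requests response would
hand the scheduler something like the following per candidate (purely a straw
man; the real shape is whatever the spec settles on):

    # One candidate for a request of SRIOV_NET_VF=1. The allocation part can be
    # sent straight back as the claim; the summary part carries what the
    # scheduler needs for weighing. UUIDs and trait names are placeholders.
    candidate = {
        'allocations': [
            {'resource_provider': {'uuid': 'PF-2-RP-UUID'},
             'resources': {'SRIOV_NET_VF': 1}},
        ],
        'provider_summaries': {
            'PF-2-RP-UUID': {
                'traits': ['CUSTOM_HW_OFFLOAD'],
                'capacity': {'SRIOV_NET_VF': {'total': 8, 'used': 3}},
            },
        },
    }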



My current feeling is that we got ourselves into our existing mess of
ugly, convoluted code when we tried to add these complex
relationships into the resource tracker and the scheduler. We set out
to create the placement engine to bring some sanity back to how we
think about things we need to virtualize.


Sorry, I completely disagree with your assessment of why the placement 
engine exists. We didn't create it to bring some sanity back to how we 
think about things we need to virtualize. We created it to add 
consistency and structure to the representation of resources in the system.


I don't believe that exposing this structured representation of 
resources is a bad thing or that it is leaking "implementation details" 
out of the placement API. It's not an implementation detail that a 
resource provider is a child of another or that a different resource 
provider is supplying some resource to a group of other providers. 
That's simply an accurate representation of the underlying data structures.

Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-08 Thread Edward Leafe
Sorry for the top-post, but it seems that nobody has responded to this, and 
there are a lot of important questions that need answers. So I'm simply 
re-posting this so that we don't get too far ahead of ourselves by planning 
implementations before we fully understand the problem and the implications of 
any proposed solution.


-- Ed Leafe


> On Jun 6, 2017, at 9:56 AM, Chris Dent  wrote:
> 
> On Mon, 5 Jun 2017, Ed Leafe wrote:
> 
>> One proposal is to essentially use the same logic in placement
>> that was used to include that host in those matching the
>> requirements. In other words, when it tries to allocate the amount
>> of disk, it would determine that that host is in a shared storage
>> aggregate, and be smart enough to allocate against that provider.
>> This was referred to in our discussion as "Plan A".
> 
> What would help for me is greater explanation of if and if so, how and
> why, "Plan A" doesn't work for nested resource providers.
> 
> We can declare that allocating for shared disk is fairly deterministic
> if we assume that any given compute node is only associated with one
> shared disk provider.
> 
> My understanding is this determinism is not the case with nested
> resource providers because there's some fairly late in the game
> choosing of which pci device or which numa cell is getting used.
> The existing resource tracking doesn't have this problem because the
> claim of those resources is made very late in the game. <- Is this
> correct?
> 
> The problem comes into play when we want to claim from the scheduler
> (or conductor). Additional information is required to choose which
> child providers to use. <- Is this correct?
> 
> Plan B overcomes the information deficit by including more
> information in the response from placement (as straw-manned in the
> etherpad [1]) allowing code in the filter scheduler to make accurate
> claims. <- Is this correct?
> 
> For clarity and completeness in the discussion some questions for
> which we have explicit answers would be useful. Some of these may
> appear ignorant or obtuse and are mostly things we've been over
> before. The goal is to draw out some clear statements in the present
> day to be sure we are all talking about the same thing (or get us
> there if not) modified for what we know now, compared to what we
> knew a week or month ago.
> 
> * We already have the information the filter scheduler needs now by
>  some other means, right?  What are the reasons we don't want to
>  use that anymore?
> 
> * Part of the reason for having nested resource providers is because
>  it can allow affinity/anti-affinity below the compute node (e.g.,
>  workloads on the same host but different numa cells). If I
>  remember correctly, the modelling and tracking of this kind of
>  information in this way comes out of the time when we imagined the
>  placement service would be doing considerably more filtering than
>  is planned now. Plan B appears to be an acknowledgement of "on
>  some of this stuff, we can't actually do anything but provide you
>  some info, you need to decide". If that's the case, is the
>  topological modelling on the placement DB side of things solely a
>  convenient place to store information? If there were some other
>  way to model that topology could things currently being considered
>  for modelling as nested providers be instead simply modelled as
>  inventories of a particular class of resource?
>  (I'm not suggesting we do this, rather that the answer that says
>  why we don't want to do this is useful for understanding the
>  picture.)
> 
> * Does a claim made in the scheduler need to be complete? Is there
>  value in making a partial claim from the scheduler that consumes a
>  vcpu and some ram, and then in the resource tracker is corrected
>  to consume a specific pci device, numa cell, gpu and/or fpga?
>  Would this be better or worse than what we have now? Why?
> 
> * What is lacking in placement's representation of resource providers
>  that makes it difficult or impossible for an allocation against a
>  parent provider to be able to determine the correct child
>  providers to which to cascade some of the allocation? (And by
>  extension make the earlier scheduling decision.)
> 
> That's a start. With answers to at last some of these questions I
> think the straw man in the etherpad can be more effectively
> evaluated. As things stand right now it is a proposed solution
> without a clear problem statement. I feel like we could do with a
> more clear problem statement.
> 
> Thanks.
> 
> [1] https://etherpad.openstack.org/p/placement-allocations-straw-man
> 
> -- 
> Chris Dent  ┬──┬◡ノ(° -°ノ)   https://anticdent.org/
> freenode: cdent tw: @anticdent

Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-07 Thread Edward Leafe
On Jun 7, 2017, at 1:44 PM, Mooney, Sean K wrote:

> [Mooney, Sean K] neutron will need to use nested resource providers to track
> network backend specific consumable resources in the future also. One example
> is hardware offloaded virtual (e.g. virtio/vhost-user) interfaces, which due to
> their hardware based implementation are both finite consumable resources and
> have NUMA affinity, and therefore need to be tracked as nested.
> 
> Another example for neutron would be bandwidth based scheduling / SLA
> enforcement, where we want to guarantee that a specific bandwidth is available
> on the selected host for a vm to consume. From an ovs/vpp/linux bridge
> perspective this would likely be tracked at the physnet level, so when
> selecting a host we would want to ensure that the physnet is both available
> from the host and has enough bandwidth available to reserve for the instance.


OK, thanks, this is excellent information.

New question: will the placement service always be able to pick an acceptable 
provider, given that the request needs X amount of bandwidth? IOW, are 
there other considerations besides quantitative amount (and possibly traits for 
qualitative concerns) that placement simply doesn’t know about? The example I 
have in mind is the case of stack vs. spread, where there are a few available 
providers that can meet the request. The logic for which one to pick can’t be 
in placement, though, as it’s a detail of the calling service. In the case of 
Nova, the assignment of VFs on vNICs usually should be spread, but that is not 
what placement knows, it’s handled by filters/weighers in Nova’s scheduler.

OK, that was a really long way of asking: will Neutron ever need to be able to 
determine the “best” choice from a selection of resource providers? Or will the 
fact that a resource provider has enough of a given resource be all that is 
needed?


-- Ed Leafe





__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-07 Thread Mooney, Sean K


> -Original Message-
> From: Jay Pipes [mailto:jaypi...@gmail.com]
> Sent: Wednesday, June 7, 2017 6:47 PM
> To: openstack-dev@lists.openstack.org
> Subject: Re: [openstack-dev] [nova][scheduler][placement] Allocating
> Complex Resources
> 
> On 06/07/2017 01:00 PM, Edward Leafe wrote:
> > On Jun 6, 2017, at 9:56 AM, Chris Dent wrote:
> >>
> >> For clarity and completeness in the discussion some questions for
> >> which we have explicit answers would be useful. Some of these may
> >> appear ignorant or obtuse and are mostly things we've been over
> >> before. The goal is to draw out some clear statements in the present
> >> day to be sure we are all talking about the same thing (or get us
> >> there if not) modified for what we know now, compared to what we
> knew
> >> a week or month ago.
> >
> > One other question that came up: do we have any examples of any
> > service (such as Neutron or Cinder) that would require the modeling
> > for nested providers? Or is this confined to Nova?
> 
> The Cyborg project (accelerators like FPGAs and some vGPUs) need nested
> resource providers to model the relationship between a virtual resource
> context against an accelerator and the compute node itself.
[Mooney, Sean K] neutron will need to use nested resource providers to track
network backend specific consumable resources in the future also. One example
is hardware offloaded virtual (e.g. virtio/vhost-user) interfaces, which due to
their hardware based implementation are both finite consumable resources and
have NUMA affinity, and therefore need to be tracked as nested.

Another example for neutron would be bandwidth based scheduling / SLA
enforcement, where we want to guarantee that a specific bandwidth is available
on the selected host for a vm to consume. From an ovs/vpp/linux bridge
perspective this would likely be tracked at the physnet level, so when
selecting a host we would want to ensure that the physnet is both available
from the host and has enough bandwidth available to reserve for the instance.

Today nova and neutron do not track either of the above, but at least the latter
has been started in the SR-IOV context without placement, and should be extended
to other non-SR-IOV backends.
Snabb switch actually supports this already with vendor extensions via the
neutron binding:profile:
https://github.com/snabbco/snabb/blob/b7d6d77ba5fd6a6b9306f92466c1779bba2caa31/src/program/snabbnfv/doc/neutron-api-extensions.md#bandwidth-reservation
but nova is not aware of the capacity or availability info when placing the
instance, so if the host cannot fulfill the request they degrade to the least
oversubscribed port:
https://github.com/snabbco/snabb-neutron/blob/master/snabb_neutron/mechanism_snabb.py#L194-L200

With nested resource providers they could harden this request from best effort
to a guaranteed bandwidth reservation, by informing the placement API of the
bandwidth availability of the physical interface and also of the NUMA affinity
of the interfaces, by creating a nested resource provider.
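
Roughly what I have in mind (a sketch only; the resource class and trait names
are invented):

    # A physnet-facing NIC modelled as a child provider of the compute node,
    # with a bandwidth inventory that a port's allocation could consume from.
    physnet_provider = {
        'name': 'compute1:physnet0',
        'parent_provider_uuid': 'COMPUTE-NODE-RP-UUID',
        'inventories': {'CUSTOM_NET_BANDWIDTH_MBPS': {'total': 10000}},
        'traits': ['CUSTOM_PHYSNET_PHYSNET0'],
    }
    # A port asking for a 2 Gbps guarantee would then add
    # {'CUSTOM_NET_BANDWIDTH_MBPS': 2000} against this provider in the
    # instance's allocation.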

> 
> Best,
> -jay
> 
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-07 Thread Jay Pipes

On 06/07/2017 01:00 PM, Edward Leafe wrote:

On Jun 6, 2017, at 9:56 AM, Chris Dent wrote:


For clarity and completeness in the discussion some questions for
which we have explicit answers would be useful. Some of these may
appear ignorant or obtuse and are mostly things we've been over
before. The goal is to draw out some clear statements in the present
day to be sure we are all talking about the same thing (or get us
there if not) modified for what we know now, compared to what we
knew a week or month ago.


One other question that came up: do we have any examples of any service
(such as Neutron or Cinder) that would require the modeling for nested
providers? Or is this confined to Nova?


The Cyborg project (accelerators like FPGAs and some vGPUs) need nested 
resource providers to model the relationship between a virtual resource 
context against an accelerator and the compute node itself.


Best,
-jay

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-07 Thread Edward Leafe
On Jun 6, 2017, at 9:56 AM, Chris Dent  wrote:
> 
> For clarity and completeness in the discussion some questions for
> which we have explicit answers would be useful. Some of these may
> appear ignorant or obtuse and are mostly things we've been over
> before. The goal is to draw out some clear statements in the present
> day to be sure we are all talking about the same thing (or get us
> there if not) modified for what we know now, compared to what we
> knew a week or month ago.


One other question that came up: do we have any examples of any service (such 
as Neutron or Cinder) that would require the modeling for nested providers? Or 
is this confined to Nova?


-- Ed Leafe





__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-06 Thread Chris Dent

On Mon, 5 Jun 2017, Ed Leafe wrote:


One proposal is to essentially use the same logic in placement
that was used to include that host in those matching the
requirements. In other words, when it tries to allocate the amount
of disk, it would determine that that host is in a shared storage
aggregate, and be smart enough to allocate against that provider.
This was referred to in our discussion as "Plan A".


What would help for me is greater explanation of if and if so, how and
why, "Plan A" doesn't work for nested resource providers.

We can declare that allocating for shared disk is fairly deterministic
if we assume that any given compute node is only associated with one
shared disk provider.

My understanding is this determinism is not the case with nested
resource providers because there's some fairly late in the game
choosing of which pci device or which numa cell is getting used.
The existing resource tracking doesn't have this problem because the
claim of those resources is made very late in the game. <- Is this
correct?

The problem comes into play when we want to claim from the scheduler
(or conductor). Additional information is required to choose which
child providers to use. <- Is this correct?

Plan B overcomes the information deficit by including more
information in the response from placement (as straw-manned in the
etherpad [1]) allowing code in the filter scheduler to make accurate
claims. <- Is this correct?

For clarity and completeness in the discussion some questions for
which we have explicit answers would be useful. Some of these may
appear ignorant or obtuse and are mostly things we've been over
before. The goal is to draw out some clear statements in the present
day to be sure we are all talking about the same thing (or get us
there if not) modified for what we know now, compared to what we
knew a week or month ago.

* We already have the information the filter scheduler needs now by
  some other means, right?  What are the reasons we don't want to
  use that anymore?

* Part of the reason for having nested resource providers is because
  it can allow affinity/anti-affinity below the compute node (e.g.,
  workloads on the same host but different numa cells). If I
  remember correctly, the modelling and tracking of this kind of
  information in this way comes out of the time when we imagined the
  placement service would be doing considerably more filtering than
  is planned now. Plan B appears to be an acknowledgement of "on
  some of this stuff, we can't actually do anything but provide you
  some info, you need to decide". If that's the case, is the
  topological modelling on the placement DB side of things solely a
  convenient place to store information? If there were some other
  way to model that topology could things currently being considered
  for modelling as nested providers be instead simply modelled as
  inventories of a particular class of resource?
  (I'm not suggesting we do this, rather that the answer that says
  why we don't want to do this is useful for understanding the
  picture.)

* Does a claim made in the scheduler need to be complete? Is there
  value in making a partial claim from the scheduler that consumes a
  vcpu and some ram, and then in the resource tracker is corrected
  to consume a specific pci device, numa cell, gpu and/or fpga?
  Would this be better or worse than what we have now? Why?

* What is lacking in placement's representation of resource providers
  that makes it difficult or impossible for an allocation against a
  parent provider to be able to determine the correct child
  providers to which to cascade some of the allocation? (And by
  extension make the earlier scheduling decision.)

That's a start. With answers to at last some of these questions I
think the straw man in the etherpad can be more effectively
evaluated. As things stand right now it is a proposed solution
without a clear problem statement. I feel like we could do with a
more clear problem statement.

Thanks.

[1] https://etherpad.openstack.org/p/placement-allocations-straw-man

--
Chris Dent  ┬──┬◡ノ(° -°ノ)   https://anticdent.org/
freenode: cdent tw: @anticdent
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-06 Thread Sylvain Bauza


On 06/06/2017 15:03, Edward Leafe wrote:
> On Jun 6, 2017, at 4:14 AM, Sylvain Bauza wrote:
>>
>> The Plan A option you mention hides the complexity of the
>> shared/non-shared logic but to the price that it would make scheduling
>> decisions on those criteries impossible unless you put
>> filtering/weighting logic into Placement, which AFAIK we strongly
>> disagree with.
> 
> Not necessarily. Well, not now, for sure, but that’s why we need Traits
> to be integrated into Flavors as soon as possible so that we can make
> requests with qualitative requirements, not just quantitative. When that
> work is done, we can add traits to differentiate local from shared
> storage, just like we have traits to distinguish HDD from SSD. So if a
> VM with only local disk is needed, that will be in the request, and
> placement will never return hosts with shared storage. 
> 

Well, there is a big difference between defining constraints in
flavors and making a general constraint on a filter basis, which is
opt-able by config.

Operators could claim that they would need to update all their N flavors
in order to achieve a strict separation for not-shared-with resource
providers, which would leak into users seeing flavors that differ only in
that aspect.

I'm not saying it's bad to mark traits in flavor extra specs (sometimes
that's fine), but I do worry about the flavor count explosion if we begin
putting all the filtering logic into extra specs, plus the fact that it
can't be config-managed the way filters are at the moment.

-Sylvain

> -- Ed Leafe
> 
> 
> 
> 
> 
> 
> 
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> 

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-06 Thread Edward Leafe
On Jun 6, 2017, at 4:14 AM, Sylvain Bauza  wrote:
> 
> The Plan A option you mention hides the complexity of the
> shared/non-shared logic but to the price that it would make scheduling
> decisions on those criteries impossible unless you put
> filtering/weighting logic into Placement, which AFAIK we strongly
> disagree with.


Not necessarily. Well, not now, for sure, but that’s why we need Traits to be 
integrated into Flavors as soon as possible so that we can make requests with 
qualitative requirements, not just quantitative. When that work is done, we can 
add traits to differentiate local from shared storage, just like we have traits 
to distinguish HDD from SSD. So if a VM with only local disk is needed, that 
will be in the request, and placement will never return hosts with shared 
storage. 
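
Something like this is what I mean (a sketch; the "trait:...=required" extra-spec
syntax and the custom trait name are assumptions about how the traits-in-flavors
work might surface this):

    # A flavor whose extra specs require SSD-backed, non-shared disk via traits.
    # STORAGE_DISK_SSD is a standard os-traits name; CUSTOM_LOCAL_DISK is made up.
    flavor_extra_specs = {
        'trait:STORAGE_DISK_SSD': 'required',
        'trait:CUSTOM_LOCAL_DISK': 'required',
    }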

-- Ed Leafe





__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-06 Thread Sylvain Bauza



On 05/06/2017 23:22, Ed Leafe wrote:
> We had a very lively discussion this morning during the Scheduler
> subteam meeting, which was continued in a Google hangout. The
> subject was how to handle claiming resources when the Resource
> Provider is not "simple". By "simple", I mean a compute node that
> provides all of the resources itself, as contrasted with a compute
> node that uses a shared storage for disk space, or which has
> complex nested relationships with things such as PCI devices or
> NUMA nodes. The current situation is as follows:
> 
> a) scheduler gets a request with certain resource requirements
> (RAM, disk, CPU, etc.) b) scheduler passes these resource
> requirements to placement, which returns a list of hosts (compute
> nodes) that can satisfy the request. c) scheduler runs these
> through some filters and weighers to get a list ordered by best
> "fit" d) it then tries to claim the resources, by posting to
> placement allocations for these resources against the selected
> host e) once the allocation succeeds, scheduler returns that host
> to conductor to then have the VM built
> 
> (some details for edge cases left out for clarity of the overall
> process)
> 
> The problem we discussed comes into play when the compute node
> isn't the actual provider of the resources. The easiest example to
> consider is when the computes are associated with a shared storage
> provider. The placement query is smart enough to know that even if
> the compute node doesn't have enough local disk, it will get it
> from the shared storage, so it will return that host in step b)
> above. If the scheduler then chooses that host, when it tries to
> claim it, it will pass the resources and the compute node UUID back
> to placement to make the allocations. This is the point where the
> current code would fall short: somehow, placement needs to know to
> allocate the disk requested against the shared storage provider,
> and not the compute node.
> 
> One proposal is to essentially use the same logic in placement that
> was used to include that host in those matching the requirements.
> In other words, when it tries to allocate the amount of disk, it
> would determine that that host is in a shared storage aggregate,
> and be smart enough to allocate against that provider. This was
> referred to in our discussion as "Plan A".
> 
> Another proposal involved a change to how placement responds to the
> scheduler. Instead of just returning the UUIDs of the compute nodes
> that satisfy the required resources, it would include a whole bunch
> of additional information in a structured response. A straw man
> example of such a response is here:
> https://etherpad.openstack.org/p/placement-allocations-straw-man.
> This was referred to as "Plan B". The main feature of this approach
> is that part of that response would be the JSON dict for the
> allocation call, containing the specific resource provider UUID for
> each resource. This way, when the scheduler selects a host, it
> would simply pass that dict back to the /allocations call, and
> placement would be able to do the allocations directly against that
> information.
> 
> There was another issue raised: simply providing the host UUIDs
> didn't give the scheduler enough information in order to run its
> filters and weighers. Since the scheduler uses those UUIDs to
> construct HostState objects, the specific missing information was
> never completely clarified, so I'm just including this aspect of
> the conversation for completeness. It is orthogonal to the question
> of how to allocate when the resource provider is not "simple".
> 
> My current feeling is that we got ourselves into our existing mess
> of ugly, convoluted code when we tried to add these complex
> relationships into the resource tracker and the scheduler. We set
> out to create the placement engine to bring some sanity back to how
> we think about things we need to virtualize. I would really hate to
> see us make the same mistake again, by adding a good deal of
> complexity to handle a few non-simple cases. What I would like to
> avoid, no matter what the eventual solution chosen, is representing
> this complexity in multiple places. Currently the only two
> candidates for this logic are the placement engine, which knows
> about these relationships already, or the compute service itself,
> which has to handle the management of these complex virtualized
> resources.
> 
> I don't know the answer. I'm hoping that we can have a discussion
> that might uncover a clear approach, or, at the very least, one
> that is less murky than the others.
> 

I wasn't part of either the scheduler meeting or the hangout (hit by a
French holiday), so I don't have all the details in mind and I could
probably make wrong assumptions; I apologize in advance if I'm saying
anything silly.

That said, I still have some opinions and I'll put them here. Thanks
for having brought up

[openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-05 Thread Ed Leafe
We had a very lively discussion this morning during the Scheduler subteam 
meeting, which was continued in a Google hangout. The subject was how to handle 
claiming resources when the Resource Provider is not "simple". By "simple", I 
mean a compute node that provides all of the resources itself, as contrasted 
with a compute node that uses a shared storage for disk space, or which has 
complex nested relationships with things such as PCI devices or NUMA nodes. The 
current situation is as follows:

a) scheduler gets a request with certain resource requirements (RAM, disk, CPU, 
etc.)
b) scheduler passes these resource requirements to placement, which returns a 
list of hosts (compute nodes) that can satisfy the request.
c) scheduler runs these through some filters and weighers to get a list ordered 
by best "fit"
d) it then tries to claim the resources, by posting to placement allocations 
for these resources against the selected host
e) once the allocation succeeds, scheduler returns that host to conductor to 
then have the VM built

(some details for edge cases left out for clarity of the overall process)

The problem we discussed comes into play when the compute node isn't the actual 
provider of the resources. The easiest example to consider is when the computes 
are associated with a shared storage provider. The placement query is smart 
enough to know that even if the compute node doesn't have enough local disk, it 
will get it from the shared storage, so it will return that host in step b) 
above. If the scheduler then chooses that host, when it tries to claim it, it 
will pass the resources and the compute node UUID back to placement to make the 
allocations. This is the point where the current code would fall short: 
somehow, placement needs to know to allocate the disk requested against the 
shared storage provider, and not the compute node.
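
To make the gap concrete, here is roughly the difference between what the
scheduler can post today and what placement actually needs to record (UUIDs and
amounts are placeholders):

    # Today's claim lumps everything onto the compute node, but the disk should
    # really be allocated against the shared storage provider.
    what_scheduler_posts_today = [
        {'resource_provider': {'uuid': 'COMPUTE-NODE-UUID'},
         'resources': {'VCPU': 2, 'MEMORY_MB': 4096, 'DISK_GB': 100}},
    ]
    what_placement_needs = [
        {'resource_provider': {'uuid': 'COMPUTE-NODE-UUID'},
         'resources': {'VCPU': 2, 'MEMORY_MB': 4096}},
        {'resource_provider': {'uuid': 'SHARED-STORAGE-PROVIDER-UUID'},
         'resources': {'DISK_GB': 100}},
    ]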

One proposal is to essentially use the same logic in placement that was used to 
include that host in those matching the requirements. In other words, when it 
tries to allocate the amount of disk, it would determine that that host is in a 
shared storage aggregate, and be smart enough to allocate against that 
provider. This was referred to in our discussion as "Plan A".

Another proposal involved a change to how placement responds to the scheduler. 
Instead of just returning the UUIDs of the compute nodes that satisfy the 
required resources, it would include a whole bunch of additional information in 
a structured response. A straw man example of such a response is here: 
https://etherpad.openstack.org/p/placement-allocations-straw-man. This was 
referred to as "Plan B". The main feature of this approach is that part of that 
response would be the JSON dict for the allocation call, containing the 
specific resource provider UUID for each resource. This way, when the scheduler 
selects a host, it would simply pass that dict back to the /allocations call, 
and placement would be able to do the allocations directly against that 
information.

There was another issue raised: simply providing the host UUIDs didn't give the 
scheduler enough information in order to run its filters and weighers. Since 
the scheduler uses those UUIDs to construct HostState objects, the specific 
missing information was never completely clarified, so I'm just including this 
aspect of the conversation for completeness. It is orthogonal to the question 
of how to allocate when the resource provider is not "simple".

My current feeling is that we got ourselves into our existing mess of ugly, 
convoluted code when we tried to add these complex relationships into the 
resource tracker and the scheduler. We set out to create the placement engine 
to bring some sanity back to how we think about things we need to virtualize. I 
would really hate to see us make the same mistake again, by adding a good deal 
of complexity to handle a few non-simple cases. What I would like to avoid, no 
matter what the eventual solution chosen, is representing this complexity in 
multiple places. Currently the only two candidates for this logic are the 
placement engine, which knows about these relationships already, or the compute 
service itself, which has to handle the management of these complex virtualized 
resources.

I don't know the answer. I'm hoping that we can have a discussion that might 
uncover a clear approach, or, at the very least, one that is less murky than 
the others.


-- Ed Leafe







__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev