Re: [openstack-dev] [Ironic] [Nova] continuing the "multiple compute host" discussion

2015-12-16 Thread James Penick
>Affinity is mostly meaningless with baremetal. It's entirely a
>virtualization related thing. If you try and group things by TOR, or
>chassis, or anything else, it's going to start meaning something entirely
>different than it means in Nova,

I disagree; in fact, we need TOR and power affinity/anti-affinity for VMs
as well as baremetal. As an example, there are cases where certain compute
resources move significant amounts of data between one or two other
instances, but you want to ensure those instances are not on the same
hypervisor. In that scenario it makes sense to have instances on different
hypervisors, but on the same TOR to reduce unnecessary traffic across the
fabric.

>and it would probably be better to just
>make lots of AZ's and have users choose their AZ mix appropriately,
>since that is the real meaning of AZ's.

Yes, at some level certain things should be expressed in the form of an AZ;
power seems like a good candidate for that. But expressing something like
a TOR as an AZ in an environment with hundreds of thousands of physical
hosts would not scale. Further, it would require users to have a deeper
understanding of datacenter topology, which is exactly the opposite of why
IaaS exists.

The whole point of a service-oriented infrastructure is to be able to give
the end user the ability to boot compute resources that match a variety of
constraints, and have those resources selected and provisioned for them. IE
"Give me 12 instances of m1.blah, all running Linux, and make sure they're
spread across 6 different TORs and 2 different power domains in network
zone Blah."







On Wed, Dec 16, 2015 at 10:38 AM, Clint Byrum  wrote:

> Excerpts from Jim Rollenhagen's message of 2015-12-16 08:03:22 -0800:
> > Nobody is talking about running a compute per flavor or capability. All
> > compute hosts will be able to handle all ironic nodes. We *do* still
> > need to figure out how to handle availability zones or host aggregates,
> > but I expect we would pass along that data to be matched against. I
> > think it would just be metadata on a node. Something like
> > node.properties['availability_zone'] = 'rackspace-iad-az3' or what have
> > you. Ditto for host aggregates - add the metadata to ironic to match
> > what's in the host aggregate. I'm honestly not sure what to do about
> > (anti-)affinity filters; we'll need help figuring that out.
> >
>
> Affinity is mostly meaningless with baremetal. It's entirely a
> virtualization related thing. If you try and group things by TOR, or
> chassis, or anything else, it's going to start meaning something entirely
> different than it means in Nova, and it would probably be better to just
> make lots of AZ's and have users choose their AZ mix appropriately,
> since that is the real meaning of AZ's.
>


Re: [openstack-dev] [Ironic] [Nova] continuing the "multiple compute host" discussion

2015-12-16 Thread James Penick
Someone else expressed this more gracefully than I:

'Because sans Ironic, compute-nodes still have physical characteristics
that make grouping on them attractive for things like anti-affinity. I
don't really want my HA instances "not on the same compute node", I want
them "not in the same failure domain". It becomes a way for all
OpenStack workloads to have more granularity than "availability zone".'
(https://www.mail-archive.com/openstack-dev@lists.openstack.org/msg14891.html)

^That guy definitely has a good head on his shoulders ;)

-James


On Wed, Dec 16, 2015 at 12:40 PM, James Penick  wrote:

> >Affinity is mostly meaningless with baremetal. It's entirely a
> >virtualization related thing. If you try and group things by TOR, or
> >chassis, or anything else, it's going to start meaning something entirely
> >different than it means in Nova,
>
> I disagree; in fact, we need TOR and power affinity/anti-affinity for VMs
> as well as baremetal. As an example, there are cases where certain compute
> resources move significant amounts of data between one or two other
> instances, but you want to ensure those instances are not on the same
> hypervisor. In that scenario it makes sense to have instances on different
> hypervisors, but on the same TOR to reduce unnecessary traffic across the
> fabric.
>
> >and it would probably be better to just
> >make lots of AZ's and have users choose their AZ mix appropriately,
> >since that is the real meaning of AZ's.
>
> Yes, at some level certain things should be expressed in the form of an
> AZ; power seems like a good candidate for that. But expressing something
> like a TOR as an AZ in an environment with hundreds of thousands of
> physical hosts would not scale. Further, it would require users to have a
> deeper understanding of datacenter topology, which is exactly the opposite
> of why IaaS exists.
>
> The whole point of a service-oriented infrastructure is to be able to give
> the end user the ability to boot compute resources that match a variety of
> constraints, and have those resources selected and provisioned for them. IE
> "Give me 12 instances of m1.blah, all running Linux, and make sure they're
> spread across 6 different TORs and 2 different power domains in network
> zone Blah."
>
>
>
>
>
>
>
> On Wed, Dec 16, 2015 at 10:38 AM, Clint Byrum  wrote:
>
>> Excerpts from Jim Rollenhagen's message of 2015-12-16 08:03:22 -0800:
>> > Nobody is talking about running a compute per flavor or capability. All
>> > compute hosts will be able to handle all ironic nodes. We *do* still
>> > need to figure out how to handle availability zones or host aggregates,
>> > but I expect we would pass along that data to be matched against. I
>> > think it would just be metadata on a node. Something like
>> > node.properties['availability_zone'] = 'rackspace-iad-az3' or what have
>> > you. Ditto for host aggregates - add the metadata to ironic to match
>> > what's in the host aggregate. I'm honestly not sure what to do about
>> > (anti-)affinity filters; we'll need help figuring that out.
>> >
>>
>> Affinity is mostly meaningless with baremetal. It's entirely a
>> virtualization related thing. If you try and group things by TOR, or
>> chassis, or anything else, it's going to start meaning something entirely
>> different than it means in Nova, and it would probably be better to just
>> make lots of AZ's and have users choose their AZ mix appropriately,
>> since that is the real meaning of AZ's.
>>


Re: [openstack-dev] [Ironic] [Nova] continuing the "multiple compute host" discussion

2015-12-16 Thread Clint Byrum
Excerpts from Jim Rollenhagen's message of 2015-12-16 08:03:22 -0800:
> Nobody is talking about running a compute per flavor or capability. All
> compute hosts will be able to handle all ironic nodes. We *do* still
> need to figure out how to handle availability zones or host aggregates,
> but I expect we would pass along that data to be matched against. I
> think it would just be metadata on a node. Something like
> node.properties['availability_zone'] = 'rackspace-iad-az3' or what have
> you. Ditto for host aggregates - add the metadata to ironic to match
> what's in the host aggregate. I'm honestly not sure what to do about
> (anti-)affinity filters; we'll need help figuring that out.
> 

Affinity is mostly meaningless with baremetal. It's entirely a
virtualization related thing. If you try and group things by TOR, or
chassis, or anything else, it's going to start meaning something entirely
different than it means in Nova, and it would probably be better to just
make lots of AZ's and have users choose their AZ mix appropriately,
since that is the real meaning of AZ's.



Re: [openstack-dev] [Ironic] [Nova] continuing the "multiple compute host" discussion

2015-12-16 Thread James Penick
>We actually called out this problem in the Ironic midcycle and the Tokyo
>summit - we decided to report Ironic's total capacity from each compute
>host (resulting in over-reporting from Nova), and real capacity (for
>purposes of reporting, monitoring, whatever) should be fetched by
>operators from Ironic (IIRC, you specifically were okay with this
>limitation). This is still wrong, but it's the least wrong of any option
>(yes, all are wrong in some way). See the spec[1] for more details.

I do recall that discussion, but the merged spec says:

"In general, a nova-compute running the Ironic virt driver should expose
(total resources)/(number of compute services). This allows for resources
to be
sharded across multiple compute services without over-reporting resources."

I agree that what you said via email is Less Awful than what I read on the
spec (Did I misread it? Am I full of crazy?)

>We *do* still
>need to figure out how to handle availability zones or host aggregates,
>but I expect we would pass along that data to be matched against. I
>think it would just be metadata on a node. Something like
>node.properties['availability_zone'] = 'rackspace-iad-az3' or what have
>you. Ditto for host aggregates - add the metadata to ironic to match
>what's in the host aggregate. I'm honestly not sure what to do about
>(anti-)affinity filters; we'll need help figuring that out.
&
>Right, I didn't mean gantt specifically, but rather "splitting out the
>scheduler" like folks keep talking about. That's why I said "actually
>exists". :)

 I think splitting out the scheduler isn't going to be realistic. My
feeling is, if Nova is going to fulfill its destiny of being The Compute
Service, then the scheduler will stay put and the VM pieces will split out
into another service (Which I think should be named "Seamus" so I can refer
to it as "The wee baby Seamus").

(re: ironic maintaining host aggregates)
>Yes, and yes, assuming those things are valuable to our users. The
>former clearly is, the latter will clearly depend on the change but I
>expect we will evolve to continue to fit Nova's model of the world
>(after all, fitting into Nova's model is a huge chunk of what we do, and
>is exactly what we're trying to do with this work).

It's a lot easier to fit into the nova model if we just use what's there
and don't bother trying to replicate it.

>Again, the other solutions I'm seeing that *do* solve more problems are:
>* Rewrite the resource tracker

>Do you have an entire team (yes, it will take a relatively large team,
>especially when you include some cores dedicated to reviewing the code)
>that can dedicate a couple of development cycles to one of these?

 We can certainly help.

>I sure
>don't. If and when we do, we can move forward on that and deprecate this
>model, if we find that to be a useful thing to do at that time. Right
>now, this is the best plan I have, that we can commit to completing in a
>reasonable timeframe.

I respect that you're trying to solve the problem we have right now to make
operators lives Suck Less. But I think that a short term decision made now
would hurt a lot more later on.

-James

On Wed, Dec 16, 2015 at 8:03 AM, Jim Rollenhagen 
wrote:

> On Tue, Dec 15, 2015 at 05:19:19PM -0800, James Penick wrote:
> > > getting rid of the raciness of ClusteredComputeManager in my
> > >current deployment. And I'm willing to help other operators do the same.
> >
> >  You do alleviate race, but at the cost of complexity and
> > unpredictability.  Breaking that down, let's say we go with the current
> > plan and the compute host abstracts hardware specifics from Nova.  The
> > compute host will report (sum of resources)/(sum of managed compute).  If
> > the hardware beneath that compute host is heterogeneous, then the
> resources
> > reported up to nova are not correct, and that really does have
> significant
> > impact on deployers.
> >
> >  As an example: Let's say we have 20 nodes behind a compute process.
> Half
> > of those nodes have 24T of disk, the other have 1T.  An attempt to
> schedule
> > a node with 24T of disk will fail, because Nova scheduler is only aware
> of
> > 12.5T of disk.
>
> We actually called out this problem in the Ironic midcycle and the Tokyo
> summit - we decided to report Ironic's total capacity from each compute
> host (resulting in over-reporting from Nova), and real capacity (for
> purposes of reporting, monitoring, whatever) should be fetched by
> operators from Ironic (IIRC, you specifically were okay with this
> limitation). This is still wrong, but it's the least wrong of any option
> (yes, all are wrong in some way). See the spec[1] for more details.
>
> >  Ok, so one could argue that you should just run two compute processes
> per
> > type of host (N+1 redundancy).  If you have different raid levels on two
> > otherwise identical hosts, you'll now need a new compute process for each
> > variant of hardware.  What about host aggregates or 

Re: [openstack-dev] [Ironic] [Nova] continuing the "multiple compute host" discussion

2015-12-16 Thread Andrew Laski

On 12/16/15 at 12:40pm, James Penick wrote:

Affinity is mostly meaningless with baremetal. It's entirely a
virtualization related thing. If you try and group things by TOR, or
chassis, or anything else, it's going to start meaning something entirely
different than it means in Nova,


I disagree; in fact, we need TOR and power affinity/anti-affinity for VMs
as well as baremetal. As an example, there are cases where certain compute
resources move significant amounts of data between one or two other
instances, but you want to ensure those instances are not on the same
hypervisor. In that scenario it makes sense to have instances on different
hypervisors, but on the same TOR to reduce unnecessary traffic across the
fabric.


I think the point was that affinity/anti-affinity as it's defined today 
within Nova does not have any real meaning for baremetal. The scope is 
a single host, and baremetal won't have two instances on the same host, so 
by default you have anti-affinity and asking for affinity doesn't make 
sense.


There's a WIP spec proposing scoped policies for server groups that I 
think addresses the case you outlined: 
https://review.openstack.org/#/c/247654/. It's affinity/anti-affinity 
at a different level. It may help the discussion to differentiate 
between the general concept of affinity/anti-affinity, which could 
apply to many different scopes, and the current Nova definition of those 
concepts, which has a very specific scope.
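
For contrast, here is a minimal sketch of how (anti-)affinity is expressed
against today's Nova API: a server group with a host-scoped policy,
referenced at boot time via a scheduler hint. The rack/TOR-scoped policies
in the spec above are a proposal and are not part of these payloads.

    # Sketch of the existing host-scoped mechanism; payload shapes only,
    # no HTTP calls. The scoped policies proposed in the spec would extend
    # the "policies" list, which today only understands host-level
    # affinity/anti-affinity.
    import json

    server_group = {
        "server_group": {"name": "ha-db", "policies": ["anti-affinity"]}
    }  # POST /os-server-groups returns a group UUID

    boot_request = {
        "server": {"name": "db-01",
                   "flavorRef": "<flavor-id>",
                   "imageRef": "<image-id>"},
        "os:scheduler_hints": {"group": "<server-group-uuid>"},
    }  # POST /servers places db-01 according to the group's policy

    print(json.dumps(server_group, indent=2))
    print(json.dumps(boot_request, indent=2))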






and it would probably be better to just
make lots of AZ's and have users choose their AZ mix appropriately,
since that is the real meaning of AZ's.


Yes, at some level certain things should be expressed in the form of an AZ;
power seems like a good candidate for that. But expressing something like
a TOR as an AZ in an environment with hundreds of thousands of physical
hosts would not scale. Further, it would require users to have a deeper
understanding of datacenter topology, which is exactly the opposite of why
IaaS exists.

The whole point of a service-oriented infrastructure is to be able to give
the end user the ability to boot compute resources that match a variety of
constraints, and have those resources selected and provisioned for them. IE
"Give me 12 instances of m1.blah, all running Linux, and make sure they're
spread across 6 different TORs and 2 different power domains in network
zone Blah."



I think the above spec covers this.  The difference to me is that AZs 
require the user to think about absolute placements while the spec 
offers a means to think about relative placements.











On Wed, Dec 16, 2015 at 10:38 AM, Clint Byrum  wrote:


Excerpts from Jim Rollenhagen's message of 2015-12-16 08:03:22 -0800:
> Nobody is talking about running a compute per flavor or capability. All
> compute hosts will be able to handle all ironic nodes. We *do* still
> need to figure out how to handle availability zones or host aggregates,
> but I expect we would pass along that data to be matched against. I
> think it would just be metadata on a node. Something like
> node.properties['availability_zone'] = 'rackspace-iad-az3' or what have
> you. Ditto for host aggregates - add the metadata to ironic to match
> what's in the host aggregate. I'm honestly not sure what to do about
> (anti-)affinity filters; we'll need help figuring that out.
>

Affinity is mostly meaningless with baremetal. It's entirely a
virtualization related thing. If you try and group things by TOR, or
chassis, or anything else, it's going to start meaning something entirely
different than it means in Nova, and it would probably be better to just
make lots of AZ's and have users choose their AZ mix appropriately,
since that is the real meaning of AZ's.



Re: [openstack-dev] [Ironic] [Nova] continuing the "multiple compute host" discussion

2015-12-16 Thread melanie witt
On Dec 10, 2015, at 15:57, Devananda van der Veen  
wrote:

> So, at this point, I think we need to accept that the scheduling of 
> virtualized and bare metal workloads are two different problem domains that 
> are equally complex.
> 
> Either, we:
> * build a separate scheduler process in Ironic, forking the Nova scheduler as 
> a starting point so as to be compatible with existing plugins; or
> * begin building a direct integration between nova-scheduler and ironic, and 
> create a non-virtualization-centric resource tracker within Nova; or
> * proceed with the plan we previously outlined, accept that this isn't going 
> to be backwards compatible with nova filter plugins, and apologize to any 
> operators who rely on using the same scheduler plugins for baremetal and 
> virtual resources; or
> * keep punting on this, bringing pain and suffering to all operators of bare 
> metal clouds, because nova-compute must be run as exactly one process for all 
> sizes of clouds.

Speaking only for myself, I find the current direction unfortunate, but at the 
same time understandable, given how long it’s been discussed and the need to 
act now.

It becomes apparent to me when I think about the future picture, if I imagine 
what the Compute API should look like for all end users of 
vm/baremetal/container. They should be able to call one API to create an 
instance and the cloud will do the right things. I can see Nova being that API 
(entrypoint + scheduling, then handoff via driver to vm/baremetal/container 
API). An alternative would be a separate, new frontend API that hands off to a 
separate scheduling API (scheduler break out) that hands off to the various 
compute APIs (vm/baremetal/container).

I realized that if we were able to do a 1:1 ratio of nova-compute to Ironic 
node, everything would work fine as-is. But I understand the problems with that 
as nova-compute processes can’t be run on the inventory nodes themselves, so 
you’re left with a ton of processes that you would have to find a place to run 
and it’s wasteful. Ironic doesn’t “fit in” to the model of 1:1 nova-compute to 
resource.

My concern with the current plan is the need to sync constructs like aggregates 
and availability zones from one system (Nova) to the other (Ironic) in 
perpetuity. Users will have to set them up in both systems and keep them in 
sync. The code itself also has to be effectively duplicated along with filters 
and kept in sync. Eventually each of Nova and Ironic would be separate 
standalone systems, I imagine, to avoid having the sync issues.

I’d rather we provided something like a more generic “Resource View API” in 
Nova that allows baremetal/container/clustered hypervisor environments to 
report resources via a REST API, and scheduling would occur based on the 
resources table (instead of having resource trackers). Each environment 
reporting resources would provide corresponding in-tree Nova scheduler filters 
that know what to do with resources related to them. Then scheduling would 
select a resource and lookup the compute host responsible for that resource, 
and nova-compute would delegate the chosen resource to, for example, Ironic.
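
To make that concrete, here is a purely hypothetical sketch of what a
record in such a "Resource View API" might look like and how a filter
could consume it; none of these names or fields exist today.

    # Hypothetical only: Nova has no "Resource View API". This sketches the
    # idea of environments pushing generic resource records that in-tree
    # filters match on, instead of per-host resource trackers pulling state.
    resource_record = {
        "resource_id": "ironic-node-1234",
        "reported_by": "ironic",               # service that owns the resource
        "compute_host": "ironic-compute-01",   # host to hand the pick back to
        "traits": {"cpus": 24, "memory_mb": 131072, "local_gb": 24000},
    }

    def disk_filter(record, requested_gb):
        """Toy filter: keep resources with enough local disk."""
        return record["traits"]["local_gb"] >= requested_gb

    print(disk_filter(resource_record, 24000))  # True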

This same concept could exist in a separate scheduler service instead of Nova, 
but I don’t see why it can’t be in Nova. I figure we could either enhance Nova 
and eventually forklift the virtualization driver code out into a thin service 
that manages vms, or we could build a new frontend service and a scheduling 
service, and forklift the scheduling bits out of Nova so that it ends up being 
a thin service. The end result seems really similar to me, though one could 
argue that there are other systems that want to share scheduling code that 
aren’t provisioning compute, and thus scheduling would have to move out of Nova 
anyway.

With the current direction, I see things going separate standalone with 
duplicated constructs and then eventually refactored to use common services 
down the road if and when they exist.

I would personally prefer a direction toward something like a Resource View API 
in Nova that generalizes resources to avoid compute services, like Ironic, 
having to duplicate scheduling, aggregates, availability zones, etc.

-melanie










Re: [openstack-dev] [Ironic] [Nova] continuing the "multiple compute host" discussion

2015-12-16 Thread Jim Rollenhagen
On Wed, Dec 16, 2015 at 01:51:47PM -0800, James Penick wrote:
> >We actually called out this problem in the Ironic midcycle and the Tokyo
> >summit - we decided to report Ironic's total capacity from each compute
> >host (resulting in over-reporting from Nova), and real capacity (for
> >purposes of reporting, monitoring, whatever) should be fetched by
> >operators from Ironic (IIRC, you specifically were okay with this
> >limitation). This is still wrong, but it's the least wrong of any option
> >(yes, all are wrong in some way). See the spec[1] for more details.
> 
> I do recall that discussion, but the merged spec says:
> 
> "In general, a nova-compute running the Ironic virt driver should expose
> (total resources)/(number of compute services). This allows for resources
> to be
> sharded across multiple compute services without over-reporting resources."
> 
> I agree that what you said via email is Less Awful than what I read on the
> spec (Did I misread it? Am I full of crazy?)

Oh wow, that was totally missed when we figured that problem out. If you
look down a few paragraphs (under what the reservation request looks
like), it's got more correct words. Sorry about that.

This should clear it up: https://review.openstack.org/#/c/258687/

> >We *do* still
> >need to figure out how to handle availability zones or host aggregates,
> >but I expect we would pass along that data to be matched against. I
> >think it would just be metadata on a node. Something like
> >node.properties['availability_zone'] = 'rackspace-iad-az3' or what have
> >you. Ditto for host aggregates - add the metadata to ironic to match
> >what's in the host aggregate. I'm honestly not sure what to do about
> >(anti-)affinity filters; we'll need help figuring that out.
> &
> >Right, I didn't mean gantt specifically, but rather "splitting out the
> >scheduler" like folks keep talking about. That's why I said "actually
> >exists". :)
> 
>  I think splitting out the scheduler isn't going to be realistic. My
> feeling is, if Nova is going to fulfill its destiny of being The Compute
> Service, then the scheduler will stay put and the VM pieces will split out
> into another service (Which I think should be named "Seamus" so I can refer
> to it as "The wee baby Seamus").

Sure, that's honestly the best option, but will take even longer. :)

> (re: ironic maintaining host aggregates)
> >Yes, and yes, assuming those things are valuable to our users. The
> >former clearly is, the latter will clearly depend on the change but I
> >expect we will evolve to continue to fit Nova's model of the world
> >(after all, fitting into Nova's model is a huge chunk of what we do, and
> >is exactly what we're trying to do with this work).
> 
> It's a lot easier to fit into the nova model if we just use what's there
> and don't bother trying to replicate it.

The problem is, the Nova model is "one compute service per physical
host". This is actually *much* easier to implement, if you want to run a
compute service per physical host. :)

> >Again, the other solutions I'm seeing that *do* solve more problems are:
> >* Rewrite the resource tracker
> 
> >Do you have an entire team (yes, it will take a relatively large team,
> >especially when you include some cores dedicated to reviewing the code)
> >that can dedicate a couple of development cycles to one of these?
> 
>  We can certainly help.
> 
> >I sure
> >don't. If and when we do, we can move forward on that and deprecate this
> >model, if we find that to be a useful thing to do at that time. Right
> >now, this is the best plan I have, that we can commit to completing in a
> >reasonable timeframe.
> 
> I respect that you're trying to solve the problem we have right now to make
> operators lives Suck Less. But I think that a short term decision made now
> would hurt a lot more later on.

Yeah, I think that's the biggest disagreement here; I don't think we're
blocking any work to make this even better in the future, just taking a
step toward that. It will be extra work to unwind, and I think it's
worth the tradeoff.

// jim

> -James
> 
> On Wed, Dec 16, 2015 at 8:03 AM, Jim Rollenhagen 
> wrote:
> 
> > On Tue, Dec 15, 2015 at 05:19:19PM -0800, James Penick wrote:
> > > > getting rid of the raciness of ClusteredComputeManager in my
> > > >current deployment. And I'm willing to help other operators do the same.
> > >
> > >  You do alleviate race, but at the cost of complexity and
> > > unpredictability.  Breaking that down, let's say we go with the current
> > > plan and the compute host abstracts hardware specifics from Nova.  The
> > > compute host will report (sum of resources)/(sum of managed compute).  If
> > > the hardware beneath that compute host is heterogeneous, then the
> > resources
> > > reported up to nova are not correct, and that really does have
> > significant
> > > impact on deployers.
> > >
> > >  As an example: Let's say we have 20 nodes behind a compute process.
> > Half

Re: [openstack-dev] [Ironic] [Nova] continuing the "multiple compute host" discussion

2015-12-16 Thread Chris Dent

On Wed, 16 Dec 2015, melanie witt wrote:


I’d rather we provided something like a more generic “Resource
View API” in Nova that allows baremetal/container/clustered
hypervisor environments to report resources via a REST API, and
scheduling would occur based on the resources table (instead of having
resource trackers). Each environment reporting resources would provide
corresponding in-tree Nova scheduler filters that know what to do with
resources related to them. Then scheduling would select a resource and
lookup the compute host responsible for that resource, and nova-
compute would delegate the chosen resource to, for example, Ironic.


That ^ makes me think of this: https://review.openstack.org/#/c/253187/

They seem to be in at least similar veins.

--
Chris Dent   http://anticdent.org/
freenode: cdent tw: @anticdent


Re: [openstack-dev] [Ironic] [Nova] continuing the "multiple compute host" discussion

2015-12-16 Thread Jim Rollenhagen
On Tue, Dec 15, 2015 at 05:19:19PM -0800, James Penick wrote:
> > getting rid of the raciness of ClusteredComputeManager in my
> >current deployment. And I'm willing to help other operators do the same.
> 
>  You do alleviate race, but at the cost of complexity and
> unpredictability.  Breaking that down, let's say we go with the current
> plan and the compute host abstracts hardware specifics from Nova.  The
> compute host will report (sum of resources)/(sum of managed compute).  If
> the hardware beneath that compute host is heterogeneous, then the resources
> reported up to nova are not correct, and that really does have significant
> impact on deployers.
> 
>  As an example: Let's say we have 20 nodes behind a compute process.  Half
> of those nodes have 24T of disk, the other have 1T.  An attempt to schedule
> a node with 24T of disk will fail, because Nova scheduler is only aware of
> 12.5T of disk.

We actually called out this problem in the Ironic midcycle and the Tokyo
summit - we decided to report Ironic's total capacity from each compute
host (resulting in over-reporting from Nova), and real capacity (for
purposes of reporting, monitoring, whatever) should be fetched by
operators from Ironic (IIRC, you specifically were okay with this
limitation). This is still wrong, but it's the least wrong of any option
(yes, all are wrong in some way). See the spec[1] for more details.

>  Ok, so one could argue that you should just run two compute processes per
> type of host (N+1 redundancy).  If you have different raid levels on two
> otherwise identical hosts, you'll now need a new compute process for each
> variant of hardware.  What about host aggregates or availability zones?
> This sounds like an N^2 problem.  A mere 2 host flavors spread across 2
> availability zones means 8 compute processes.
> 
> I have hundreds of hardware flavors, across different security, network,
> and power availability zones.

Nobody is talking about running a compute per flavor or capability. All
compute hosts will be able to handle all ironic nodes. We *do* still
need to figure out how to handle availability zones or host aggregates,
but I expect we would pass along that data to be matched against. I
think it would just be metadata on a node. Something like
node.properties['availability_zone'] = 'rackspace-iad-az3' or what have
you. Ditto for host aggregates - add the metadata to ironic to match
what's in the host aggregate. I'm honestly not sure what to do about
(anti-)affinity filters; we'll need help figuring that out.
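
A rough sketch of what that matching might look like; the filter and the
'availability_zone' property convention below are illustrative, not
existing code.

    # Illustrative only: sketches matching node metadata set in Ironic
    # against the availability zone Nova is scheduling for. Nothing like
    # this exists in tree today.
    def node_matches_az(node_properties, requested_az):
        """Return True if the Ironic node is tagged with the requested AZ."""
        return node_properties.get('availability_zone') == requested_az

    node = {'properties': {'availability_zone': 'rackspace-iad-az3'}}
    print(node_matches_az(node['properties'], 'rackspace-iad-az3'))  # True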

> >None of this precludes getting to a better world where Gantt actually
> >exists, or the resource tracker works well with Ironic.
> 
> It doesn't preclude it, no. But Gantt is dead[1], and I haven't seen any
> movement to bring it back.

Right, I didn't mean gantt specifically, but rather "splitting out the
scheduler" like folks keep talking about. That's why I said "actually
exists". :)

> >It just gets us to an incrementally better model in the meantime.
> 
>  I strongly disagree. Will Ironic manage its own concept of availability
> zones and host aggregates?  What if nova changes their model, will Ironic
> change to mirror it?  If not I now need to model the same topology in two
> different ways.

Yes, and yes, assuming those things are valuable to our users. The
former clearly is, the latter will clearly depend on the change but I
expect we will evolve to continue to fit Nova's model of the world
(after all, fitting into Nova's model is a huge chunk of what we do, and
is exactly what we're trying to do with this work).

>  In that context, breaking out scheduling and "hiding" ironic resources
> behind a compute process is going to create more problems than it will
> solve, and is not the "Least bad" of the options to me.

Again, the other solutions I'm seeing that *do* solve more problems are:

* Rewrite the resource tracker
* Break out the scheduler into a separate thing

Do you have an entire team (yes, it will take a relatively large team,
especially when you include some cores dedicated to reviewing the code)
that can dedicate a couple of development cycles to one of these? I sure
don't. If and when we do, we can move forward on that and deprecate this
model, if we find that to be a useful thing to do at that time. Right
now, this is the best plan I have, that we can commit to completing in a
reasonable timeframe.

// jim

> 
> -James
> [1] http://git.openstack.org/cgit/openstack/gantt/tree/README.rst
> 
> On Mon, Dec 14, 2015 at 5:28 PM, Jim Rollenhagen 
> wrote:
> 
> > On Mon, Dec 14, 2015 at 04:15:42PM -0800, James Penick wrote:
> > > I'm very much against it.
> > >
> > >  In my environment we're going to be depending heavily on the nova
> > > scheduler for affinity/anti-affinity of physical datacenter constructs,
> > > TOR, Power, etc. Like other operators we need to also have a concept of
> > > host aggregates and availability zones for our baremetal as well. If
> 

Re: [openstack-dev] [Ironic] [Nova] continuing the "multiple compute host" discussion

2015-12-15 Thread James Penick
> getting rid of the raciness of ClusteredComputeManager in my
>current deployment. And I'm willing to help other operators do the same.

 You do alleviate race, but at the cost of complexity and
unpredictability.  Breaking that down, let's say we go with the current
plan and the compute host abstracts hardware specifics from Nova.  The
compute host will report (sum of resources)/(sum of managed compute).  If
the hardware beneath that compute host is heterogeneous, then the resources
reported up to nova are not correct, and that really does have significant
impact on deployers.

 As an example: Let's say we have 20 nodes behind a compute process.  Half
of those nodes have 24T of disk, the other have 1T.  An attempt to schedule
a node with 24T of disk will fail, because Nova scheduler is only aware of
12.5T of disk.
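
The arithmetic behind the 12.5T figure, under the "(sum of resources)/(sum
of managed compute)" reporting model described above:

    # Worked version of the example: 20 nodes behind one compute process,
    # ten with 24T of disk and ten with 1T, reported as a per-node average.
    nodes_gb = [24000] * 10 + [1000] * 10
    reported_per_node = sum(nodes_gb) / len(nodes_gb)
    print(reported_per_node)            # 12500.0 GB, i.e. 12.5T
    print(reported_per_node >= 24000)   # False: a 24T request never fits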

 Ok, so one could argue that you should just run two compute processes per
type of host (N+1 redundancy).  If you have different raid levels on two
otherwise identical hosts, you'll now need a new compute process for each
variant of hardware.  What about host aggregates or availability zones?
This sounds like an N^2 problem.  A mere 2 host flavors spread across 2
availability zones means 8 compute processes.

I have hundreds of hardware flavors, across different security, network,
and power availability zones.

>None of this precludes getting to a better world where Gantt actually
>exists, or the resource tracker works well with Ironic.

It doesn't preclude it, no. But Gantt is dead[1], and I haven't seen any
movement to bring it back.

>It just gets us to an incrementally better model in the meantime.

 I strongly disagree. Will Ironic manage its own concept of availability
zones and host aggregates?  What if nova changes their model, will Ironic
change to mirror it?  If not I now need to model the same topology in two
different ways.

 In that context, breaking out scheduling and "hiding" ironic resources
behind a compute process is going to create more problems than it will
solve, and is not the "Least bad" of the options to me.

-James
[1] http://git.openstack.org/cgit/openstack/gantt/tree/README.rst

On Mon, Dec 14, 2015 at 5:28 PM, Jim Rollenhagen 
wrote:

> On Mon, Dec 14, 2015 at 04:15:42PM -0800, James Penick wrote:
> > I'm very much against it.
> >
> >  In my environment we're going to be depending heavily on the nova
> > scheduler for affinity/anti-affinity of physical datacenter constructs,
> > TOR, Power, etc. Like other operators we need to also have a concept of
> > host aggregates and availability zones for our baremetal as well. If
> these
> > decisions move out of Nova, we'd have to replicate that entire concept of
> > topology inside of the Ironic scheduler. Why do that?
> >
> > I see there are 3 main problems:
> >
> > 1. Resource tracker sucks for Ironic.
> > 2. We need compute host HA
> > 3. We need to schedule compute resources in a consistent way.
> >
> >  We've been exploring options to get rid of RT entirely. However, melwitt
> > suggested that by improving RT itself, and changing it from a pull
> > model to a push, we skip a lot of these problems. I think it's an
> excellent
> > point. If RT moves to a push model, Ironic can dynamically register nodes
> > as they're added, consumed, claimed, etc and update their state in Nova.
> >
> >  Compute host HA is critical for us, too. However, if the compute hosts
> are
> > not responsible for any complex scheduling behaviors, it becomes much
> > simpler to move the compute hosts to being nothing more than dumb workers
> > selected at random.
> >
> >  With this model, the Nova scheduler can still select compute resources
> in
> > the way that it expects, and deployers can expect to build one system to
> > manage VM and BM. We get rid of RT race conditions, and gain compute HA.
>
> Right, so Deva mentioned this here. Copied from below:
>
> > > > Some folks are asking us to implement a non-virtualization-centric
> > > > scheduler / resource tracker in Nova, or advocating that we wait for
> the
> > > > Nova scheduler to be split-out into a separate project. I do not
> believe
> > > > the Nova team is interested in the former, I do not want to wait for
> the
> > > > latter, and I do not believe that either one will be an adequate
> solution
> > > > -- there are other clients (besides Nova) that need to schedule
> workloads
> > > > on Ironic.
>
> And I totally agree with him. We can rewrite the resource tracker, or we
> can break out the scheduler. That will take years - what do you, as an
> operator, plan to do in the meantime? As an operator of ironic myself,
> I'm willing to eat the pain of figuring out what to do with my
> out-of-tree filters (and cells!), in favor of getting rid of the
> raciness of ClusteredComputeManager in my current deployment. And I'm
> willing to help other operators do the same.
>
> We've been talking about this for close to a year already - we need
> to actually do 

Re: [openstack-dev] [Ironic] [Nova] continuing the "multiple compute host" discussion

2015-12-15 Thread Clint Byrum
Excerpts from James Penick's message of 2015-12-15 17:19:19 -0800:
> > getting rid of the raciness of ClusteredComputeManager in my
> >current deployment. And I'm willing to help other operators do the same.
> 
>  You do alleviate race, but at the cost of complexity and
> unpredictability.  Breaking that down, let's say we go with the current
> plan and the compute host abstracts hardware specifics from Nova.  The
> compute host will report (sum of resources)/(sum of managed compute).  If
> the hardware beneath that compute host is heterogeneous, then the resources
> reported up to nova are not correct, and that really does have significant
> impact on deployers.
> 
>  As an example: Let's say we have 20 nodes behind a compute process.  Half
> of those nodes have 24T of disk, the other have 1T.  An attempt to schedule
> a node with 24T of disk will fail, because Nova scheduler is only aware of
> 12.5T of disk.
> 
>  Ok, so one could argue that you should just run two compute processes per
> type of host (N+1 redundancy).  If you have different raid levels on two
> otherwise identical hosts, you'll now need a new compute process for each
> variant of hardware.  What about host aggregates or availability zones?
> This sounds like an N^2 problem.  A mere 2 host flavors spread across 2
> availability zones means 8 compute processes.
> 
> I have hundreds of hardware flavors, across different security, network,
> and power availability zones.
> 
> >None of this precludes getting to a better world where Gantt actually
> >exists, or the resource tracker works well with Ironic.
> 
> It doesn't preclude it, no. But Gantt is dead[1], and I haven't seen any
> movement to bring it back.
> 
> >It just gets us to an incrementally better model in the meantime.
> 
>  I strongly disagree. Will Ironic manage its own concept of availability
> zones and host aggregates?  What if nova changes their model, will Ironic
> change to mirror it?  If not I now need to model the same topology in two
> different ways.
> 

Yes and yes?

How many matryoshka dolls can there possibly be in there anyway?

In all seriousness, I don't think it's unreasonable to say that something
that wants to create its own reasonable facsimile of Nova's scheduling
and resource tracking would need to implement the whole interface,
and would in fact need to continue to follow that interface over time.



Re: [openstack-dev] [Ironic] [Nova] continuing the "multiple compute host" discussion

2015-12-14 Thread Joshua Harlow
A question: what filtering/scheduling would be done in which place? Any 
thoughts on the breakup between nova and ironic?


If, say, ironic knows about all baremetal resources and nova doesn't know 
about them, then what kind of decisions can nova make at scheduling 
time? I guess the same question exists for other clustered drivers: what 
decision does nova really make for those types of drivers, and is that 
decision beneficial?


I guess the same question connects into various/most filters and how 
they operate with clustered drivers:


For example, if nova doesn't know about ironic baremetal resources, how 
does the concept of an availability zone or aggregate, or compute 
enabled/disabled filtering, work? (All of these, afaik, are tied to the 
nova-compute *service* and/or services table, but with this clustering 
model, which particular nova-compute proxies a request into ironic doesn't 
seem to mean that much.)


Has anyone compiled (or thought about compiling) a list of concepts from 
nova that *appear to* break down when a top-level project (nova) doesn't 
know about the resources its child projects (ironic...) contain? (Maybe 
an etherpad exists somewhere?)


Dan Smith wrote:

Thanks for summing this up, Deva. The planned solution still gets my
vote; we build that, deprecate the old single compute host model where
nova handles all scheduling, and in the meantime figure out the gaps
that operators need filled and the best way to fill them.


Mine as well, speaking only for myself. It's going to require some
deprecation and transition, but anyone with out-of-tree code (filters,
or otherwise) has to be prepared for that at any moment.

--Dan



Re: [openstack-dev] [Ironic] [Nova] continuing the "multiple compute host" discussion

2015-12-14 Thread Jim Rollenhagen
On Mon, Dec 14, 2015 at 04:15:42PM -0800, James Penick wrote:
> I'm very much against it.
> 
>  In my environment we're going to be depending heavily on the nova
> scheduler for affinity/anti-affinity of physical datacenter constructs,
> TOR, Power, etc. Like other operators we need to also have a concept of
> host aggregates and availability zones for our baremetal as well. If these
> decisions move out of Nova, we'd have to replicate that entire concept of
> topology inside of the Ironic scheduler. Why do that?
> 
> I see there are 3 main problems:
> 
> 1. Resource tracker sucks for Ironic.
> 2. We need compute host HA
> 3. We need to schedule compute resources in a consistent way.
> 
>  We've been exploring options to get rid of RT entirely. However, melwitt
> suggested that by improving RT itself, and changing it from a pull
> model to a push, we skip a lot of these problems. I think it's an excellent
> point. If RT moves to a push model, Ironic can dynamically register nodes
> as they're added, consumed, claimed, etc and update their state in Nova.
> 
>  Compute host HA is critical for us, too. However, if the compute hosts are
> not responsible for any complex scheduling behaviors, it becomes much
> simpler to move the compute hosts to being nothing more than dumb workers
> selected at random.
> 
>  With this model, the Nova scheduler can still select compute resources in
> the way that it expects, and deployers can expect to build one system to
> manage VM and BM. We get rid of RT race conditions, and gain compute HA.

Right, so Deva mentioned this here. Copied from below:

> > > Some folks are asking us to implement a non-virtualization-centric
> > > scheduler / resource tracker in Nova, or advocating that we wait for the
> > > Nova scheduler to be split-out into a separate project. I do not believe
> > > the Nova team is interested in the former, I do not want to wait for the
> > > latter, and I do not believe that either one will be an adequate solution
> > > -- there are other clients (besides Nova) that need to schedule workloads
> > > on Ironic.

And I totally agree with him. We can rewrite the resource tracker, or we
can break out the scheduler. That will take years - what do you, as an
operator, plan to do in the meantime? As an operator of ironic myself,
I'm willing to eat the pain of figuring out what to do with my
out-of-tree filters (and cells!), in favor of getting rid of the
raciness of ClusteredComputeManager in my current deployment. And I'm
willing to help other operators do the same.

We've been talking about this for close to a year already - we need
to actually do something. I don't believe we can do this in a
reasonable timeline *and* make everybody (ironic devs, nova devs, and
operators) happy. However, as we said elsewhere in the thread, the old
model will go through a deprecation process, and we can wait to remove
it until we do figure out the path forward for operators like yourself.
Then operators that need out-of-tree filters and the like can keep doing
what they're doing, while they help us (or just wait) to build something
that meets everyone's needs.

None of this precludes getting to a better world where Gantt actually
exists, or the resource tracker works well with Ironic. It just gets us
to an incrementally better model in the meantime.

If someone has a *concrete* proposal (preferably in code) for an alternative
that can be done relatively quickly and also keep everyone happy here, I'm
all ears. But I don't believe one exists at this time, and I'm inclined
to keep rolling forward with what we've got here.

// jim

> 
> -James
> 
> On Thu, Dec 10, 2015 at 4:42 PM, Jim Rollenhagen 
> wrote:
> 
> > On Thu, Dec 10, 2015 at 03:57:59PM -0800, Devananda van der Veen wrote:
> > > All,
> > >
> > > I'm going to attempt to summarize a discussion that's been going on for
> > > over a year now, and still remains unresolved.
> > >
> > > TLDR;
> > > 
> > >
> > > The main touch-point between Nova and Ironic continues to be a pain
> > point,
> > > and despite many discussions between the teams over the last year
> > resulting
> > > in a solid proposal, we have not been able to get consensus on a solution
> > > that meets everyone's needs.
> > >
> > > Some folks are asking us to implement a non-virtualization-centric
> > > scheduler / resource tracker in Nova, or advocating that we wait for the
> > > Nova scheduler to be split-out into a separate project. I do not believe
> > > the Nova team is interested in the former, I do not want to wait for the
> > > latter, and I do not believe that either one will be an adequate solution
> > > -- there are other clients (besides Nova) that need to schedule workloads
> > > on Ironic.
> > >
> > > We need to decide on a path of least pain and then proceed. I really want
> > > to get this done in Mitaka.
> > >
> > >
> > > Long version:
> > > -
> > >
> > > During Liberty, Jim and I worked with 

Re: [openstack-dev] [Ironic] [Nova] continuing the "multiple compute host" discussion

2015-12-14 Thread James Penick
I'm very much against it.

 In my environment we're going to be depending heavily on the nova
scheduler for affinity/anti-affinity of physical datacenter constructs,
TOR, Power, etc. Like other operators we need to also have a concept of
host aggregates and availability zones for our baremetal as well. If these
decisions move out of Nova, we'd have to replicate that entire concept of
topology inside of the Ironic scheduler. Why do that?

I see there are 3 main problems:

1. Resource tracker sucks for Ironic.
2. We need compute host HA
3. We need to schedule compute resources in a consistent way.

 We've been exploring options to get rid of RT entirely. However, melwitt
suggested that by improving RT itself, and changing it from a pull
model to a push, we skip a lot of these problems. I think it's an excellent
point. If RT moves to a push model, Ironic can dynamically register nodes
as they're added, consumed, claimed, etc and update their state in Nova.
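
A minimal sketch of the push idea (entirely hypothetical; no such
interface exists today): Ironic would notify a scheduler-side resource
view of node lifecycle changes instead of nova-compute pulling them
periodically.

    # Hypothetical push-model sketch; class and method names are made up.
    class ResourceView(object):
        """Toy stand-in for a scheduler-side resource table."""
        def __init__(self):
            self.nodes = {}

        def push_update(self, node_uuid, state, resources):
            self.nodes[node_uuid] = {'state': state, 'resources': resources}

    view = ResourceView()
    # Ironic pushes as nodes are enrolled, claimed, consumed, released...
    view.push_update('node-123', 'available',
                     {'cpus': 24, 'memory_mb': 131072, 'local_gb': 24000})
    view.push_update('node-123', 'active',
                     {'cpus': 24, 'memory_mb': 131072, 'local_gb': 24000})
    print(view.nodes['node-123']['state'])  # active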

 Compute host HA is critical for us, too. However, if the compute hosts are
not responsible for any complex scheduling behaviors, it becomes much
simpler to move the compute hosts to being nothing more than dumb workers
selected at random.

 With this model, the Nova scheduler can still select compute resources in
the way that it expects, and deployers can expect to build one system to
manage VM and BM. We get rid of RT race conditions, and gain compute HA.

-James

On Thu, Dec 10, 2015 at 4:42 PM, Jim Rollenhagen 
wrote:

> On Thu, Dec 10, 2015 at 03:57:59PM -0800, Devananda van der Veen wrote:
> > All,
> >
> > I'm going to attempt to summarize a discussion that's been going on for
> > over a year now, and still remains unresolved.
> >
> > TLDR;
> > 
> >
> > The main touch-point between Nova and Ironic continues to be a pain
> point,
> > and despite many discussions between the teams over the last year
> resulting
> > in a solid proposal, we have not been able to get consensus on a solution
> > that meets everyone's needs.
> >
> > Some folks are asking us to implement a non-virtualization-centric
> > scheduler / resource tracker in Nova, or advocating that we wait for the
> > Nova scheduler to be split-out into a separate project. I do not believe
> > the Nova team is interested in the former, I do not want to wait for the
> > latter, and I do not believe that either one will be an adequate solution
> > -- there are other clients (besides Nova) that need to schedule workloads
> > on Ironic.
> >
> > We need to decide on a path of least pain and then proceed. I really want
> > to get this done in Mitaka.
> >
> >
> > Long version:
> > -
> >
> > During Liberty, Jim and I worked with Jay Pipes and others on the Nova
> team
> > to come up with a plan. That plan was proposed in a Nova spec [1] and
> > approved in October, shortly before the Mitaka summit. It got significant
> > reviews from the Ironic team, since it is predicated on work being done
> in
> > Ironic to expose a new "reservations" API endpoint. The details of that
> > Ironic change were proposed separately [2] but have deadlocked.
> Discussions
> > with some operators at and after the Mitaka summit have highlighted a
> > problem with this plan.
> >
> > Actually, more than one, so to better understand the divergent viewpoints
> > that result in the current deadlock, I drew a diagram [3]. If you haven't
> > read both the Nova and Ironic specs already, this diagram probably won't
> > make sense to you. I'll attempt to explain it a bit with more words.
> >
> >
> > [A]
> > The Nova team wants to remove the (Host, Node) tuple from all the places
> > that this exists, and return to scheduling only based on Compute Host.
> They
> > also don't want to change any existing scheduler filters (especially not
> > compute_capabilities_filter) or the filter scheduler class or plugin
> > mechanisms. And, as far as I understand it, they're not interested in
> > accepting a filter plugin that calls out to external APIs (eg, Ironic) to
> > identify a Node and pass that Node's UUID to the Compute Host.  [[ nova
> > team: please correct me on any point here where I'm wrong, or your
> > collective views have changed over the last year. ]]
> >
> > [B]
> > OpenStack deployers who are using Nova + Ironic rely on a few things:
> > - compute_capabilities_filter to match node.properties['capabilities']
> > against flavor extra_specs.
> > - other downstream nova scheduler filters that do other sorts of hardware
> > matching
> > These deployers clearly and rightly do not want us to take away either of
> > these capabilities, so anything we do needs to be backwards compatible
> with
> > any current Nova scheduler plugins -- even downstream ones.
> >
> > [C] To meet the compatibility requirements of [B] without requiring the
> > nova-scheduler team to do the work, we would need to forklift some parts
> of
> > the nova-scheduler code into Ironic. But I think that's terrible, and 

Re: [openstack-dev] [Ironic] [Nova] continuing the "multiple compute host" discussion

2015-12-14 Thread Dan Smith
> Thanks for summing this up, Deva. The planned solution still gets my
> vote; we build that, deprecate the old single compute host model where
> nova handles all scheduling, and in the meantime figure out the gaps
> that operators need filled and the best way to fill them.

Mine as well, speaking only for myself. It's going to require some
deprecation and transition, but anyone with out-of-tree code (filters,
or otherwise) has to be prepared for that at any moment.

--Dan



Re: [openstack-dev] [Ironic] [Nova] continuing the "multiple compute host" discussion

2015-12-10 Thread Jim Rollenhagen
On Thu, Dec 10, 2015 at 03:57:59PM -0800, Devananda van der Veen wrote:
> All,
> 
> I'm going to attempt to summarize a discussion that's been going on for
> over a year now, and still remains unresolved.
> 
> TLDR;
> 
> 
> The main touch-point between Nova and Ironic continues to be a pain point,
> and despite many discussions between the teams over the last year resulting
> in a solid proposal, we have not been able to get consensus on a solution
> that meets everyone's needs.
> 
> Some folks are asking us to implement a non-virtualization-centric
> scheduler / resource tracker in Nova, or advocating that we wait for the
> Nova scheduler to be split-out into a separate project. I do not believe
> the Nova team is interested in the former, I do not want to wait for the
> latter, and I do not believe that either one will be an adequate solution
> -- there are other clients (besides Nova) that need to schedule workloads
> on Ironic.
> 
> We need to decide on a path of least pain and then proceed. I really want
> to get this done in Mitaka.
> 
> 
> Long version:
> -
> 
> During Liberty, Jim and I worked with Jay Pipes and others on the Nova team
> to come up with a plan. That plan was proposed in a Nova spec [1] and
> approved in October, shortly before the Mitaka summit. It got significant
> reviews from the Ironic team, since it is predicated on work being done in
> Ironic to expose a new "reservations" API endpoint. The details of that
> Ironic change were proposed separately [2] but have deadlocked. Discussions
> with some operators at and after the Mitaka summit have highlighted a
> problem with this plan.
> 
> Actually, more than one, so to better understand the divergent viewpoints
> that result in the current deadlock, I drew a diagram [3]. If you haven't
> read both the Nova and Ironic specs already, this diagram probably won't
> make sense to you. I'll attempt to explain it a bit with more words.
> 
> 
> [A]
> The Nova team wants to remove the (Host, Node) tuple from all the places
> that this exists, and return to scheduling only based on Compute Host. They
> also don't want to change any existing scheduler filters (especially not
> compute_capabilities_filter) or the filter scheduler class or plugin
> mechanisms. And, as far as I understand it, they're not interested in
> accepting a filter plugin that calls out to external APIs (eg, Ironic) to
> identify a Node and pass that Node's UUID to the Compute Host.  [[ nova
> team: please correct me on any point here where I'm wrong, or your
> collective views have changed over the last year. ]]
> 
> [B]
> OpenStack deployers who are using Nova + Ironic rely on a few things:
> - compute_capabilities_filter to match node.properties['capabilities']
> against flavor extra_specs.
> - other downstream nova scheduler filters that do other sorts of hardware
> matching
> These deployers clearly and rightly do not want us to take away either of
> these capabilities, so anything we do needs to be backwards compatible with
> any current Nova scheduler plugins -- even downstream ones.
> 
> [C] To meet the compatibility requirements of [B] without requiring the
> nova-scheduler team to do the work, we would need to forklift some parts of
> the nova-scheduler code into Ironic. But I think that's terrible, and I
> don't think any OpenStack developers will like it. Furthermore, operators
> have already expressed their distaste for this because they want to use the
> same filters for virtual and baremetal instances but do not want to
> duplicate the code (because we all know that's a recipe for drift).
> 
> [D]
> Whatever solution we devise for scheduling bare metal resources in Ironic
> needs to perform well at the scale Ironic deployments are aiming for (eg,
> thousands of Nodes) without the use of Cells. It also must be integrable
> with other software (eg, it should be exposed in our REST API). And it must
> allow us to run more than one (active-active) nova-compute process, which
> we can't do today.
> 
> 
> OK. That's a lot of words... bear with me, though, as I'm not done yet...
> 
> This drawing [3] is a Venn diagram, but not everything overlaps. The Nova
> and Ironic specs [1],[2] meet the needs of the Nova team and the Ironic
> team, and will provide a more performant, highly available solution that
> is easier to use with other schedulers or datacenter-management tools.
> However, this solution does not meet the needs of some current OpenStack
> Operators because it will not support Nova Scheduler filter plugins. Thus,
> in the diagram, [A] and [D] overlap but neither one intersects with [B].
> 
> 
> Summary
> --
> 
> We have proposed a solution that fits Ironic's HA model into nova-compute's
> failure domain model, but that's only half of the picture -- in so doing,
> we assumed that scheduling of bare metal resources was simplistic when, in
> fact, it needs to be just as rich as the scheduling of virtual resources.
> 
> 

[openstack-dev] [Ironic] [Nova] continuing the "multiple compute host" discussion

2015-12-10 Thread Devananda van der Veen
All,

I'm going to attempt to summarize a discussion that's been going on for
over a year now, and still remains unresolved.

TLDR;


The main touch-point between Nova and Ironic continues to be a pain point,
and despite many discussions between the teams over the last year resulting
in a solid proposal, we have not been able to get consensus on a solution
that meets everyone's needs.

Some folks are asking us to implement a non-virtualization-centric
scheduler / resource tracker in Nova, or advocating that we wait for the
Nova scheduler to be split out into a separate project. I do not believe
the Nova team is interested in the former, I do not want to wait for the
latter, and I do not believe that either one will be an adequate solution
-- there are other clients (besides Nova) that need to schedule workloads
on Ironic.

We need to decide on a path of least pain and then proceed. I really want
to get this done in Mitaka.


Long version:
-

During Liberty, Jim and I worked with Jay Pipes and others on the Nova team
to come up with a plan. That plan was proposed in a Nova spec [1] and
approved in October, shortly before the Mitaka summit. It got significant
reviews from the Ironic team, since it is predicated on work being done in
Ironic to expose a new "reservations" API endpoint. The details of that
Ironic change were proposed separately [2] but have deadlocked. Discussions
with some operators at and after the Mitaka summit have highlighted a
problem with this plan.

Actually, more than one, so to better understand the divergent viewpoints
that result in the current deadlock, I drew a diagram [3]. If you haven't
read both the Nova and Ironic specs already, this diagram probably won't
make sense to you. I'll attempt to explain it a bit with more words.


[A]
The Nova team wants to remove the (Host, Node) tuple from all the places
that this exists, and return to scheduling only based on Compute Host. They
also don't want to change any existing scheduler filters (especially not
compute_capabilities_filter) or the filter scheduler class or plugin
mechanisms. And, as far as I understand it, they're not interested in
accepting a filter plugin that calls out to external APIs (eg, Ironic) to
identify a Node and pass that Node's UUID to the Compute Host.  [[ nova
team: please correct me on any point here where I'm wrong, or your
collective views have changed over the last year. ]]
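
To make the sticking point in [A] concrete, below is a minimal, purely
hypothetical sketch of the kind of out-of-tree filter being ruled out: a
BaseHostFilter subclass that calls out to an external API from inside the
scheduler. The class name and the _ask_external_service() helper are invented
for illustration, and the exact host_passes() signature differs between Nova
releases, so treat this as a sketch rather than anyone's actual code.

# Hypothetical sketch only -- the shape of an out-of-tree filter that calls
# an external API (e.g. Ironic) from inside the Nova scheduler, which is
# exactly what [A] says the Nova team does not want to accept upstream.
from nova.scheduler import filters


class ExternalBaremetalFilter(filters.BaseHostFilter):
    """Ask an external service whether this host/node can take the request."""

    def host_passes(self, host_state, filter_properties):
        # With the Ironic virt driver, host_state.nodename identifies the
        # Ironic node backing this compute "host" entry.
        node_uuid = getattr(host_state, 'nodename', None)
        if not node_uuid:
            return True  # not a baremetal host entry; stay out of the way

        # Placeholder for the external call -- the part that makes this
        # kind of filter unattractive to the nova-scheduler team.
        return self._ask_external_service(node_uuid, filter_properties)

    def _ask_external_service(self, node_uuid, filter_properties):
        raise NotImplementedError("stand-in for a call out to Ironic's API")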

[B]
OpenStack deployers who are using Nova + Ironic rely on a few things:
- compute_capabilities_filter to match node.properties['capabilities']
against flavor extra_specs.
- other downstream nova scheduler filters that do other sorts of hardware
matching
These deployers clearly and rightly do not want us to take away either of
these capabilities, so anything we do needs to be backwards compatible with
any current Nova scheduler plugins -- even downstream ones.
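
For anyone not steeped in the deployer workflow in [B], the following is a
rough, self-contained approximation -- not the actual Nova filter code -- of
what that matching amounts to: Ironic stores capabilities on the node as a
comma-separated 'key:value' string, and flavors carry 'capabilities:<key>'
extra_specs that have to line up with it. The real ComputeCapabilitiesFilter
also supports comparison operators in extra_specs; this sketch only checks
exact equality.

def parse_node_capabilities(capabilities_str):
    """Parse Ironic's node.properties['capabilities'] string.

    The convention is a comma-separated list of 'key:value' pairs,
    e.g. 'boot_mode:uefi,raid_level:1'.
    """
    caps = {}
    for pair in (capabilities_str or '').split(','):
        if ':' in pair:
            key, value = pair.split(':', 1)
            caps[key.strip()] = value.strip()
    return caps


def flavor_matches_node(flavor_extra_specs, node_capabilities_str):
    """Return True if every 'capabilities:*' extra_spec matches the node."""
    caps = parse_node_capabilities(node_capabilities_str)
    for spec_key, wanted in flavor_extra_specs.items():
        if not spec_key.startswith('capabilities:'):
            continue  # other scopes are handled by other filters
        cap_name = spec_key[len('capabilities:'):]
        if caps.get(cap_name) != wanted:
            return False
    return True


# Example: a UEFI-only flavor matched against two candidate nodes.
extra_specs = {'capabilities:boot_mode': 'uefi'}
print(flavor_matches_node(extra_specs, 'boot_mode:uefi,raid_level:1'))  # True
print(flavor_matches_node(extra_specs, 'boot_mode:bios'))               # False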

[C] To meet the compatibility requirements of [B] without requiring the
nova-scheduler team to do the work, we would need to forklift some parts of
the nova-scheduler code into Ironic. But I think that's terrible, and I
don't think any OpenStack developers will like it. Furthermore, operators
have already expressed their distaste for this because they want to use the
same filters for virtual and baremetal instances but do not want to
duplicate the code (because we all know that's a recipe for drift).

[D]
Whatever solution we devise for scheduling bare metal resources in Ironic
needs to perform well at the scale Ironic deployments are aiming for (eg,
thousands of Nodes) without the use of Cells. It also must be integrable
with other software (eg, it should be exposed in our REST API). And it must
allow us to run more than one (active-active) nova-compute process, which
we can't do today.
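
On the active-active point in [D]: neither spec mentioned here prescribes a
mechanism, but one way to picture several nova-compute processes sharing a
large pool of Ironic nodes is a deterministic partitioning of node UUIDs
across the running services, so each node is claimed by exactly one process
at a time. A toy sketch under that assumption follows; a production version
would want a proper consistent hash ring so that membership changes only
remap a small fraction of nodes.

# Toy illustration only -- not what either spec proposes. Deterministically
# map each Ironic node UUID to one of the running nova-compute services.
import hashlib


def responsible_compute(node_uuid, compute_hosts):
    """Pick the compute service responsible for a given Ironic node."""
    digest = hashlib.sha256(node_uuid.encode('utf-8')).hexdigest()
    return compute_hosts[int(digest, 16) % len(compute_hosts)]


computes = ['compute-1', 'compute-2', 'compute-3']
for node in ('9c5cdb2a-7de1-4f3a-9f00-0242ac110001',
             '5b8f2d0e-11aa-4c7e-8d55-0242ac110002'):
    print(node, '->', responsible_compute(node, computes))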


OK. That's a lot of words... bear with me, though, as I'm not done yet...

This drawing [3] is a Venn diagram, but not everything overlaps. The Nova
and Ironic specs [1],[2] meet the needs of the Nova team and the Ironic
team, and will provide a more performant, highly available solution that
is easier to use with other schedulers or datacenter-management tools.
However, this solution does not meet the needs of some current OpenStack
Operators because it will not support Nova Scheduler filter plugins. Thus,
in the diagram, [A] and [D] overlap but neither one intersects with [B].


Summary
--

We have proposed a solution that fits Ironic's HA model into nova-compute's
failure domain model, but that's only half of the picture -- in so doing,
we assumed that scheduling of bare metal resources was simplistic when, in
fact, it needs to be just as rich as the scheduling of virtual resources.

So, at this point, I think we need to accept that scheduling virtualized
workloads and scheduling bare metal workloads are two different problem
domains that are equally complex.

Either we:
* build a separate scheduler process in Ironic, forking the Nova scheduler
as a starting point so