On Wed, Dec 16, 2015 at 01:51:47PM -0800, James Penick wrote: > >We actually called out this problem in the Ironic midcycle and the Tokyo > >summit - we decided to report Ironic's total capacity from each compute > >host (resulting in over-reporting from Nova), and real capacity (for > >purposes of reporting, monitoring, whatever) should be fetched by > >operators from Ironic (IIRC, you specifically were okay with this > >limitation). This is still wrong, but it's the least wrong of any option > >(yes, all are wrong in some way). See the spec[1] for more details. > > I do recall that discussion, but the merged spec says: > > "In general, a nova-compute running the Ironic virt driver should expose > (total resources)/(number of compute services). This allows for resources > to be > sharded across multiple compute services without over-reporting resources." > > I agree that what you said via email is Less Awful than what I read on the > spec (Did I misread it? Am I full of crazy?)
Oh wow, that was totally missed when we figured that problem out. If you look down a few paragraphs (under what the reservation request looks like), it's got more correct words. Sorry about that. This should clear it up: https://review.openstack.org/#/c/258687/ > >We *do* still > >need to figure out how to handle availability zones or host aggregates, > >but I expect we would pass along that data to be matched against. I > >think it would just be metadata on a node. Something like > >node.properties['availability_zone'] = 'rackspace-iad-az3' or what have > >you. Ditto for host aggregates - add the metadata to ironic to match > >what's in the host aggregate. I'm honestly not sure what to do about > >(anti-)affinity filters; we'll need help figuring that out. > & > >Right, I didn't mean gantt specifically, but rather "splitting out the > >scheduler" like folks keep talking about. That's why I said "actually > >exists". :) > > I think splitting out the scheduler isn't going to be realistic. My > feeling is, if Nova is going to fulfill its destiny of being The Compute > Service, then the scheduler will stay put and the VM pieces will split out > into another service (Which I think should be named "Seamus" so I can refer > to it as "The wee baby Seamus"). Sure, that's honestly the best option, but will take even longer. :) > (re: ironic maintaining host aggregates) > >Yes, and yes, assuming those things are valuable to our users. The > >former clearly is, the latter will clearly depend on the change but I > >expect we will evolve to continue to fit Nova's model of the world > >(after all, fitting into Nova's model is a huge chunk of what we do, and > >is exactly what we're trying to do with this work). > > It's a lot easier to fit into the nova model if we just use what's there > and don't bother trying to replicate it. The problem is, the Nova model is "one compute service per physical host". This is actually *much* easier to implement, if you want to run a compute service per physical host. :) > >Again, the other solutions I'm seeing that *do* solve more problems are: > >* Rewrite the resource tracker > > >Do you have an entire team (yes, it will take a relatively large team, > >especially when you include some cores dedicated to reviewing the code) > >that can dedicate a couple of development cycles to one of these? > > We can certainly help. > > >I sure > >don't. If and when we do, we can move forward on that and deprecate this > >model, if we find that to be a useful thing to do at that time. Right > >now, this is the best plan I have, that we can commit to completing in a > >reasonable timeframe. > > I respect that you're trying to solve the problem we have right now to make > operators lives Suck Less. But I think that a short term decision made now > would hurt a lot more later on. Yeah, I think that's the biggest disagreement here; I don't think we're blocking any work to make this even better in the future, just taking a step toward that. It will be extra work to unwind, and I think it's worth the tradeoff. // jim > -James > > On Wed, Dec 16, 2015 at 8:03 AM, Jim Rollenhagen <j...@jimrollenhagen.com> > wrote: > > > On Tue, Dec 15, 2015 at 05:19:19PM -0800, James Penick wrote: > > > > getting rid of the raciness of ClusteredComputeManager in my > > > >current deployment. And I'm willing to help other operators do the same. > > > > > > You do alleviate race, but at the cost of complexity and > > > unpredictability. 
Breaking that down, let's say we go with the current > > > plan and the compute host abstracts hardware specifics from Nova. The > > > compute host will report (sum of resources)/(sum of managed compute). If > > > the hardware beneath that compute host is heterogenous, then the > > resources > > > reported up to nova are not correct, and that really does have > > significant > > > impact on deployers. > > > > > > As an example: Let's say we have 20 nodes behind a compute process. > > Half > > > of those nodes have 24T of disk, the other have 1T. An attempt to > > schedule > > > a node with 24T of disk will fail, because Nova scheduler is only aware > > of > > > 12.5T of disk. > > > > We actually called out this problem in the Ironic midcycle and the Tokyo > > summit - we decided to report Ironic's total capacity from each compute > > host (resulting in over-reporting from Nova), and real capacity (for > > purposes of reporting, monitoring, whatever) should be fetched by > > operators from Ironic (IIRC, you specifically were okay with this > > limitation). This is still wrong, but it's the least wrong of any option > > (yes, all are wrong in some way). See the spec[1] for more details. > > > > > Ok, so one could argue that you should just run two compute processes > > per > > > type of host (N+1 redundancy). If you have different raid levels on two > > > otherwise identical hosts, you'll now need a new compute process for each > > > variant of hardware. What about host aggregates or availability zones? > > > This sounds like an N^2 problem. A mere 2 host flavors spread across 2 > > > availability zones means 8 compute processes. > > > > > > I have hundreds of hardware flavors, across different security, network, > > > and power availability zones. > > > > Nobody is talking about running a compute per flavor or capability. All > > compute hosts will be able to handle all ironic nodes. We *do* still > > need to figure out how to handle availability zones or host aggregates, > > but I expect we would pass along that data to be matched against. I > > think it would just be metadata on a node. Something like > > node.properties['availability_zone'] = 'rackspace-iad-az3' or what have > > you. Ditto for host aggregates - add the metadata to ironic to match > > what's in the host aggregate. I'm honestly not sure what to do about > > (anti-)affinity filters; we'll need help figuring that out. > > > > > >None of this precludes getting to a better world where Gantt actually > > > >exists, or the resource tracker works well with Ironic. > > > > > > It doesn't preclude it, no. But Gantt is dead[1], and I haven't seen any > > > movement to bring it back. > > > > Right, I didn't mean gantt specifically, but rather "splitting out the > > scheduler" like folks keep talking about. That's why I said "actually > > exists". :) > > > > > >It just gets us to an incrementally better model in the meantime. > > > > > > I strongly disagree. Will Ironic manage its own concept of availability > > > zones and host aggregates? What if nova changes their model, will Ironic > > > change to mirror it? If not I now need to model the same topology in two > > > different ways. > > > > Yes, and yes, assuming those things are valuable to our users. The > > former clearly is, the latter will clearly depend on the change but I > > expect we will evolve to continue to fit Nova's model of the world > > (after all, fitting into Nova's model is a huge chunk of what we do, and > > is exactly what we're trying to do with this work). 
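(To put numbers on the example quoted above: the following is only a toy sketch of the arithmetic, using the 20-node figures from James's example; it is not Nova or Ironic code.)

    # Toy arithmetic, not Nova/Ironic code. Figures from the example above:
    # 20 ironic nodes behind one nova-compute, half with 24T of disk, half with 1T.
    node_disk_tb = [24] * 10 + [1] * 10

    # An averaged/sharded report makes the scheduler see 12.5T, so a 24T
    # request can never be satisfied even though 10 nodes could take it:
    averaged_tb = sum(node_disk_tb) / float(len(node_disk_tb))   # 12.5
    can_place_24t = averaged_tb >= 24                            # False

    # The midcycle/Tokyo agreement instead has each compute host report
    # ironic's total capacity, which over-reports from Nova's side; real
    # capacity is fetched from ironic for reporting/monitoring:
    total_tb = sum(node_disk_tb)                                  # 250

Either way Nova's view of a heterogeneous pool is wrong; the agreement was only about which wrong answer is least harmful.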
> > > > > In that context, breaking out scheduling and "hiding" ironic resources > > > behind a compute process is going to create more problems than it will > > > solve, and is not the "Least bad" of the options to me. > > > > Again, the other solutions I'm seeing that *do* solve more problems are: > > > > * Rewrite the resource tracker > > * Break out the scheduler into a separate thing > > > > Do you have an entire team (yes, it will take a relatively large team, > > especially when you include some cores dedicated to reviewing the code) > > that can dedicate a couple of development cycles to one of these? I sure > > don't. If and when we do, we can move forward on that and deprecate this > > model, if we find that to be a useful thing to do at that time. Right > > now, this is the best plan I have, that we can commit to completing in a > > reasonable timeframe. > > > > // jim > > > > > > > > -James > > > [1] http://git.openstack.org/cgit/openstack/gantt/tree/README.rst > > > > > > On Mon, Dec 14, 2015 at 5:28 PM, Jim Rollenhagen <j...@jimrollenhagen.com > > > > > > wrote: > > > > > > > On Mon, Dec 14, 2015 at 04:15:42PM -0800, James Penick wrote: > > > > > I'm very much against it. > > > > > > > > > > In my environment we're going to be depending heavily on the nova > > > > > scheduler for affinity/anti-affinity of physical datacenter > > constructs, > > > > > TOR, Power, etc. Like other operators we need to also have a concept > > of > > > > > host aggregates and availability zones for our baremetal as well. If > > > > these > > > > > decisions move out of Nova, we'd have to replicate that entire > > concept of > > > > > topology inside of the Ironic scheduler. Why do that? > > > > > > > > > > I see there are 3 main problems: > > > > > > > > > > 1. Resource tracker sucks for Ironic. > > > > > 2. We need compute host HA > > > > > 3. We need to schedule compute resources in a consistent way. > > > > > > > > > > We've been exploring options to get rid of RT entirely. However, > > melwitt > > > > > suggested out that by improving RT itself, and changing it from a > > pull > > > > > model to a push, we skip a lot of these problems. I think it's an > > > > excellent > > > > > point. If RT moves to a push model, Ironic can dynamically register > > nodes > > > > > as they're added, consumed, claimed, etc and update their state in > > Nova. > > > > > > > > > > Compute host HA is critical for us, too. However, if the compute > > hosts > > > > are > > > > > not responsible for any complex scheduling behaviors, it becomes much > > > > > simpler to move the compute hosts to being nothing more than dumb > > workers > > > > > selected at random. > > > > > > > > > > With this model, the Nova scheduler can still select compute > > resources > > > > in > > > > > the way that it expects, and deployers can expect to build one > > system to > > > > > manage VM and BM. We get rid of RT race conditions, and gain compute > > HA. > > > > > > > > Right, so Deva mentioned this here. Copied from below: > > > > > > > > > > > Some folks are asking us to implement a > > non-virtualization-centric > > > > > > > scheduler / resource tracker in Nova, or advocating that we wait > > for > > > > the > > > > > > > Nova scheduler to be split-out into a separate project. 
I do not > > > > believe > > > > > > > the Nova team is interested in the former, I do not want to wait > > for > > > > the > > > > > > > latter, and I do not believe that either one will be an adequate > > > > solution > > > > > > > -- there are other clients (besides Nova) that need to schedule > > > > workloads > > > > > > > on Ironic. > > > > > > > > And I totally agree with him. We can rewrite the resource tracker, or > > we > > > > can break out the scheduler. That will take years - what do you, as an > > > > operator, plan to do in the meantime? As an operator of ironic myself, > > > > I'm willing to eat the pain of figuring out what to do with my > > > > out-of-tree filters (and cells!), in favor of getting rid of the > > > > raciness of ClusteredComputeManager in my current deployment. And I'm > > > > willing to help other operators do the same. > > > > > > > > We've been talking about this for close to a year already - we need > > > > to actually do something. I don't believe we can do this in a > > > > reasonable timeline *and* make everybody (ironic devs, nova devs, and > > > > operators) happy. However, as we said elsewhere in the thread, the old > > > > model will go through a deprecation process, and we can wait to remove > > > > it until we do figure out the path forward for operators like yourself. > > > > Then operators that need out-of-tree filters and the like can keep > > doing > > > > what they're doing, while they help us (or just wait) to build > > something > > > > that meets everyone's needs. > > > > > > > > None of this precludes getting to a better world where Gaant actually > > > > exists, or the resource tracker works well with Ironic. It just gets us > > > > to an incrementally better model in the meantime. > > > > > > > > If someone has a *concrete* proposal (preferably in code) for an > > > > alternative > > > > that can be done relatively quickly and also keep everyone happy here, > > I'm > > > > all ears. But I don't believe one exists at this time, and I'm inclined > > > > to keep rolling forward with what we've got here. > > > > > > > > // jim > > > > > > > > > > > > > > -James > > > > > > > > > > On Thu, Dec 10, 2015 at 4:42 PM, Jim Rollenhagen < > > j...@jimrollenhagen.com > > > > > > > > > > wrote: > > > > > > > > > > > On Thu, Dec 10, 2015 at 03:57:59PM -0800, Devananda van der Veen > > wrote: > > > > > > > All, > > > > > > > > > > > > > > I'm going to attempt to summarize a discussion that's been going > > on > > > > for > > > > > > > over a year now, and still remains unresolved. > > > > > > > > > > > > > > TLDR; > > > > > > > -------- > > > > > > > > > > > > > > The main touch-point between Nova and Ironic continues to be a > > pain > > > > > > point, > > > > > > > and despite many discussions between the teams over the last year > > > > > > resulting > > > > > > > in a solid proposal, we have not been able to get consensus on a > > > > solution > > > > > > > that meets everyone's needs. > > > > > > > > > > > > > > Some folks are asking us to implement a > > non-virtualization-centric > > > > > > > scheduler / resource tracker in Nova, or advocating that we wait > > for > > > > the > > > > > > > Nova scheduler to be split-out into a separate project. 
I do not > > > > believe > > > > > > > the Nova team is interested in the former, I do not want to wait > > for > > > > the > > > > > > > latter, and I do not believe that either one will be an adequate > > > > solution > > > > > > > -- there are other clients (besides Nova) that need to schedule > > > > workloads > > > > > > > on Ironic. > > > > > > > > > > > > > > We need to decide on a path of least pain and then proceed. I > > really > > > > want > > > > > > > to get this done in Mitaka. > > > > > > > > > > > > > > > > > > > > > Long version: > > > > > > > ----------------- > > > > > > > > > > > > > > During Liberty, Jim and I worked with Jay Pipes and others on the > > > > Nova > > > > > > team > > > > > > > to come up with a plan. That plan was proposed in a Nova spec > > [1] and > > > > > > > approved in October, shortly before the Mitaka summit. It got > > > > significant > > > > > > > reviews from the Ironic team, since it is predicated on work > > being > > > > done > > > > > > in > > > > > > > Ironic to expose a new "reservations" API endpoint. The details > > of > > > > that > > > > > > > Ironic change were proposed separately [2] but have deadlocked. > > > > > > Discussions > > > > > > > with some operators at and after the Mitaka summit have > > highlighted a > > > > > > > problem with this plan. > > > > > > > > > > > > > > Actually, more than one, so to better understand the divergent > > > > viewpoints > > > > > > > that result in the current deadlock, I drew a diagram [3]. If you > > > > haven't > > > > > > > read both the Nova and Ironic specs already, this diagram > > probably > > > > won't > > > > > > > make sense to you. I'll attempt to explain it a bit with more > > words. > > > > > > > > > > > > > > > > > > > > > [A] > > > > > > > The Nova team wants to remove the (Host, Node) tuple from all the > > > > places > > > > > > > that this exists, and return to scheduling only based on Compute > > > > Host. > > > > > > They > > > > > > > also don't want to change any existing scheduler filters > > (especially > > > > not > > > > > > > compute_capabilities_filter) or the filter scheduler class or > > plugin > > > > > > > mechanisms. And, as far as I understand it, they're not > > interested in > > > > > > > accepting a filter plugin that calls out to external APIs (eg, > > > > Ironic) to > > > > > > > identify a Node and pass that Node's UUID to the Compute Host. > > [[ > > > > nova > > > > > > > team: please correct me on any point here where I'm wrong, or > > your > > > > > > > collective views have changed over the last year. ]] > > > > > > > > > > > > > > [B] > > > > > > > OpenStack deployers who are using Nova + Ironic rely on a few > > things: > > > > > > > - compute_capabilities_filter to match > > > > node.properties['capabilities'] > > > > > > > against flavor extra_specs. > > > > > > > - other downstream nova scheduler filters that do other sorts of > > > > hardware > > > > > > > matching > > > > > > > These deployers clearly and rightly do not want us to take away > > > > either of > > > > > > > these capabilities, so anything we do needs to be backwards > > > > compatible > > > > > > with > > > > > > > any current Nova scheduler plugins -- even downstream ones. > > > > > > > > > > > > > > [C] To meet the compatibility requirements of [B] without > > requiring > > > > the > > > > > > > nova-scheduler team to do the work, we would need to forklift > > some > > > > parts > > > > > > of > > > > > > > the nova-scheduler code into Ironic. 
But I think that's terrible, > > > > and I > > > > > > > don't think any OpenStack developers will like it. Furthermore, > > > > operators > > > > > > > have already expressed their distase for this because they want > > to > > > > use > > > > > > the > > > > > > > same filters for virtual and baremetal instances but do not want > > to > > > > > > > duplicate the code (because we all know that's a recipe for > > drift). > > > > > > > > > > > > > > [D] > > > > > > > What ever solution we devise for scheduling bare metal resources > > in > > > > > > Ironic > > > > > > > needs to perform well at the scale Ironic deployments are aiming > > for > > > > (eg, > > > > > > > thousands of Nodes) without the use of Cells. It also must be > > > > integrable > > > > > > > with other software (eg, it should be exposed in our REST API). > > And > > > > it > > > > > > must > > > > > > > allow us to run more than one (active-active) nova-compute > > process, > > > > which > > > > > > > we can't today. > > > > > > > > > > > > > > > > > > > > > OK. That's a lot of words... bear with me, though, as I'm not > > done > > > > yet... > > > > > > > > > > > > > > This drawing [3] is a Venn diagram, but not everything overlaps. > > The > > > > Nova > > > > > > > and Ironic specs [0],[1] meet the needs of the Nova team and the > > > > Ironic > > > > > > > team, and will provide a more performant, highly-available > > solution, > > > > that > > > > > > > is easier to use with other schedulers or datacenter-management > > > > tools. > > > > > > > However, this solution does not meet the needs of some current > > > > OpenStack > > > > > > > Operators because it will not support Nova Scheduler filter > > plugins. > > > > > > Thus, > > > > > > > in the diagram, [A] and [D] overlap but neither one intersects > > with > > > > [B]. > > > > > > > > > > > > > > > > > > > > > Summary > > > > > > > -------------- > > > > > > > > > > > > > > We have proposed a solution that fits ironic's HA model into > > > > > > nova-compute's > > > > > > > failure domain model, but that's only half of the picture -- in > > so > > > > doing, > > > > > > > we assumed that scheduling of bare metal resources was simplistic > > > > when, > > > > > > in > > > > > > > fact, it needs to be just as rich as the scheduling of virtual > > > > resources. > > > > > > > > > > > > > > So, at this point, I think we need to accept that the scheduling > > of > > > > > > > virtualized and bare metal workloads are two different problem > > > > domains > > > > > > that > > > > > > > are equally complex. 
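(A side note on [B] above: the matching deployers rely on from compute_capabilities_filter is roughly the comparison sketched below. This is a simplified illustration, not the actual ComputeCapabilitiesFilter code; it assumes the usual convention of a comma-separated "key:value" capabilities string on the node and "capabilities:<key>" flavor extra_specs.)

    # Simplified sketch of the matching described in [B]; not the real filter.
    def capabilities_match(node_properties, flavor_extra_specs):
        # node.properties['capabilities'] is a comma-separated "key:value" string
        caps = dict(
            item.split(':', 1)
            for item in node_properties.get('capabilities', '').split(',')
            if ':' in item
        )
        for key, wanted in flavor_extra_specs.items():
            if not key.startswith('capabilities:'):
                continue
            if caps.get(key[len('capabilities:'):]) != wanted:
                return False
        return True

    # e.g. a node tagged with a RAID level only matches flavors that ask for it:
    node_props = {'capabilities': 'raid_level:10,boot_mode:bios'}
    extra_specs = {'capabilities:raid_level': '10'}
    assert capabilities_match(node_props, extra_specs)

Whatever model we land on needs to keep this kind of matching working (including downstream variants of it), or give operators a clear migration path.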
> > > > > > > > > > > > > > Either, we: > > > > > > > * build a separate scheduler process in Ironic, forking the Nova > > > > > > scheduler > > > > > > > as a starting point so as to be compatible with existing > > plugins; or > > > > > > > * begin building a direct integration between nova-scheduler and > > > > ironic, > > > > > > > and create a non-virtualization-centric resource tracker within > > > > Nova; or > > > > > > > * proceed with the plan we previously outlined, accept that this > > > > isn't > > > > > > > going to be backwards compatible with nova filter plugins, and > > > > apologize > > > > > > to > > > > > > > any operators who rely on the using the same scheduler plugins > > for > > > > > > > baremetal and virtual resources; or > > > > > > > * keep punting on this, bringing pain and suffering to all > > operators > > > > of > > > > > > > bare metal clouds, because nova-compute must be run as exactly > > one > > > > > > process > > > > > > > for all sizes of clouds. > > > > > > > > > > > > Thanks for summing this up, Deva. The planned solution still gets > > my > > > > > > vote; we build that, deprecate the old single compute host model > > where > > > > > > nova handles all scheduling, and in the meantime figure out the > > gaps > > > > > > that operators need filled and the best way to fill them. Maybe we > > can > > > > > > fill them by the end of the deprecation period (it's going to need > > to > > > > be > > > > > > a couple cycles), or maybe operators that care about these things > > need > > > > > > to carry some downstream patches for a bit. > > > > > > > > > > > > I'd be curious how many ops out there run ironic with custom > > scheduler > > > > > > filters, or rely on the compute capabilities filters. Rackspace > > has one > > > > > > out of tree weigher for image caching, but are okay with moving > > forward > > > > > > and doing what it takes to move that. > > > > > > > > > > > > // jim > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for reading, > > > > > > > Devananda > > > > > > > > > > > > > > > > > > > > > > > > > > > > [0] Yes, there are some hacks to work around this, but they are > > bad. > > > > > > Please > > > > > > > don't encourage their use. 
> > > > > > > [1] https://review.openstack.org/#/c/194453/ > > > > > > > [2] https://review.openstack.org/#/c/204641/ > > > > > > > [3] https://drive.google.com/file/d/0Bz_nyJF_YYGZWnZ2dlAyejgtdVU/view?usp=sharing

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev