On Wed, Dec 16, 2015 at 01:51:47PM -0800, James Penick wrote: > >We actually called out this problem in the Ironic midcycle and the Tokyo > >summit - we decided to report Ironic's total capacity from each compute > >host (resulting in over-reporting from Nova), and real capacity (for > >purposes of reporting, monitoring, whatever) should be fetched by > >operators from Ironic (IIRC, you specifically were okay with this > >limitation). This is still wrong, but it's the least wrong of any option > >(yes, all are wrong in some way). See the spec[1] for more details. > > I do recall that discussion, but the merged spec says: > > "In general, a nova-compute running the Ironic virt driver should expose > (total resources)/(number of compute services). This allows for resources > to be > sharded across multiple compute services without over-reporting resources." > > I agree that what you said via email is Less Awful than what I read on the > spec (Did I misread it? Am I full of crazy?)
Oh wow, that was totally missed when we figured that problem out. If you look down a few paragraphs (under what the reservation request looks like), it's got more correct words. Sorry about that. This should clear it up: https://review.openstack.org/#/c/258687/ > >We *do* still > >need to figure out how to handle availability zones or host aggregates, > >but I expect we would pass along that data to be matched against. I > >think it would just be metadata on a node. Something like > >node.properties['availability_zone'] = 'rackspace-iad-az3' or what have > >you. Ditto for host aggregates - add the metadata to ironic to match > >what's in the host aggregate. I'm honestly not sure what to do about > >(anti-)affinity filters; we'll need help figuring that out. > & > >Right, I didn't mean gantt specifically, but rather "splitting out the > >scheduler" like folks keep talking about. That's why I said "actually > >exists". :) > > I think splitting out the scheduler isn't going to be realistic. My > feeling is, if Nova is going to fulfill its destiny of being The Compute > Service, then the scheduler will stay put and the VM pieces will split out > into another service (Which I think should be named "Seamus" so I can refer > to it as "The wee baby Seamus"). Sure, that's honestly the best option, but will take even longer. :) > (re: ironic maintaining host aggregates) > >Yes, and yes, assuming those things are valuable to our users. The > >former clearly is, the latter will clearly depend on the change but I > >expect we will evolve to continue to fit Nova's model of the world > >(after all, fitting into Nova's model is a huge chunk of what we do, and > >is exactly what we're trying to do with this work). > > It's a lot easier to fit into the nova model if we just use what's there > and don't bother trying to replicate it. The problem is, the Nova model is "one compute service per physical host". This is actually *much* easier to implement, if you want to run a compute service per physical host. :) > >Again, the other solutions I'm seeing that *do* solve more problems are: > >* Rewrite the resource tracker > > >Do you have an entire team (yes, it will take a relatively large team, > >especially when you include some cores dedicated to reviewing the code) > >that can dedicate a couple of development cycles to one of these? > > We can certainly help. > > >I sure > >don't. If and when we do, we can move forward on that and deprecate this > >model, if we find that to be a useful thing to do at that time. Right > >now, this is the best plan I have, that we can commit to completing in a > >reasonable timeframe. > > I respect that you're trying to solve the problem we have right now to make > operators lives Suck Less. But I think that a short term decision made now > would hurt a lot more later on. Yeah, I think that's the biggest disagreement here; I don't think we're blocking any work to make this even better in the future, just taking a step toward that. It will be extra work to unwind, and I think it's worth the tradeoff. // jim > -James > > On Wed, Dec 16, 2015 at 8:03 AM, Jim Rollenhagen <j...@jimrollenhagen.com> > wrote: > > > On Tue, Dec 15, 2015 at 05:19:19PM -0800, James Penick wrote: > > > > getting rid of the raciness of ClusteredComputeManager in my > > > >current deployment. And I'm willing to help other operators do the same. > > > > > > You do alleviate race, but at the cost of complexity and > > > unpredictability. 
Breaking that down, let's say we go with the current > > > plan and the compute host abstracts hardware specifics from Nova. The > > > compute host will report (sum of resources)/(sum of managed compute). If > > > the hardware beneath that compute host is heterogenous, then the > > resources > > > reported up to nova are not correct, and that really does have > > significant > > > impact on deployers. > > > > > > As an example: Let's say we have 20 nodes behind a compute process. > > Half > > > of those nodes have 24T of disk, the other have 1T. An attempt to > > schedule > > > a node with 24T of disk will fail, because Nova scheduler is only aware > > of > > > 12.5T of disk. > > > > We actually called out this problem in the Ironic midcycle and the Tokyo > > summit - we decided to report Ironic's total capacity from each compute > > host (resulting in over-reporting from Nova), and real capacity (for > > purposes of reporting, monitoring, whatever) should be fetched by > > operators from Ironic (IIRC, you specifically were okay with this > > limitation). This is still wrong, but it's the least wrong of any option > > (yes, all are wrong in some way). See the spec[1] for more details. > > > > > Ok, so one could argue that you should just run two compute processes > > per > > > type of host (N+1 redundancy). If you have different raid levels on two > > > otherwise identical hosts, you'll now need a new compute process for each > > > variant of hardware. What about host aggregates or availability zones? > > > This sounds like an N^2 problem. A mere 2 host flavors spread across 2 > > > availability zones means 8 compute processes. > > > > > > I have hundreds of hardware flavors, across different security, network, > > > and power availability zones. > > > > Nobody is talking about running a compute per flavor or capability. All > > compute hosts will be able to handle all ironic nodes. We *do* still > > need to figure out how to handle availability zones or host aggregates, > > but I expect we would pass along that data to be matched against. I > > think it would just be metadata on a node. Something like > > node.properties['availability_zone'] = 'rackspace-iad-az3' or what have > > you. Ditto for host aggregates - add the metadata to ironic to match > > what's in the host aggregate. I'm honestly not sure what to do about > > (anti-)affinity filters; we'll need help figuring that out. > > > > > >None of this precludes getting to a better world where Gantt actually > > > >exists, or the resource tracker works well with Ironic. > > > > > > It doesn't preclude it, no. But Gantt is dead[1], and I haven't seen any > > > movement to bring it back. > > > > Right, I didn't mean gantt specifically, but rather "splitting out the > > scheduler" like folks keep talking about. That's why I said "actually > > exists". :) > > > > > >It just gets us to an incrementally better model in the meantime. > > > > > > I strongly disagree. Will Ironic manage its own concept of availability > > > zones and host aggregates? What if nova changes their model, will Ironic > > > change to mirror it? If not I now need to model the same topology in two > > > different ways. > > > > Yes, and yes, assuming those things are valuable to our users. The > > former clearly is, the latter will clearly depend on the change but I > > expect we will evolve to continue to fit Nova's model of the world > > (after all, fitting into Nova's model is a huge chunk of what we do, and > > is exactly what we're trying to do with this work). 
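(To put numbers on the example quoted above: the following is only a toy sketch of the arithmetic, using the 20-node figures from James's example; it is not Nova or Ironic code.)

    # Toy arithmetic, not Nova/Ironic code. Figures from the example above:
    # 20 ironic nodes behind one nova-compute, half with 24T of disk, half with 1T.
    node_disk_tb = [24] * 10 + [1] * 10

    # An averaged/sharded report makes the scheduler see 12.5T, so a 24T
    # request can never be satisfied even though 10 nodes could take it:
    averaged_tb = sum(node_disk_tb) / float(len(node_disk_tb))   # 12.5
    can_place_24t = averaged_tb >= 24                            # False

    # The midcycle/Tokyo agreement instead has each compute host report
    # ironic's total capacity, which over-reports from Nova's side; real
    # capacity is fetched from ironic for reporting/monitoring:
    total_tb = sum(node_disk_tb)                                  # 250

Either way Nova's view of a heterogeneous pool is wrong; the agreement was only about which wrong answer is least harmful.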
> > > > > In that context, breaking out scheduling and "hiding" ironic resources > > > behind a compute process is going to create more problems than it will > > > solve, and is not the "Least bad" of the options to me. > > > > Again, the other solutions I'm seeing that *do* solve more problems are: > > > > * Rewrite the resource tracker > > * Break out the scheduler into a separate thing > > > > Do you have an entire team (yes, it will take a relatively large team, > > especially when you include some cores dedicated to reviewing the code) > > that can dedicate a couple of development cycles to one of these? I sure > > don't. If and when we do, we can move forward on that and deprecate this > > model, if we find that to be a useful thing to do at that time. Right > > now, this is the best plan I have, that we can commit to completing in a > > reasonable timeframe. > > > > // jim > > > > > > > > -James > > > [1] http://git.openstack.org/cgit/openstack/gantt/tree/README.rst > > > > > > On Mon, Dec 14, 2015 at 5:28 PM, Jim Rollenhagen <j...@jimrollenhagen.com > > > > > > wrote: > > > > > > > On Mon, Dec 14, 2015 at 04:15:42PM -0800, James Penick wrote: > > > > > I'm very much against it. > > > > > > > > > > In my environment we're going to be depending heavily on the nova > > > > > scheduler for affinity/anti-affinity of physical datacenter > > constructs, > > > > > TOR, Power, etc. Like other operators we need to also have a concept > > of > > > > > host aggregates and availability zones for our baremetal as well. If > > > > these > > > > > decisions move out of Nova, we'd have to replicate that entire > > concept of > > > > > topology inside of the Ironic scheduler. Why do that? > > > > > > > > > > I see there are 3 main problems: > > > > > > > > > > 1. Resource tracker sucks for Ironic. > > > > > 2. We need compute host HA > > > > > 3. We need to schedule compute resources in a consistent way. > > > > > > > > > > We've been exploring options to get rid of RT entirely. However, > > melwitt > > > > > suggested out that by improving RT itself, and changing it from a > > pull > > > > > model to a push, we skip a lot of these problems. I think it's an > > > > excellent > > > > > point. If RT moves to a push model, Ironic can dynamically register > > nodes > > > > > as they're added, consumed, claimed, etc and update their state in > > Nova. > > > > > > > > > > Compute host HA is critical for us, too. However, if the compute > > hosts > > > > are > > > > > not responsible for any complex scheduling behaviors, it becomes much > > > > > simpler to move the compute hosts to being nothing more than dumb > > workers > > > > > selected at random. > > > > > > > > > > With this model, the Nova scheduler can still select compute > > resources > > > > in > > > > > the way that it expects, and deployers can expect to build one > > system to > > > > > manage VM and BM. We get rid of RT race conditions, and gain compute > > HA. > > > > > > > > Right, so Deva mentioned this here. Copied from below: > > > > > > > > > > > Some folks are asking us to implement a > > non-virtualization-centric > > > > > > > scheduler / resource tracker in Nova, or advocating that we wait > > for > > > > the > > > > > > > Nova scheduler to be split-out into a separate project. 
I do not > > > > believe > > > > > > > the Nova team is interested in the former, I do not want to wait > > for > > > > the > > > > > > > latter, and I do not believe that either one will be an adequate > > > > solution > > > > > > > -- there are other clients (besides Nova) that need to schedule > > > > workloads > > > > > > > on Ironic. > > > > > > > > And I totally agree with him. We can rewrite the resource tracker, or > > we > > > > can break out the scheduler. That will take years - what do you, as an > > > > operator, plan to do in the meantime? As an operator of ironic myself, > > > > I'm willing to eat the pain of figuring out what to do with my > > > > out-of-tree filters (and cells!), in favor of getting rid of the > > > > raciness of ClusteredComputeManager in my current deployment. And I'm > > > > willing to help other operators do the same. > > > > > > > > We've been talking about this for close to a year already - we need > > > > to actually do something. I don't believe we can do this in a > > > > reasonable timeline *and* make everybody (ironic devs, nova devs, and > > > > operators) happy. However, as we said elsewhere in the thread, the old > > > > model will go through a deprecation process, and we can wait to remove > > > > it until we do figure out the path forward for operators like yourself. > > > > Then operators that need out-of-tree filters and the like can keep > > doing > > > > what they're doing, while they help us (or just wait) to build > > something > > > > that meets everyone's needs. > > > > > > > > None of this precludes getting to a better world where Gaant actually > > > > exists, or the resource tracker works well with Ironic. It just gets us > > > > to an incrementally better model in the meantime. > > > > > > > > If someone has a *concrete* proposal (preferably in code) for an > > > > alternative > > > > that can be done relatively quickly and also keep everyone happy here, > > I'm > > > > all ears. But I don't believe one exists at this time, and I'm inclined > > > > to keep rolling forward with what we've got here. > > > > > > > > // jim > > > > > > > > > > > > > > -James > > > > > > > > > > On Thu, Dec 10, 2015 at 4:42 PM, Jim Rollenhagen < > > j...@jimrollenhagen.com > > > > > > > > > > wrote: > > > > > > > > > > > On Thu, Dec 10, 2015 at 03:57:59PM -0800, Devananda van der Veen > > wrote: > > > > > > > All, > > > > > > > > > > > > > > I'm going to attempt to summarize a discussion that's been going > > on > > > > for > > > > > > > over a year now, and still remains unresolved. > > > > > > > > > > > > > > TLDR; > > > > > > > -------- > > > > > > > > > > > > > > The main touch-point between Nova and Ironic continues to be a > > pain > > > > > > point, > > > > > > > and despite many discussions between the teams over the last year > > > > > > resulting > > > > > > > in a solid proposal, we have not been able to get consensus on a > > > > solution > > > > > > > that meets everyone's needs. > > > > > > > > > > > > > > Some folks are asking us to implement a > > non-virtualization-centric > > > > > > > scheduler / resource tracker in Nova, or advocating that we wait > > for > > > > the > > > > > > > Nova scheduler to be split-out into a separate project. 
I do not > > > > believe > > > > > > > the Nova team is interested in the former, I do not want to wait > > for > > > > the > > > > > > > latter, and I do not believe that either one will be an adequate > > > > solution > > > > > > > -- there are other clients (besides Nova) that need to schedule > > > > workloads > > > > > > > on Ironic. > > > > > > > > > > > > > > We need to decide on a path of least pain and then proceed. I > > really > > > > want > > > > > > > to get this done in Mitaka. > > > > > > > > > > > > > > > > > > > > > Long version: > > > > > > > ----------------- > > > > > > > > > > > > > > During Liberty, Jim and I worked with Jay Pipes and others on the > > > > Nova > > > > > > team > > > > > > > to come up with a plan. That plan was proposed in a Nova spec > > [1] and > > > > > > > approved in October, shortly before the Mitaka summit. It got > > > > significant > > > > > > > reviews from the Ironic team, since it is predicated on work > > being > > > > done > > > > > > in > > > > > > > Ironic to expose a new "reservations" API endpoint. The details > > of > > > > that > > > > > > > Ironic change were proposed separately [2] but have deadlocked. > > > > > > Discussions > > > > > > > with some operators at and after the Mitaka summit have > > highlighted a > > > > > > > problem with this plan. > > > > > > > > > > > > > > Actually, more than one, so to better understand the divergent > > > > viewpoints > > > > > > > that result in the current deadlock, I drew a diagram [3]. If you > > > > haven't > > > > > > > read both the Nova and Ironic specs already, this diagram > > probably > > > > won't > > > > > > > make sense to you. I'll attempt to explain it a bit with more > > words. > > > > > > > > > > > > > > > > > > > > > [A] > > > > > > > The Nova team wants to remove the (Host, Node) tuple from all the > > > > places > > > > > > > that this exists, and return to scheduling only based on Compute > > > > Host. > > > > > > They > > > > > > > also don't want to change any existing scheduler filters > > (especially > > > > not > > > > > > > compute_capabilities_filter) or the filter scheduler class or > > plugin > > > > > > > mechanisms. And, as far as I understand it, they're not > > interested in > > > > > > > accepting a filter plugin that calls out to external APIs (eg, > > > > Ironic) to > > > > > > > identify a Node and pass that Node's UUID to the Compute Host. > > [[ > > > > nova > > > > > > > team: please correct me on any point here where I'm wrong, or > > your > > > > > > > collective views have changed over the last year. ]] > > > > > > > > > > > > > > [B] > > > > > > > OpenStack deployers who are using Nova + Ironic rely on a few > > things: > > > > > > > - compute_capabilities_filter to match > > > > node.properties['capabilities'] > > > > > > > against flavor extra_specs. > > > > > > > - other downstream nova scheduler filters that do other sorts of > > > > hardware > > > > > > > matching > > > > > > > These deployers clearly and rightly do not want us to take away > > > > either of > > > > > > > these capabilities, so anything we do needs to be backwards > > > > compatible > > > > > > with > > > > > > > any current Nova scheduler plugins -- even downstream ones. > > > > > > > > > > > > > > [C] To meet the compatibility requirements of [B] without > > requiring > > > > the > > > > > > > nova-scheduler team to do the work, we would need to forklift > > some > > > > parts > > > > > > of > > > > > > > the nova-scheduler code into Ironic. 
But I think that's terrible, > > > > and I > > > > > > > don't think any OpenStack developers will like it. Furthermore, > > > > operators > > > > > > > have already expressed their distase for this because they want > > to > > > > use > > > > > > the > > > > > > > same filters for virtual and baremetal instances but do not want > > to > > > > > > > duplicate the code (because we all know that's a recipe for > > drift). > > > > > > > > > > > > > > [D] > > > > > > > What ever solution we devise for scheduling bare metal resources > > in > > > > > > Ironic > > > > > > > needs to perform well at the scale Ironic deployments are aiming > > for > > > > (eg, > > > > > > > thousands of Nodes) without the use of Cells. It also must be > > > > integrable > > > > > > > with other software (eg, it should be exposed in our REST API). > > And > > > > it > > > > > > must > > > > > > > allow us to run more than one (active-active) nova-compute > > process, > > > > which > > > > > > > we can't today. > > > > > > > > > > > > > > > > > > > > > OK. That's a lot of words... bear with me, though, as I'm not > > done > > > > yet... > > > > > > > > > > > > > > This drawing [3] is a Venn diagram, but not everything overlaps. > > The > > > > Nova > > > > > > > and Ironic specs [0],[1] meet the needs of the Nova team and the > > > > Ironic > > > > > > > team, and will provide a more performant, highly-available > > solution, > > > > that > > > > > > > is easier to use with other schedulers or datacenter-management > > > > tools. > > > > > > > However, this solution does not meet the needs of some current > > > > OpenStack > > > > > > > Operators because it will not support Nova Scheduler filter > > plugins. > > > > > > Thus, > > > > > > > in the diagram, [A] and [D] overlap but neither one intersects > > with > > > > [B]. > > > > > > > > > > > > > > > > > > > > > Summary > > > > > > > -------------- > > > > > > > > > > > > > > We have proposed a solution that fits ironic's HA model into > > > > > > nova-compute's > > > > > > > failure domain model, but that's only half of the picture -- in > > so > > > > doing, > > > > > > > we assumed that scheduling of bare metal resources was simplistic > > > > when, > > > > > > in > > > > > > > fact, it needs to be just as rich as the scheduling of virtual > > > > resources. > > > > > > > > > > > > > > So, at this point, I think we need to accept that the scheduling > > of > > > > > > > virtualized and bare metal workloads are two different problem > > > > domains > > > > > > that > > > > > > > are equally complex. 
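(A side note on [B] above: the matching deployers rely on from compute_capabilities_filter is roughly the comparison sketched below. This is a simplified illustration, not the actual ComputeCapabilitiesFilter code; it assumes the usual convention of a comma-separated "key:value" capabilities string on the node and "capabilities:<key>" flavor extra_specs.)

    # Simplified sketch of the matching described in [B]; not the real filter.
    def capabilities_match(node_properties, flavor_extra_specs):
        # node.properties['capabilities'] is a comma-separated "key:value" string
        caps = dict(
            item.split(':', 1)
            for item in node_properties.get('capabilities', '').split(',')
            if ':' in item
        )
        for key, wanted in flavor_extra_specs.items():
            if not key.startswith('capabilities:'):
                continue
            if caps.get(key[len('capabilities:'):]) != wanted:
                return False
        return True

    # e.g. a node tagged with a RAID level only matches flavors that ask for it:
    node_props = {'capabilities': 'raid_level:10,boot_mode:bios'}
    extra_specs = {'capabilities:raid_level': '10'}
    assert capabilities_match(node_props, extra_specs)

Whatever model we land on needs to keep this kind of matching working (including downstream variants of it), or give operators a clear migration path.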
> > > > > > > > > > > > > > Either, we: > > > > > > > * build a separate scheduler process in Ironic, forking the Nova > > > > > > scheduler > > > > > > > as a starting point so as to be compatible with existing > > plugins; or > > > > > > > * begin building a direct integration between nova-scheduler and > > > > ironic, > > > > > > > and create a non-virtualization-centric resource tracker within > > > > Nova; or > > > > > > > * proceed with the plan we previously outlined, accept that this > > > > isn't > > > > > > > going to be backwards compatible with nova filter plugins, and > > > > apologize > > > > > > to > > > > > > > any operators who rely on the using the same scheduler plugins > > for > > > > > > > baremetal and virtual resources; or > > > > > > > * keep punting on this, bringing pain and suffering to all > > operators > > > > of > > > > > > > bare metal clouds, because nova-compute must be run as exactly > > one > > > > > > process > > > > > > > for all sizes of clouds. > > > > > > > > > > > > Thanks for summing this up, Deva. The planned solution still gets > > my > > > > > > vote; we build that, deprecate the old single compute host model > > where > > > > > > nova handles all scheduling, and in the meantime figure out the > > gaps > > > > > > that operators need filled and the best way to fill them. Maybe we > > can > > > > > > fill them by the end of the deprecation period (it's going to need > > to > > > > be > > > > > > a couple cycles), or maybe operators that care about these things > > need > > > > > > to carry some downstream patches for a bit. > > > > > > > > > > > > I'd be curious how many ops out there run ironic with custom > > scheduler > > > > > > filters, or rely on the compute capabilities filters. Rackspace > > has one > > > > > > out of tree weigher for image caching, but are okay with moving > > forward > > > > > > and doing what it takes to move that. > > > > > > > > > > > > // jim > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for reading, > > > > > > > Devananda > > > > > > > > > > > > > > > > > > > > > > > > > > > > [0] Yes, there are some hacks to work around this, but they are > > bad. > > > > > > Please > > > > > > > don't encourage their use. 
> > > > > > > [1] https://review.openstack.org/#/c/194453/ > > > > > > > [2] https://review.openstack.org/#/c/204641/ > > > > > > > [3] https://drive.google.com/file/d/0Bz_nyJF_YYGZWnZ2dlAyejgtdVU/view?usp=sharing

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev