Hey James, thanks for starting this thread: it's clear we haven't articulated what we've discussed well enough [it's been a slog building up from the bottom...]
I think we need to set specific goals - latency, HA, diagnostics - before designing scaling approaches makes any sense: we can't achieve things we haven't set out to achieve. For instance, if the entire set of goals was 'support 10K-node overclouds', I think we can do that today with a 2-machine undercloud control plane in full-HA mode. So we need to be really clear: are we talking about scaling, or latency @ scale, or ops @ scale - and we need to identify the failure modes we should cater to, vs the ones we shouldn't (or that are outside our domain, e.g. 'you need a fully end-to-end multipath network if you want network resiliency').

My vision for TripleO/undercloud and scale in the long term is:
- A fully redundant, self-healing undercloud (which implies self-hosting)
- And appropriate anti-affinity aggregates so that common failure domains can be avoided
- With a scale-up Heat template that identifies the way to grow capacity
- Able to deploy a 1K-node overcloud in < an hour(*)
- And a 10K-node one [if we can get a suitable test environment] in < 2 hours

So that's sublinear performance degradation as scale increases.

For TripleO/overcloud and scale, that's something where we need to synthesise best practices from existing deployers - e.g. cells and so on - to deliver K+-node-scale configurations, but it's fundamentally decoupled from the undercloud: Heat is growing cross-cloud deployment facilities, so if we need multiple underclouds as a failure-mitigation strategy, we can deploy one overcloud across multiple underclouds that way. I'm not convinced we need that complexity though: large network fabrics are completely capable of shipping overcloud images to machines in a couple of seconds per machine...

(*): Number pulled out of a hat. We'll need to drive it lower over time, but given that we need time to check new builds are stable, and to live-migrate thousands of VMs concurrently across hundreds of hypervisors, I think 1 hour for a 1K-node cloud deployment is sufficiently aggressive for now (there's a quick back-of-envelope on the image-push side of this below).

Now, how to achieve this? The current all-in-one control plane is like that for three key reasons:
- small clouds need low-overhead control planes: running 12 or 15 machines to deploy a 3-node overcloud doesn't make sense
- bootstrapping an environment has to start on one machine by definition
- we haven't finished enough of the overall plumbing story to be working on the scale-out story in much detail

(I'm very interested in where you got the idea that all-nodes-identical was the scaling plan for TripleO - it isn't :))

Our d-i-b elements are already suitable for scaling different components independently - that's why nova and nova-kvm are separate: nova installs the nova software, and nova-kvm installs the additional bits for a KVM hypervisor and configures the service to talk to the bus. This is how the overcloud scales.

Now that we have reliable all-the-way-to-overcloud deployments working in devtest, we've started working on image-based updates (https://etherpad.openstack.org/tripleo-image-updates), which are a necessary precondition to scaling the undercloud control plane - because if you can't update a machine's role, it's really much harder to evolve a cluster.

The exact design of a scaled cluster isn't pinned down yet: I think we need much more data before we can sensibly do it, both on requirements - what's valuable for deployers - and on the scaling characteristics of nova-baremetal/Ironic/Keystone etc.
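To put rough numbers on the image-push side of those targets - this is just back-of-envelope arithmetic in Python, and the 2-seconds-per-machine figure is the hand-wavy one from above, not a measurement:

import math

# Back-of-envelope on the deploy-time targets above. IMAGE_PUSH_SECONDS is the
# rough "couple of seconds per machine" figure from this mail, not a measurement.
IMAGE_PUSH_SECONDS = 2.0

def image_push_budget(nodes, deadline_hours, per_node=IMAGE_PUSH_SECONDS):
    """Return (hours if image pushes were fully serialised, minimum number of
    parallel push streams needed to fit inside the deadline)."""
    serial_hours = nodes * per_node / 3600.0
    streams = max(1, int(math.ceil(serial_hours / deadline_hours)))
    return serial_hours, streams

for nodes, hours in [(1000, 1), (10000, 2)]:
    serial, streams = image_push_budget(nodes, hours)
    print("%5d nodes, %dh budget: %.1fh serialised, >= %d parallel streams"
          % (nodes, hours, serial, streams))

# 1000 nodes, 1h budget: 0.6h serialised, >= 1 parallel streams
# 10000 nodes, 2h budget: 5.6h serialised, >= 3 parallel streams

i.e. even a single-digit number of parallel image streams keeps raw delivery well inside the window, which is why I think the budget is really spent on checking builds and live-migrating workloads, not on the fabric.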
All that said, some specific thoughts on the broad approaches you sketched.

Running all services on all undercloud nodes would drive a lot of complexity in scale-out: there's a lot of state to migrate to new Galera nodes, for instance. I would hesitate to structure the undercloud like that.

I don't really follow some of the discussion in Idea 1, but scaling out the things that need scaling out seems pretty sensible. We have no data yet suggesting how many thousands of machines we'll get per nova-baremetal machine, so it's very hard to say which services will need scaling at which points in time - but clearly we need to support it at some scale. OTOH, once we scale to 'an entire datacentre' the undercloud doesn't need to scale further: I think having each datacentre be a separate deployment cloud makes a lot of sense. Perhaps we should just turn the discussion around and ask: what do we get if we add node type X to an undercloud, what do we get when we add a new undercloud, and what are the implications of each?

Firstly, let's talk big picture: N-datacentre clouds. I think 'build a fabric that clearly exposes performance and failure domains' has been very successful for containing complexity in the fabric and enabling [app] deployers to reason about performance and failure, so we shouldn't try to hide that. If you have two datacentres, that should be two regions, with no shared infrastructure. That immediately implies (at least) one undercloud per datacentre, and separate overclouds too. Do we want IPMI running cross-datacentre? I don't think so - bootstrap each datacentre independently, and once it's running, it's running.

So within a datacentre - let's take HP's new Aurora facility http://www.theregister.co.uk/2012/06/15/hp_aurora_data_centre/ - which is perhaps best thought of as effectively subdivided into 5 cells, each with about 9 tennis courts' worth of servers :) The racks are apparently rated at 10kW, so if we filled it with Moonshots we'd get, oh, 250 servers per rack without running into trouble, and what - 20 racks in a tennis court? So that's 20*9*250, or 45K servers per cell, and 225K in the whole datacentre. Since each cell is self-contained, it would be counterproductive to extend a single overcloud across cells: we need to work with the actual fabric of the DC; instead I think we'd want to treat each cell as a separate DC. That then gives us a goal: support 45K servers in a single 'DC'.

Now, IPMI security and so forth: I don't see any security implications in shuttling IPMI cross-rack. Either IPMI is a secure protocol, or, if it isn't, the problems aren't about sending it cross-rack - they're about machines in the same rack attacking each other. Additionally, to be able to deploy undercloud machines themselves you need a full-HA nova-baremetal with IPMI access, and you make that massively more complex if you partition just some parts of the network but not all of it: you'd need to model that in nova affinity to ensure you schedule deployment nodes into the right area.

This leads me to suggest a very simple design:
- one undercloud per fully-reachable-fabric-of-IPMI-control. Done :)
- we gather data on how performance scales as node counts scale
- we use that data to parameterise how to grow the undercloud control plane for a cloud (there's a toy sketch of what that might look like at the end of this mail)

HTH!
-Rob

--
Robert Collins <[email protected]>
Distinguished Technologist
HP Converged Cloud
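P.S. the Aurora arithmetic above, plus a toy version of the 'parameterise control-plane growth' idea, as a script so the assumptions are easy to poke at. The rack/court/cell counts are the rough figures from this mail; the nodes-per-control-plane-node number is entirely invented - a stand-in for the scaling data we still need to gather.

import math

SERVERS_PER_RACK = 250   # ~10kW racks filled with Moonshot-class kit (rough guess)
RACKS_PER_COURT = 20
COURTS_PER_CELL = 9
CELLS_PER_DC = 5

servers_per_cell = SERVERS_PER_RACK * RACKS_PER_COURT * COURTS_PER_CELL
servers_per_dc = servers_per_cell * CELLS_PER_DC
print("servers per cell: %d, per datacentre: %d" % (servers_per_cell, servers_per_dc))
# servers per cell: 45000, per datacentre: 225000

def control_plane_nodes(overcloud_nodes, nodes_per_control_node=5000, minimum=2):
    """Hypothetical sizing rule: grow the undercloud control plane as a function
    of measured capacity per control-plane node. 5000 is an invented placeholder
    until we have real nova-baremetal/Ironic/Keystone scaling numbers."""
    return max(minimum, int(math.ceil(overcloud_nodes / float(nodes_per_control_node))))

print(control_plane_nodes(10000))             # 2, with the invented placeholder
print(control_plane_nodes(servers_per_cell))  # 9, for a 45K-server cell

Once we have real data the function body obviously changes; the point is just that this is the sort of parameterisation I mean.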
