Firstly, thanks Emilien for starting this discussion. I revised the subject in an effort to get wider feedback; apologies for my delay in responding.
On Wed, Sep 09, 2015 at 11:34:26AM -0400, Zane Bitter wrote:
> On 24/08/15 15:12, Emilien Macchi wrote:
> > Hi,
> >
> > So I've been working on OpenStack deployments for 4 years now and so far RDO Manager is the second installer - after SpinalStack [1] - I'm working on.
> >
> > SpinalStack already had interesting features [2] that allowed us to upgrade our customer platforms almost every month, with full testing and automation.
> >
> > Now that we have RDO Manager, I would be happy to share my little experience on the topic and help to make it possible in the next cycle.
> >
> > For that, I created an etherpad [3], which is not too long and is focused on basic topics for now. This is technical and focused on infrastructure upgrade automation.
> >
> > Feel free to continue the discussion on this thread or directly in the etherpad.
> >
> > [1] http://spinalstack.enovance.com
> > [2] http://spinalstack.enovance.com/en/latest/dev/upgrade.html
> > [3] https://etherpad.openstack.org/p/rdo-manager-upgrades
>
> I added some notes on the etherpad, but I think this discussion poses a larger question: what is TripleO? Why are we using Heat? Because to me the major benefit of Heat is that it maintains a record of the current state of the system that can be used to manage upgrades. And if we're not going to make use of that - if we're going to determine the state of the system by introspecting nodes and update it by using Ansible scripts without Heat's knowledge - then we probably shouldn't be using Heat at all.

So I think we should definitely learn from successful implementations such as SpinalStack's, but given the way TripleO is currently implemented (e.g. primarily orchestrating software configuration via Heat), and the philosophy behind the project, I think it would be good to focus mostly on *what* needs to be done and not too much on *how* in terms of tooling at this point, and definitely not to assume any up-front requirement for additional CM tooling.

A massive part of the value of TripleO, IMHO, is using OpenStack-native tooling wherever possible (even if that means working to improve the tools for all users/use-cases), and I do think that, just like orchestrating the initial deployment, this *is* possible via Heat SoftwareDeployments. There is also an external workflow component, which is likely to be satisfied via tripleo-common (short term) and probably Mistral (longer term).

> I'm not saying that to close off the option - I think if Heat is not the best tool for the job then we should definitely consider other options. And right now it really is not the best tool for the job. Adopting Puppet (which was a necessary choice IMO) has meant that the responsibility for what I call "software orchestration"[1] is split awkwardly between Puppet and Heat. For example, the Puppet manifests are baked in to images on the servers, so Heat doesn't know when they've changed and can't retrigger Puppet to update the configuration when they do. We're left trying to reverse-engineer what is supposed to be a declarative model from the workflow that we want for things like updates/upgrades.

I don't really agree with this at all, tbh - the puppet *modules* are by default distributed in the images, but any update to them is deployed either via an RPM update (which Heat detects, provided it's applied via the OS::TripleO::Tasks::PackageUpdate [1] interface, so puppet *can* be correctly reapplied), or potentially via rsync [2] in future. A unique identifier is all that's required to wire in puppet getting reapplied, via NodeConfigIdentifiers [3].

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/overcloud-resource-registry-puppet.yaml#L24
[2] https://github.com/openstack/tripleo-heat-templates/blob/master/firstboot/userdata_dev_rsync.yaml
[3] https://github.com/openstack/tripleo-heat-templates/blob/master/overcloud-without-mergepy.yaml#L1262

The puppet *manifests* are distributed via Heat, so any update to those will trigger Heat to reapply the manifest, the same as any change to a SoftwareConfig resource's config definition.

I actually think we've ended up with a pretty clear split in responsibility between puppet and Heat: Heat does the orchestration and puts data in place to be consumed by puppet, which then owns all aspects of the software configuration.
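To make that split concrete, here's a minimal sketch of the kind of wiring being discussed, assuming the standard heat-config puppet hook; the parameter names and manifest path are illustrative only, not the actual tripleo-heat-templates definitions. Because the identifier is passed as a deployment input, bumping it on a stack-update is enough to make Heat re-run the puppet apply on the node, and editing the manifest itself (delivered by Heat via get_file) has the same effect:

heat_template_version: 2015-04-30

parameters:
  server_id:
    type: string
    description: Nova server ID of the node to configure
  config_identifier:
    type: string
    description: opaque value (e.g. a hash/timestamp) bumped when packages or manifests change
    default: ''

resources:
  puppet_config:
    type: OS::Heat::SoftwareConfig
    properties:
      group: puppet
      inputs:
        - name: deploy_identifier
      # The manifest is delivered by Heat, so changing it also triggers a redeploy
      config: {get_file: manifests/example_service.pp}

  puppet_deployment:
    type: OS::Heat::SoftwareDeployment
    properties:
      server: {get_param: server_id}
      config: {get_resource: puppet_config}
      input_values:
        # Changing this value on stack-update forces the deployment to re-run
        deploy_identifier: {get_param: config_identifier}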
> That said, I think there's still some cause for optimism: in a world where every service is deployed in a container and every container has its own Heat SoftwareDeployment, the boundary between Heat's responsibilities and Puppet's would be much clearer. The deployment could conceivably fit a declarative model much better, and even offer a lot of flexibility in which services run on which nodes. We won't really know until we try, but it seems distinctly possible to aspire toward Heat actually making things easier rather than just not making them too much harder. And there is stuff on the long-term roadmap that could be really great if only we had time to devote to it - for example, as I mentioned in the etherpad, I'd love to get Heat's user hooks integrated with Mistral so that we could have fully-automated, highly-available (in a hypothetical future HA undercloud) live migration of workloads off compute nodes during updates.

Yup, definitely - as we move closer towards more granular role definitions and particularly container integration, I think the value of the Heat declarative model, composability, and built-in integration with other OpenStack services will provide more obvious benefits vs tools geared solely towards software configuration.

> In the meantime, however, I do think that we have all the tools in Heat that we need to cobble together what we need to do. In Liberty, Heat supports batched rolling updates of ResourceGroups, so we won't need to use user hooks to cobble together poor-man's batched update support any more. We can use the user hooks for their intended purpose of notifying the client when to live-migrate compute workloads off a server that is about to be upgraded. The Heat templates should already tell us exactly which services are running on which nodes. We can trigger particular software deployments on a stack update with a parameter value change (as we already do with the yum update deployment). For operations that happen in isolation on a single server, we can model them as SoftwareDeployment resources within the individual server templates. For operations that are synchronised across a group of servers (e.g. disabling services on the controller nodes in preparation for a DB migration) we can model them as a SoftwareDeploymentGroup resource in the parent template. And for chaining multiple sequential operations (e.g. disable services, migrate database, enable services), we can chain outputs to inputs to handle both ordering and triggering. I'm sure there will be many subtleties, but I don't think we *need* Ansible in the mix.

+1 - while I get that Ansible is a popular tool, given the current TripleO implementation I don't think it's *needed* to orchestrate updates or upgrades, and there are advantages to keeping the state associated with cluster-wide operations inside Heat.

I know from talking with Emilien that one aspect of SpinalStack's update workflow we don't currently capture is the step of determining what is about to be updated, then calculating a workflow associated with e.g. restarting services in the right order. It'd be interesting to figure out how that might be wired in via the current Heat model, and maybe prototype something which mimics what SpinalStack did via Ansible.
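As a rough illustration of the pattern Zane describes (all parameter and resource names below are made up for the example, not actual tripleo-heat-templates content), a synchronised cluster-wide step can be modelled as a SoftwareDeploymentGroup that re-runs whenever an identifier parameter changes, with a single-node follow-up step ordered after it declaratively:

heat_template_version: 2015-04-30

parameters:
  controller_servers:
    type: json
    description: map of logical name to Nova server ID for the controller role
  bootstrap_server:
    type: string
    description: server ID of the node that should run one-off steps
  update_identifier:
    type: string
    description: bump this on stack-update to re-trigger the steps below
    default: ''

resources:
  disable_services_config:
    type: OS::Heat::SoftwareConfig
    properties:
      group: script
      inputs:
        - name: update_identifier
      config: |
        #!/bin/bash
        # placeholder: stop API services ahead of a DB migration
        echo "disabling services on $(hostname)"

  # Synchronised across all controllers (the SoftwareDeploymentGroup case)
  disable_services:
    type: OS::Heat::SoftwareDeploymentGroup
    properties:
      config: {get_resource: disable_services_config}
      servers: {get_param: controller_servers}
      input_values:
        update_identifier: {get_param: update_identifier}

  migrate_db_config:
    type: OS::Heat::SoftwareConfig
    properties:
      group: script
      config: |
        #!/bin/bash
        # placeholder: run the schema migrations on a single node
        echo "migrating database"

  # Sequential follow-up step; ordering is expressed declaratively
  migrate_db:
    type: OS::Heat::SoftwareDeployment
    depends_on: disable_services
    properties:
      config: {get_resource: migrate_db_config}
      server: {get_param: bootstrap_server}

Batching across a ResourceGroup of nodes would similarly be expressed via the rolling_update update_policy added in Liberty, rather than being scripted outside of Heat.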
> So it's really up to the wider TripleO project team to decide which path to go down. I am genuinely not bothered whether we choose Heat or Ansible. There may even be ways they can work together without compromising either model. But I would be pretty uncomfortable with a mix where we use Heat for deployment and Ansible for doing upgrades behind Heat's back.

Perhaps it'd be helpful to work up a couple of specs (or just one which covers both) defining:

1. A strategy for updates (defined as all incremental updates *not* requiring any changes to DB schema or RPC version, e.g. consuming stable-branch updates).

2. How we deal with (and test) upgrades (e.g. moving from Kilo to Liberty, where there are requirements to make DB schema and RPC version changes, and not all services yet support the more advanced upgrade models implemented by e.g. Nova).

Cheers,

Steve