[openstack-dev] [TripleO] Improving Swift deployments with TripleO
Hello everyone,

I'd like to improve the Swift deployments done by TripleO. There are a few problems today when deployed with the current defaults:

1. Adding new nodes (or replacing existing nodes) is not possible, because the rings are built locally on each host and a new node doesn't know about the "history" of the rings. Therefore the rings might diverge across nodes, and that eventually results in an unusable state.

2. The rings only use a single device, and it seems that this is just a directory and not a mountpoint with a real device. Data is therefore stored on the root device - even if you have 100 TB of disk space in the background. If not fixed manually, your root device will eventually run out of space.

3. Even if a real disk is mounted in /srv/node, replacing a faulty disk is much more troublesome. Normally you would simply unmount a disk and replace it sometime later. But because mount_check is set to False on the storage servers, data is written to the root device in the meantime; and when you finally mount the disk again, you can't simply clean up.

4. In general, it's not possible to change the cluster layout (using different zones/regions/partition powers/device weights, slowly adding new devices to avoid having 25% of the data moved immediately when adding a new node to a small cluster, ...). You could manage your rings manually, but they will eventually be overwritten when you update your overcloud.

5. Missing erasure coding support (or storage policies in general).

This sounds bad; however, most of the current issues can be fixed using customized templates and some tooling to create the rings in advance on the undercloud node. The information about all the devices can be collected from the introspection data, and by using node placement the node names in the rings are known in advance, even if the nodes are not yet powered on. This ensures a consistent ring state, and an operator can modify the rings if needed to customize the cluster layout.
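To put a number on point 4: in a weight-balanced ring, a rebalance hands new devices roughly their share of the total weight. A back-of-the-envelope sketch of that proportionality (illustrative only, not the real ring-builder math):

```python
def fraction_moved(existing_weights, new_weights):
    """Rough estimate of the share of data reassigned when new devices
    join: in a weight-balanced ring each device ends up holding data
    proportional to its weight, so the newcomers together receive
    new/total of all partitions."""
    total = sum(existing_weights) + sum(new_weights)
    return sum(new_weights) / total

# Adding a fourth, equally weighted node to a three-node cluster:
print(fraction_moved([100, 100, 100], [100]))  # 0.25 -> ~25% of the data moves
```

This is why small clusters feel rebalances the hardest: the same single node added to a ten-node cluster would only attract about 9% of the data.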
Using some customized templates we can already do the following:

- disable ring building on the nodes
- create filesystems on the extra block devices
- copy ring files from the undercloud, using pre-built rings
- enable mount_check by default
- (define storage policies if needed)

I started working on a POC using tripleo-quickstart, some custom templates and a small Python tool to build rings based on the introspection data:

https://github.com/cschwede/tripleo-swift-ring-tool

I'd like to get some feedback on the tool and templates.

- Does this make sense to you?
- How (and where) could we integrate this upstream?
- Could the templates be included in tripleo-heat-templates?

IMO the most important change would be to avoid overwriting rings on the overcloud. There is a good chance of messing up your cluster if the template to disable ring building isn't used and you already have working rings in place. The same applies to the mount_check option.

I'm curious about your thoughts!

Thanks,

Christian

-- 
Christian Schwede
Red Hat GmbH, Technopark II, Haus C, Werner-von-Siemens-Ring 11-15, 85630 Grasbrunn
Handelsregister: Amtsgericht Muenchen HRB 153243
Geschaeftsfuehrer: Mark Hegarty, Charlie Peters, Michael Cunningham, Charles Cachera

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
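The ring-building step on the undercloud essentially turns introspected device lists into swift-ring-builder calls. A minimal sketch of that translation - the `nodes` input layout here is a made-up illustration, not the exact format tripleo-swift-ring-tool consumes:

```python
def ring_commands(builder_file, part_power, replicas, nodes):
    """Translate per-node device data (as it could be derived from Ironic
    introspection) into swift-ring-builder calls: create the builder,
    add one entry per device, then rebalance."""
    cmds = ["swift-ring-builder %s create %d %d 1"
            % (builder_file, part_power, replicas)]
    for name in sorted(nodes):
        node = nodes[name]
        for dev in node["devices"]:
            # Device syntax: r<region>z<zone>-<ip>:<port>/<device> <weight>
            cmds.append("swift-ring-builder %s add r%dz%d-%s:6000/%s 100"
                        % (builder_file, node["region"], node["zone"],
                           node["ip"], dev))
    cmds.append("swift-ring-builder %s rebalance" % builder_file)
    return cmds

cmds = ring_commands("object.builder", 10, 3, {
    "overcloud-objectstorage-0": {
        "ip": "172.16.0.10", "region": 1, "zone": 1,
        "devices": ["sdb", "sdc"]}})
print("\n".join(cmds))
```

Running the generated commands on the undercloud keeps a single, consistent ring history, which is exactly what building rings per-node cannot guarantee.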
Re: [openstack-dev] [TripleO] Improving Swift deployments with TripleO
On Tue, Aug 02, 2016 at 09:36:45PM +0200, Christian Schwede wrote:
> Hello everyone,
>
> I'd like to improve the Swift deployments done by TripleO. There are a few problems today when deployed with the current defaults:

Thanks for digging into this, I'm aware this has been something of a known issue for some time, so it's great to see it getting addressed :)

Some comments inline;

> 1. Adding new nodes (or replacing existing nodes) is not possible, because the rings are built locally on each host and a new node doesn't know about the "history" of the rings. Therefore rings might become different on the nodes, and that results in an unusable state eventually.
>
> 2. The rings are only using a single device, and it seems that this is just a directory and not a mountpoint with a real device. Therefore data is stored on the root device - even if you have 100TB disk space in the background. If not fixed manually your root device will run out of space eventually.
>
> 3. Even if a real disk is mounted in /srv/node, replacing a faulty disk is much more troublesome. Normally you would simply unmount a disk, and then replace the disk sometime later. But because mount_check is set to False in the storage servers data will be written to the root device in the meantime; and when you finally mount the disk again, you can't simply cleanup.
>
> 4. In general, it's not possible to change cluster layout (using different zones/regions/partition power/device weight, slowly adding new devices to avoid 25% of the data will be moved immediately when adding new nodes to a small cluster, ...). You could manually manage your rings, but they will be overwritten finally when updating your overcloud.
>
> 5. Missing erasure coding support (or storage policies in general)
>
> This sounds bad, however most of the current issues can be fixed using customized templates and some tooling to create the rings in advance on the undercloud node.
> The information about all the devices can be collected from the introspection data, and by using node placement the nodenames in the rings are known in advance if the nodes are not yet powered on. This ensures a consistent ring state, and an operator can modify the rings if needed and to customize the cluster layout.
>
> Using some customized templates we can already do the following:
> - disable ring building on the nodes
> - create filesystems on the extra blockdevices
> - copy ringfiles from the undercloud, using pre-built rings
> - enable mount_check by default
> - (define storage policies if needed)
>
> I started working on a POC using tripleo-quickstart, some custom templates and a small Python tool to build rings based on the introspection data:
>
> https://github.com/cschwede/tripleo-swift-ring-tool
>
> I'd like to get some feedback on the tool and templates.
>
> - Does this make sense to you?

Yes, I think the basic workflow described should work, and it's good to see that you're passing the ring data via swift, as this is consistent with how we already pass some data to nodes via our DeployArtifacts interface:

https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/deploy-artifacts.yaml

Note however that there are no credentials to access the undercloud swift on the nodes, so you'll need to pass a tempurl reference in (which is what we do for deploy artifacts; obviously you will have credentials to create the container & tempurl on the undercloud).

One slight concern I have is mandating the use of predictable placement - it'd be nice to think about ways we might avoid that, but the undercloud-centric approach seems OK for a first pass (in either case I think the delivery via swift will be the same).

> - How (and where) could we integrate this upstream?
So I think the DeployArtifacts interface may work for this, and we have a helper script that can upload data to swift:

https://github.com/openstack/tripleo-common/blob/master/scripts/upload-swift-artifacts

This basically pushes a tarball to swift, creates a tempurl, then creates a file ($HOME/.tripleo/environments/deployment-artifacts.yaml) which is automatically read by tripleoclient on deployment.

DeployArtifactURLs is already a list, but we'll need to test and confirm we can pass both e.g. swift ring data and updated puppet modules at the same time.

The part that actually builds the rings on the undercloud will probably need to be created as a custom mistral action:

https://github.com/openstack/tripleo-common/tree/master/tripleo_common/actions

These are then driven as part of the deployment workflow (although the final workflow where this will wire in hasn't yet landed):

https://review.openstack.org/#/c/298732/

> - Templates might be included in tripleo-heat-templates?

Yes, although by the look of it there may be few template changes required. If you want to remove the current ringbuilder puppet step completely, you can simply remove OS::TripleO::Serv
Re: [openstack-dev] [TripleO] Improving Swift deployments with TripleO
Thanks Steven for your feedback! Please see my answers inline.

On 02.08.16 23:46, Steven Hardy wrote:
> On Tue, Aug 02, 2016 at 09:36:45PM +0200, Christian Schwede wrote:
>> Hello everyone,
>>
>> I'd like to improve the Swift deployments done by TripleO. There are a few problems today when deployed with the current defaults:
>
> Thanks for digging into this, I'm aware this has been something of a known issue for some time, so it's great to see it getting addressed :)
>
> Some comments inline;
>
>> 1. Adding new nodes (or replacing existing nodes) is not possible, because the rings are built locally on each host and a new node doesn't know about the "history" of the rings. Therefore rings might become different on the nodes, and that results in an unusable state eventually.
>>
>> 2. The rings are only using a single device, and it seems that this is just a directory and not a mountpoint with a real device. Therefore data is stored on the root device - even if you have 100TB disk space in the background. If not fixed manually your root device will run out of space eventually.
>>
>> 3. Even if a real disk is mounted in /srv/node, replacing a faulty disk is much more troublesome. Normally you would simply unmount a disk, and then replace the disk sometime later. But because mount_check is set to False in the storage servers data will be written to the root device in the meantime; and when you finally mount the disk again, you can't simply cleanup.
>>
>> 4. In general, it's not possible to change cluster layout (using different zones/regions/partition power/device weight, slowly adding new devices to avoid 25% of the data will be moved immediately when adding new nodes to a small cluster, ...). You could manually manage your rings, but they will be overwritten finally when updating your overcloud.
>>
>> 5. Missing erasure coding support (or storage policies in general)
>>
>> This sounds bad, however most of the current issues can be fixed using customized templates and some tooling to create the rings in advance on the undercloud node.
>>
>> The information about all the devices can be collected from the introspection data, and by using node placement the nodenames in the rings are known in advance if the nodes are not yet powered on. This ensures a consistent ring state, and an operator can modify the rings if needed and to customize the cluster layout.
>>
>> Using some customized templates we can already do the following:
>> - disable ring building on the nodes
>> - create filesystems on the extra blockdevices
>> - copy ringfiles from the undercloud, using pre-built rings
>> - enable mount_check by default
>> - (define storage policies if needed)
>>
>> I started working on a POC using tripleo-quickstart, some custom templates and a small Python tool to build rings based on the introspection data:
>>
>> https://github.com/cschwede/tripleo-swift-ring-tool
>>
>> I'd like to get some feedback on the tool and templates.
>>
>> - Does this make sense to you?
>
> Yes, I think the basic workflow described should work, and it's good to see that you're passing the ring data via swift as this is consistent with how we already pass some data to nodes via our DeployArtifacts interface:
>
> https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/deploy-artifacts.yaml
>
> Note however that there are no credentials to access the undercloud swift on the nodes, so you'll need to pass a tempurl reference in (which is what we do for deploy artifacts, obviously you will have credentials to create the container & tempurl on the undercloud).

Ah, that's very useful! I updated my POC; it means one less customized template and less code to support in the Python tool. Works as expected!
> One slight concern I have is mandating the use of predictable placement - it'd be nice to think about ways we might avoid that but the undercloud centric approach seems OK for a first pass (in either case I think the delivery via swift will be the same).

Do you mean the predictable artifact filename? We could just add a randomized prefix to the filename IMO.

>> - How (and where) could we integrate this upstream?
>
> So I think the DeployArtifacts interface may work for this, and we have a helper script that can upload data to swift:
>
> https://github.com/openstack/tripleo-common/blob/master/scripts/upload-swift-artifacts
>
> This basically pushes a tarball to swift, creates a tempurl, then creates a file ($HOME/.tripleo/environments/deployment-artifacts.yaml) which is automatically read by tripleoclient on deployment.
>
> DeployArtifactURLs is already a list, but we'll need to test and confirm we can pass both e.g. swift ring data and updated puppet modules at the same time.

If I see this correctly, the artifacts are deployed just before Puppet runs, and the Swift rings don't affect the Puppet modules, so that should be fine?
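For reference, the tempurl mechanism discussed above is simple to reproduce: Swift's tempurl middleware validates an HMAC-SHA1 over the method, expiry timestamp and object path, signed with the account's temp-url key. A sketch (the account, container and key values are made up for illustration):

```python
import hmac
from hashlib import sha1

def make_tempurl(path, key, expires, method="GET"):
    """Build a Swift tempurl query string: the tempurl middleware checks
    an HMAC-SHA1 over "<method>\n<expires>\n<path>" computed with the
    account's X-Account-Meta-Temp-URL-Key."""
    hmac_body = "%s\n%d\n%s" % (method, expires, path)
    sig = hmac.new(key.encode(), hmac_body.encode(), sha1).hexdigest()
    return "%s?temp_url_sig=%s&temp_url_expires=%d" % (path, sig, expires)

url = make_tempurl("/v1/AUTH_abc/overcloud/swift-rings.tar.gz",
                   "secret-key", 1470000000)
print(url)
```

A node can then fetch the ring tarball with a plain HTTP GET on that URL, without holding any undercloud credentials.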
Re: [openstack-dev] [TripleO] Improving Swift deployments with TripleO
On 08/02/2016 09:36 PM, Christian Schwede wrote:
> Hello everyone,

thanks Christian,

> I'd like to improve the Swift deployments done by TripleO. There are a few problems today when deployed with the current defaults:
>
> 1. Adding new nodes (or replacing existing nodes) is not possible, because the rings are built locally on each host and a new node doesn't know about the "history" of the rings. Therefore rings might become different on the nodes, and that results in an unusable state eventually.

one of the ideas for this was to use a tempurl in the undercloud swift where to upload the rings built by a single overcloud node, not by the undercloud

so I proposed a new heat resource which would permit us to create a swift tempurl in the undercloud during the deployment

https://review.openstack.org/#/c/350707/

if we build the rings on the undercloud we can ignore this and use a mistral action instead, as pointed out by Steven

the good thing about building rings in the overcloud is that it doesn't force us to have a static node mapping for each and every deployment, but it makes it hard to cope with heterogeneous environments

> 2. The rings are only using a single device, and it seems that this is just a directory and not a mountpoint with a real device. Therefore data is stored on the root device - even if you have 100TB disk space in the background. If not fixed manually your root device will run out of space eventually.
for the disks instead I am thinking to add a create_resources wrapper in puppet-swift:

https://review.openstack.org/#/c/350790
https://review.openstack.org/#/c/350840/

so that we can pass per-node swift::storage::disks maps via hieradata

we have a mechanism to push per-node hieradata based on the system uuid; we could extend the tool to capture the nodes' (system) uuid and generate per-node maps

then, with the above puppet changes and having the per-node map and the rings download url, we could feed them to the templates, replace the ring-building implementation with an environment, and deploy without further customizations

what do you think?
-- 
Giulio Fidente
GPG KEY: 08D733BA | IRC: gfidente
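Generating such per-node hieradata from the introspection data could look roughly like the following sketch; the exact key layout expected by the proposed create_resources wrapper is an assumption here, and JSON is used simply because it is valid hiera input:

```python
import json

def per_node_hieradata(system_uuid, block_devices):
    """Emit a per-node hieradata snippet with a swift::storage::disks
    hash keyed by device name. Filename follows the uppercase-uuid
    convention used for per-node hieradata in TripleO."""
    disks = {dev: {"device": "/dev/" + dev, "mount_check": True}
             for dev in block_devices}
    filename = "%s.json" % system_uuid.upper()
    return filename, json.dumps({"swift::storage::disks": disks},
                                indent=2, sort_keys=True)

name, content = per_node_hieradata(
    "32e87b4c-c4a7-41be-865b-191684a6883b", ["sdb", "sdc"])
print(name)
print(content)
```

The same generator could iterate over all introspected nodes and write one file per system uuid, so no fact filtering has to happen on the nodes themselves.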
Re: [openstack-dev] [TripleO] Improving Swift deployments with TripleO
On 04.08.16 10:27, Giulio Fidente wrote:
> On 08/02/2016 09:36 PM, Christian Schwede wrote:
>> Hello everyone,
>
> thanks Christian,
>
>> I'd like to improve the Swift deployments done by TripleO. There are a few problems today when deployed with the current defaults:
>>
>> 1. Adding new nodes (or replacing existing nodes) is not possible, because the rings are built locally on each host and a new node doesn't know about the "history" of the rings. Therefore rings might become different on the nodes, and that results in an unusable state eventually.
>
> one of the ideas for this was to use a tempurl in the undercloud swift where to upload the rings built by a single overcloud node, not by the undercloud
>
> so I proposed a new heat resource which would permit us to create a swift tempurl in the undercloud during the deployment
>
> https://review.openstack.org/#/c/350707/
>
> if we build the rings on the undercloud we can ignore this and use a mistral action instead, as pointed by Steven
>
> the good thing about building rings in the overcloud is that it doesn't force us to have a static node mapping for each and every deployment but it makes hard to cope with heterogeneous environments

That's true. However - we still need to collect the device data from all the nodes on the undercloud, push it to at least one overcloud node, build/update the rings there, push them to the undercloud Swift, and use that on all overcloud nodes. Or not?

That leaves some room for new inconsistencies IMO: how do we ensure that the overcloud node starts with the latest ring? Also, ring building has to be limited to one overcloud node, otherwise we might end up with multiple ring-building nodes. And how can an operator manually modify the rings?

The tool to build the rings on the undercloud could be further improved later; for example, I'd like to be able to move data to new nodes slowly over time, and also query existing storage servers about the progress.
Therefore we need more functionality than is currently available in the ring-building part of puppet-swift, IMO. I think if we move this step to the undercloud we could solve a lot of these challenges in a consistent way. WDYT?

I was also thinking more about the static node mapping and how to avoid it. Could we add a host alias using the node UUIDs? That would never change, it's available from the introspection data, and it could therefore be used in the rings.

http://docs.openstack.org/developer/tripleo-docs/advanced_deployment/node_specific_hieradata.html#collecting-the-node-uuid

>> 2. The rings are only using a single device, and it seems that this is just a directory and not a mountpoint with a real device. Therefore data is stored on the root device - even if you have 100TB disk space in the background. If not fixed manually your root device will run out of space eventually.
>
> for the disks instead I am thinking to add a create_resources wrapper in puppet-swift:
>
> https://review.openstack.org/#/c/350790
> https://review.openstack.org/#/c/350840/
>
> so that we can pass via hieradata per-node swift::storage::disks maps
>
> we have a mechanism to push per-node hieradata based on the system uuid, we could extend the tool to capture the nodes (system) uuid and generate per-node maps

Awesome, thanks Giulio! I will test that today. So the tool could generate the mapping automatically, and we don't need to filter puppet facts on the nodes themselves. Nice!

> then, with the above puppet changes and having the per-node map and the rings download url, we could feed them to the templates, replace with an environment the rings building implementation and deploy without further customizations
>
> what do you think?

Yes, that sounds like a good plan to me. I'll continue working on the ringbuilder tool for now and see how I can integrate this into the Mistral actions.
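The "move data to new nodes slowly" idea mentioned earlier amounts to raising a new device's weight in steps and rebalancing between steps, so each round only attracts a bounded slice of data. A sketch of such a schedule (the 20% step size is an arbitrary example, not a tool default):

```python
def weight_ramp(target_weight, step_pct=20):
    """Return the intermediate weights for gradually adding a device:
    each entry is the weight to set before the next rebalance, capped
    at the final target weight."""
    steps, current = [], 0.0
    while current < target_weight:
        current = min(target_weight,
                      current + target_weight * step_pct / 100.0)
        steps.append(round(current, 2))
    return steps

print(weight_ramp(100))  # [20.0, 40.0, 60.0, 80.0, 100.0]
```

An undercloud-side tool could walk this schedule automatically, advancing one step whenever replication on the existing storage servers has caught up.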
-- 
Christian
Re: [openstack-dev] [TripleO] Improving Swift deployments with TripleO
On 08/04/2016 01:26 PM, Christian Schwede wrote:
> On 04.08.16 10:27, Giulio Fidente wrote:
>> On 08/02/2016 09:36 PM, Christian Schwede wrote:
>>> Hello everyone,
>>
>> thanks Christian,
>>
>>> I'd like to improve the Swift deployments done by TripleO. There are a few problems today when deployed with the current defaults:
>>>
>>> 1. Adding new nodes (or replacing existing nodes) is not possible, because the rings are built locally on each host and a new node doesn't know about the "history" of the rings. Therefore rings might become different on the nodes, and that results in an unusable state eventually.
>>
>> one of the ideas for this was to use a tempurl in the undercloud swift where to upload the rings built by a single overcloud node, not by the undercloud
>>
>> so I proposed a new heat resource which would permit us to create a swift tempurl in the undercloud during the deployment
>>
>> https://review.openstack.org/#/c/350707/
>>
>> if we build the rings on the undercloud we can ignore this and use a mistral action instead, as pointed by Steven
>>
>> the good thing about building rings in the overcloud is that it doesn't force us to have a static node mapping for each and every deployment but it makes hard to cope with heterogeneous environments
>
> That's true. However - we still need to collect the device data from all the nodes from the undercloud, push it to at least one overcloud node, build/update the rings there, push it to the undercloud Swift and use that on all overcloud nodes. Or not?

sure, let's build on the undercloud; when automated with mistral it shouldn't make a big difference for the user

> I was also thinking more about the static node mapping and how to avoid this. Could we add a host alias using the node UUIDs? That would never change, it's available from the introspection data and therefore could be used in the rings.
>
> http://docs.openstack.org/developer/tripleo-docs/advanced_deployment/node_specific_hieradata.html#collecting-the-node-uuid

right, this is the mechanism I wanted to use to provide per-node disk maps; it's how it works for ceph disks as well

>>> 2. The rings are only using a single device, and it seems that this is just a directory and not a mountpoint with a real device. Therefore data is stored on the root device - even if you have 100TB disk space in the background. If not fixed manually your root device will run out of space eventually.
>>
>> for the disks instead I am thinking to add a create_resources wrapper in puppet-swift:
>>
>> https://review.openstack.org/#/c/350790
>> https://review.openstack.org/#/c/350840/
>>
>> so that we can pass via hieradata per-node swift::storage::disks maps
>>
>> we have a mechanism to push per-node hieradata based on the system uuid, we could extend the tool to capture the nodes (system) uuid and generate per-node maps
>
> Awesome, thanks Giulio! I will test that today. So the tool could generate the mapping automatically, and we don't need to filter puppet facts on the nodes itself. Nice!

and we could re-use the same tool to generate the ceph::osds disk maps as well :)
-- 
Giulio Fidente
GPG KEY: 08D733BA | IRC: gfidente
Re: [openstack-dev] [TripleO] Improving Swift deployments with TripleO
On 04.08.16 15:39, Giulio Fidente wrote:
> On 08/04/2016 01:26 PM, Christian Schwede wrote:
>> On 04.08.16 10:27, Giulio Fidente wrote:
>>> On 08/02/2016 09:36 PM, Christian Schwede wrote:
>>>> Hello everyone,
>>>
>>> thanks Christian,
>>>
>>>> I'd like to improve the Swift deployments done by TripleO. There are a few problems today when deployed with the current defaults:
>>>>
>>>> 1. Adding new nodes (or replacing existing nodes) is not possible, because the rings are built locally on each host and a new node doesn't know about the "history" of the rings. Therefore rings might become different on the nodes, and that results in an unusable state eventually.
>>>
>>> one of the ideas for this was to use a tempurl in the undercloud swift where to upload the rings built by a single overcloud node, not by the undercloud
>>>
>>> so I proposed a new heat resource which would permit us to create a swift tempurl in the undercloud during the deployment
>>>
>>> https://review.openstack.org/#/c/350707/
>>>
>>> if we build the rings on the undercloud we can ignore this and use a mistral action instead, as pointed by Steven
>>>
>>> the good thing about building rings in the overcloud is that it doesn't force us to have a static node mapping for each and every deployment but it makes hard to cope with heterogeneous environments
>>
>> That's true. However - we still need to collect the device data from all the nodes from the undercloud, push it to at least one overcloud node, build/update the rings there, push it to the undercloud Swift and use that on all overcloud nodes. Or not?
>
> sure, let's build on the undercloud, when automated with mistral it shouldn't make a big difference for the user
>
>> I was also thinking more about the static node mapping and how to avoid this. Could we add a host alias using the node UUIDs? That would never change, it's available from the introspection data and therefore could be used in the rings.
>>
>> http://docs.openstack.org/developer/tripleo-docs/advanced_deployment/node_specific_hieradata.html#collecting-the-node-uuid
>
> right, this is the mechanism I wanted to use to provide per-node disk maps, it's how it works for ceph disks as well

I looked into this further and proposed a patch upstream:

https://review.openstack.org/358643

This worked fine in my tests; an example /etc/hosts looks like this:

http://paste.openstack.org/show/562206/

And based on that patch we could build the Swift rings even if the nodes are down and have never been deployed, because the system uuid never changes and is unique.

I updated my tripleo-swift-ring-tool and just ran a successful deployment with the patch (also using the merged patches from Giulio). Let me know what you think about it - I think with that patch we could integrate the tripleo-swift-ring-tool.

-- 
Christian

>>>> 2. The rings are only using a single device, and it seems that this is just a directory and not a mountpoint with a real device. Therefore data is stored on the root device - even if you have 100TB disk space in the background. If not fixed manually your root device will run out of space eventually.
>>>
>>> for the disks instead I am thinking to add a create_resources wrapper in puppet-swift:
>>>
>>> https://review.openstack.org/#/c/350790
>>> https://review.openstack.org/#/c/350840/
>>>
>>> so that we can pass via hieradata per-node swift::storage::disks maps
>>>
>>> we have a mechanism to push per-node hieradata based on the system uuid, we could extend the tool to capture the nodes (system) uuid and generate per-node maps
>>
>> Awesome, thanks Giulio! I will test that today. So the tool could generate the mapping automatically, and we don't need to filter puppet facts on the nodes itself. Nice!
>
> and we could re-use the same tool to generate the ceph::osds disk maps as well :)
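The host-alias approach from the patch discussed above boils down to rendering /etc/hosts entries that carry the baremetal system uuid as a stable alias next to the usual overcloud hostname. A sketch - the exact line format is guessed from the thread, not copied from the proposed patch or the linked paste:

```python
def etc_hosts_lines(nodes):
    """Render /etc/hosts lines: IP, the regular overcloud hostname, and
    the system uuid as an extra alias that survives redeployments."""
    return ["%s %s %s" % (ip, hostname, uuid)
            for hostname, (ip, uuid) in sorted(nodes.items())]

for line in etc_hosts_lines({
        "overcloud-objectstorage-0":
            ("172.16.0.10", "32E87B4C-C4A7-41BE-865B-191684A6883B")}):
    print(line)
```

Because the uuid alias resolves on every node, rings built against the uuid names stay valid even when nodes are renamed or redeployed.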