+1. It would be good to also identify the use cases. I'm surprised that a node should be cleaned up automatically; I would expect that to be a deliberate request from the administrator, or perhaps from the user when they "return" a node to the free pool after bare-metal usage.

Thanks,
Arkady
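For the "deliberate request" case, an operator can already trigger cleaning explicitly with the ironic CLI. A minimal sketch, assuming current python-ironicclient syntax and a working ironic-python-agent ramdisk (the node has to be moved to "manageable" first):

    openstack baremetal node manage <node>
    # metadata-only wipe; a full erase would use the "erase_devices" step instead
    openstack baremetal node clean <node> \
        --clean-steps '[{"interface": "deploy", "step": "erase_devices_metadata"}]'
    # return the node to the available pool once cleaning finishes
    openstack baremetal node provide <node>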
-----Original Message-----
From: Tim Bell [mailto:tim.b...@cern.ch]
Sent: Thursday, April 26, 2018 11:17 AM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [tripleo] ironic automated cleaning by default?

How about asking the operators at the summit Forum or asking on openstack-operators to see what the users think?

Tim

-----Original Message-----
From: Ben Nemec <openst...@nemebean.com>
Reply-To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev@lists.openstack.org>
Date: Thursday, 26 April 2018 at 17:39
To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev@lists.openstack.org>, Dmitry Tantsur <dtant...@redhat.com>
Subject: Re: [openstack-dev] [tripleo] ironic automated cleaning by default?

On 04/26/2018 09:24 AM, Dmitry Tantsur wrote:
> Answering to both James and Ben inline.
>
> On 04/25/2018 05:47 PM, Ben Nemec wrote:
>>
>> On 04/25/2018 10:28 AM, James Slagle wrote:
>>> On Wed, Apr 25, 2018 at 10:55 AM, Dmitry Tantsur
>>> <dtant...@redhat.com> wrote:
>>>> On 04/25/2018 04:26 PM, James Slagle wrote:
>>>>>
>>>>> On Wed, Apr 25, 2018 at 9:14 AM, Dmitry Tantsur <dtant...@redhat.com> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'd like to restart the conversation on enabling node automated cleaning
>>>>>> by default for the undercloud. This process wipes partitioning tables
>>>>>> (optionally, all the data) from overcloud nodes each time they move to
>>>>>> the "available" state (i.e. on initial enrolling and after each tear down).
>>>>>>
>>>>>> We have had it disabled for a few reasons:
>>>>>> - it was not possible to skip the time-consuming wiping of data from disks
>>>>>> - the way our workflows used to work required going between the manageable
>>>>>>   and available steps several times
>>>>>>
>>>>>> However, having cleaning disabled has several issues:
>>>>>> - a configdrive left from a previous deployment may confuse cloud-init
>>>>>> - a bootable partition left from a previous deployment may take precedence
>>>>>>   in some BIOSes
>>>>>> - a UEFI boot partition left from a previous deployment is likely to
>>>>>>   confuse UEFI firmware
>>>>>> - apparently ceph does not work correctly without cleaning (I'll defer to
>>>>>>   the storage team to comment)
>>>>>>
>>>>>> For these reasons we don't recommend having cleaning disabled, and I
>>>>>> propose to re-enable it.
>>>>>>
>>>>>> It has the following drawbacks:
>>>>>> - The default workflow will require another node boot, thus becoming
>>>>>>   several minutes longer (incl. the CI)
>>>>>> - It will no longer be possible to easily restore a deleted overcloud node.
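For context, the toggle being discussed is ironic's automated_clean option, which TripleO exposes through undercloud.conf. A minimal sketch, assuming the option names in use at the time of writing (worth verifying against your release):

    # undercloud.conf (TripleO undercloud)
    [DEFAULT]
    clean_nodes = true

    # or directly in /etc/ironic/ironic.conf on a standalone ironic
    [conductor]
    automated_clean = true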
>>>>>
>>>>> I'm trending towards -1, for these exact reasons you list as drawbacks.
>>>>> There has been no shortage of occurrences of users who have ended up
>>>>> with accidentally deleted overclouds. These are usually caused by user
>>>>> error or unintended/unpredictable Heat operations. Until we have a way
>>>>> to guarantee that Heat will never delete a node, or Heat is entirely
>>>>> out of the picture for Ironic provisioning, I'd prefer that we didn't
>>>>> enable automated cleaning by default.
>>>>>
>>>>> I believe we had done something with policy.json at one time to prevent
>>>>> node delete, but I don't recall if that protected from both user-initiated
>>>>> actions and Heat actions. And even that was not enabled by default.
>>>>>
>>>>> IMO, we need to keep "safe" defaults, even if it means manually
>>>>> documenting that you should clean to prevent the issues you point out
>>>>> above. The alternative is to have no way to recover deleted nodes by
>>>>> default.
>>>>
>>>> Well, it's not clear what is "safe" here: protect people who explicitly
>>>> delete their stacks, or protect people who don't realize that a previous
>>>> deployment may screw up their new one in a subtle way.
>>>
>>> The latter you can recover from; the former you can't if automated
>>> cleaning is true.
>
> Nor can we recover from 'rm -rf / --no-preserve-root', but that's not a
> reason to disable the 'rm' command :)
>
>>>
>>> It's not just about people who explicitly delete their stacks (whether
>>> intentional or not). There could be user error (non-explicit) or
>>> side effects triggered by Heat that could cause nodes to get deleted.
>
> If we have problems with Heat, we should fix Heat or stop using it. What
> you're saying is essentially "we prevent ironic from doing the right
> thing because we're using a tool that can invoke 'rm -rf /' at the wrong
> moment."
>
>>>
>>> You couldn't recover from those scenarios if automated cleaning were
>>> true, whereas you could always fix a deployment error by opting in to
>>> an automated clean. Does Ironic keep track of whether a node has been
>>> previously cleaned? Could we add a validation to check whether any
>>> nodes that were not previously cleaned might be used in the deployment?
>
> It may be possible to figure out if a node was ever cleaned. But then
> we'll force operators to invoke cleaning manually, right? It will work,
> but that's another step in the default workflow. Are you okay with that?
>
>> Is there a way to only do cleaning right before a node is deployed?
>> If you're about to write a new image to the disk then any data there
>> is forfeit anyway. Since the concern is old data on the disk messing
>> up subsequent deploys, it doesn't really matter whether you clean it
>> right after it's deleted or right before it's deployed, but the latter
>> leaves the data intact for longer in case a mistake was made.
>>
>> If that's not possible then consider this an RFE. :-)
>
> It's a good idea, but it may cause problems with rebuilding instances.
> A rebuild is essentially a re-deploy of the OS; users may not expect the
> whole disk to be wiped.
>
> Also, it's unclear whether we want to write additional features to work
> around disabled cleaning.

No matter how good the tooling gets, user error will always be a thing.
Someone will scale down the wrong node or something similar. I think
there's value in allowing recovery from mistakes. We all make them. :-)
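A side note on the "time-consuming wiping" drawback mentioned at the top of the thread: the slow part is the full-disk erase, and ironic can be configured to keep only the fast metadata wipe during automated cleaning. A minimal /etc/ironic/ironic.conf sketch, assuming the clean-step priority options as I recall them (please double-check the exact names and defaults for your release):

    [deploy]
    # 0 disables the long full-disk shred during automated cleaning
    erase_devices_priority = 0
    # keep the quick partition-table/metadata wipe enabled
    erase_devices_metadata_priority = 10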