[Bug 1778704] Re: redeployment of node with bcache fails
** Changed in: curtin
       Status: Incomplete => Invalid

** Changed in: curtin (Ubuntu)
       Status: New => Invalid
[Bug 1778704] Re: redeployment of node with bcache fails
> I assume curtin only does the wipefs when the bcache devices are attached, correct?

Yes; curtin probes the disks that are present and activates them as needed. Bcache is activated by the kernel itself if the metadata header is present on any block device. If a disk is not attached when curtin is running, then curtin cannot clear it.

However, I'm not sure that's what we're seeing. Unless the bcache disks are somehow not present during the initial portion of curtin which wipes devices (delayed or hotplugged, perhaps?), curtin should see that there is a bcache device on top of the raided partitions, and it will stop and wipe that bcache device.

The error message you see happens after wiping is complete: curtin expects that /dev/bcache0 should NOT already exist when it is attempting to create a new bcache. Curtin validates that /dev/bcacheN is composed of the devices in its configuration, and when it encounters an existing bcache backing device before it issues the creation command, it fails with a RuntimeError.

The only way I can think this would occur is if the devices are not yet available at the time we start clearing them. If the raid array were damaged and not yet complete, it's possible that the bcache metadata is not yet restored, so initially we only have the raid arrays; when they finish rebuilding, the bcache layer may get activated at that point and we would not have wiped the bcache metadata. However, I do not think curtin would allow this, as it would have stopped the raid arrays and wiped the contents of the composed devices (/dev/md0, etc.) as well as the underlying devices.
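For illustration, here is a minimal sketch of how a leftover bcache registration on a backing device can be detected through sysfs before a new bcache is created. This is not curtin's actual code; the sysfs layout used here and the /dev/md2 example are assumptions based on the storage layout in this bug.

```python
# Sketch only: detect a leftover bcache registration on a backing device.
# Assumes the usual bcache sysfs layout (/sys/class/block/<dev>/bcache/dev
# pointing at the holding bcacheN device); device names are hypothetical.
import os

def existing_bcache_holder(backing_dev):
    """Return the bcacheN device holding backing_dev, or None if unregistered."""
    name = os.path.basename(backing_dev)                 # e.g. 'md2'
    bcache_dir = '/sys/class/block/%s/bcache' % name
    if not os.path.isdir(bcache_dir):
        return None                                      # kernel has not claimed it
    dev_link = os.path.join(bcache_dir, 'dev')
    return os.path.basename(os.path.realpath(dev_link))  # e.g. 'bcache0'

holder = existing_bcache_holder('/dev/md2')
if holder:
    # Roughly the condition behind the error reported in this bug.
    raise RuntimeError("Unexpected old bcache device: /dev/md2 (held by %s)" % holder)
```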
[Bug 1778704] Re: redeployment of node with bcache fails
Ryan, I did some more testing, and it seems the issue originates from how I do storage preparation with the vendor's proprietary tool: during commissioning I reconfigure all the RAID controllers, during which the drives get disconnected and re-plugged later. The result is that curtin does not see the bcache drives as attached, but the physical layout on the drives remains.

I assume curtin only does the wipefs when the bcache devices are attached, correct?

What I did for now, maybe just as a workaround, is to wipe the drives in the drive-configuration commissioning script, as sketched below.
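For reference, a minimal sketch of what such a commissioning-time wipe might look like; the device globs and the use of wipefs are assumptions for illustration, not the actual commissioning script.

```python
#!/usr/bin/env python3
# Sketch of a commissioning-time wipe (assumption: run after the RAID
# controllers are reconfigured, before deployment). Not the actual script.
import glob
import subprocess

# Hypothetical device selection; adjust to the node's real disk naming.
disks = sorted(glob.glob('/dev/sd[a-z]') + glob.glob('/dev/nvme[0-9]n[0-9]'))

for disk in disks:
    # 'wipefs --all' erases known filesystem/RAID/bcache signatures so a
    # later deployment does not rediscover stale metadata on these drives.
    subprocess.run(['wipefs', '--all', disk], check=False)
```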
Re: [Bug 1778704] Re: redeployment of node with bcache fails
On Fri, Jun 29, 2018 at 11:20 AM Gábor Mészáros wrote:
>
> That I can do later, not now.
> What I see now though is that the mds are not clean and a resync started:

Yes; the updated curtin does a number of things to prevent this from happening:

1) we wipe the first and last 1M of the raid device itself
2) we fail each member of the array, forcing them out
3) we wipe the first and last 1M of each disk in the array
4) we stop and remove the md device

(A rough sketch of this sequence, under stated assumptions, follows at the end of this message.)

I've seen this exact scenario when we updated curtin to handle this case, so I'm keenly interested in seeing the install log to see what exactly happened. I've taken the storage config you've provided and attempted to reproduce, but I'm not able to at this time.

In the curtin install log, I'm specifically looking for the device tree and shutdown plan:

[ 297.990094] cloud-init[1503]: Current device storage tree:
[ 297.990508] cloud-init[1503]: vda
[ 297.991362] cloud-init[1503]: `-- vda1
[ 297.991922] cloud-init[1503]: `-- md2
[ 297.992729] cloud-init[1503]: `-- bcache0
[ 297.993098] cloud-init[1503]: vdb
[ 297.993678] cloud-init[1503]: `-- md2
[ 297.996101] cloud-init[1503]: `-- bcache0
[ 297.996425] cloud-init[1503]: vdc
[ 297.997004] cloud-init[1503]: `-- md2
[ 297.997293] cloud-init[1503]: `-- bcache0
[ 297.997694] cloud-init[1503]: vdd
[ 297.998112] cloud-init[1503]: `-- md2
[ 297.998511] cloud-init[1503]: `-- bcache0
[ 297.998797] cloud-init[1503]: vde
[ 297.999382] cloud-init[1503]: |-- vde1
[ 297.999805] cloud-init[1503]: |-- vde2
[ 298.000298] cloud-init[1503]: | `-- md0
[ 298.002438] cloud-init[1503]: `-- vde3
[ 298.002718] cloud-init[1503]: `-- md1
[ 298.003113] cloud-init[1503]: `-- bcache0
[ 298.003760] cloud-init[1503]: vdf
[ 298.004168] cloud-init[1503]: |-- vdf1
[ 298.004576] cloud-init[1503]: |-- vdf2
[ 298.007226] cloud-init[1503]: | `-- md0
[ 298.007661] cloud-init[1503]: `-- vdf3
[ 298.008095] cloud-init[1503]: `-- md1
[ 298.008493] cloud-init[1503]: `-- bcache0
[ 298.009085] cloud-init[1503]: vdg
[ 298.009516] cloud-init[1503]: Shutdown Plan:
[ 298.011795] cloud-init[1503]: {'device': '/sys/class/block/bcache0', 'level': 3, 'dev_type': 'bcache'}
[ 298.012072] cloud-init[1503]: {'device': '/sys/class/block/md0', 'level': 2, 'dev_type': 'raid'}
[ 298.012503] cloud-init[1503]: {'device': '/sys/class/block/md2', 'level': 2, 'dev_type': 'raid'}
[ 298.012881] cloud-init[1503]: {'device': '/sys/class/block/md1', 'level': 2, 'dev_type': 'raid'}
[ 298.016125] cloud-init[1503]: {'device': '/sys/class/block/vdf/vdf1', 'level': 1, 'dev_type': 'partition'}
[ 298.018322] cloud-init[1503]: {'device': '/sys/class/block/vdf/vdf3', 'level': 1, 'dev_type': 'partition'}
[ 298.020177] cloud-init[1503]: {'device': '/sys/class/block/vde/vde1', 'level': 1, 'dev_type': 'partition'}
[ 298.022210] cloud-init[1503]: {'device': '/sys/class/block/vda/vda1', 'level': 1, 'dev_type': 'partition'}
[ 298.024264] cloud-init[1503]: {'device': '/sys/class/block/vde/vde3', 'level': 1, 'dev_type': 'partition'}
[ 298.026705] cloud-init[1503]: {'device': '/sys/class/block/vdf/vdf2', 'level': 1, 'dev_type': 'partition'}
[ 298.028756] cloud-init[1503]: {'device': '/sys/class/block/vde/vde2', 'level': 1, 'dev_type': 'partition'}
[ 298.033613] cloud-init[1503]: {'device': '/sys/class/block/vdb', 'level': 0, 'dev_type': 'disk'}
[ 298.033821] cloud-init[1503]: {'device': '/sys/class/block/vdg', 'level': 0, 'dev_type': 'disk'}
[ 298.034262] cloud-init[1503]: {'device': '/sys/class/block/vda', 'level': 0, 'dev_type': 'disk'}
[ 298.034838] cloud-init[1503]: {'device': '/sys/class/block/vdd', 'level': 0, 'dev_type': 'disk'}
[ 298.035435] cloud-init[1503]: {'device': '/sys/class/block/vdf', 'level': 0, 'dev_type': 'disk'}
[ 298.039619] cloud-init[1503]: {'device': '/sys/class/block/vde', 'level': 0, 'dev_type': 'disk'}
[ 298.042112] cloud-init[1503]: {'device': '/sys/class/block/vdc', 'level': 0, 'dev_type': 'disk'}

We first stop the bcache device, then proceed to the raid devices, then the members of the raid, and then the underlying disks.

>
> Jun 29 14:42:36 ic-skbrat2-s40pxtg mdadm[2746]: RebuildStarted event detected on md device /dev/md2
> Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [ 279.213933] md/raid1:md0: not clean -- starting background reconstruction
> Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [ 279.213936] md/raid1:md0: active with 2 out of 2 mirrors
> Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [ 279.213968] md0: detected capacity change from 0 to 1995440128
> Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [ 279.214033] md: resync of RAID array md0
> Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [ 279.214039] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [ 279.214041] md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) for resync.
> Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [ 279.214051] md: using
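For illustration, here is a rough sketch of the teardown order described in steps 1-4 above, assuming a simple two-member RAID1. This is not curtin's implementation, and the device names are hypothetical.

```python
# Sketch of the teardown order described above (not curtin's code):
# 1) wipe the ends of the assembled array, 2) fail each member out,
# 3) wipe the ends of each member, 4) stop the array.
import os
import subprocess

MIB = 1024 * 1024

def wipe_ends(dev, length=MIB):
    """Zero the first and last 1 MiB of a block device."""
    with open(dev, 'r+b') as f:
        size = f.seek(0, os.SEEK_END)
        f.seek(0)
        f.write(b'\0' * length)               # clear leading metadata (bcache, GPT, ...)
        f.seek(max(0, size - length))
        f.write(b'\0' * length)               # clear trailing metadata (mdadm 1.0, GPT backup)
        f.flush()
        os.fsync(f.fileno())

# Hypothetical array layout for the example.
md, members = '/dev/md0', ['/dev/vde2', '/dev/vdf2']

wipe_ends(md)                                                # 1) wipe the raid device itself
for member in members:
    subprocess.check_call(['mdadm', md, '--fail', member,
                           '--remove', member])              # 2) force the member out
    wipe_ends(member)                                        # 3) wipe the member device
subprocess.check_call(['mdadm', '--stop', md])               # 4) stop the array (curtin also
                                                             #    removes the md device afterwards)
```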
[Bug 1778704] Re: redeployment of node with bcache fails
That I can do later, not now.
What I see now though is that the mds are not clean and a resync started:

Jun 29 14:42:36 ic-skbrat2-s40pxtg mdadm[2746]: RebuildStarted event detected on md device /dev/md2
Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [ 279.213933] md/raid1:md0: not clean -- starting background reconstruction
Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [ 279.213936] md/raid1:md0: active with 2 out of 2 mirrors
Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [ 279.213968] md0: detected capacity change from 0 to 1995440128
Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [ 279.214033] md: resync of RAID array md0
Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [ 279.214039] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [ 279.214041] md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) for resync.
Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [ 279.214051] md: using 128k window, over a total of 1948672k.
Jun 29 14:42:36 ic-skbrat2-s40pxtg mdadm[2746]: NewArray event detected on md device /dev/md0
Jun 29 14:42:36 ic-skbrat2-s40pxtg mdadm[2746]: RebuildStarted event detected on md device /dev/md0
Jun 29 14:42:42 ic-skbrat2-s40pxtg mdadm[2746]: Rebuild51 event detected on md device /dev/md0
Jun 29 14:42:42 ic-skbrat2-s40pxtg mdadm[2746]: NewArray event detected on md device /dev/md1
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [ 284.312105] md: bind
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [ 284.312238] md: bind
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [ 284.313697] md/raid1:md1: not clean -- starting background reconstruction
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [ 284.313701] md/raid1:md1: active with 2 out of 2 mirrors
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [ 284.313774] created bitmap (2 pages) for device md1
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [ 284.314044] md1: bitmap initialized from disk: read 1 pages, set 3515 of 3515 bits
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [ 284.314138] md1: detected capacity change from 0 to 235862491136
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [ 284.314228] md: delaying resync of md1 until md0 has finished (they share one or more physical units)
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [ 284.412570] bcache: bch_journal_replay() journal replay done, 0 keys in 2 entries, seq 78030
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [ 284.437013] bcache: bch_cached_dev_attach() Caching md2 as bcache0 on set 38d7614a-32f6-4e4f-a044-ab0f06434bf4
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [ 284.437033] bcache: register_cache() registered cache device md1
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [ 284.454171] bcache: register_bcache() error opening /dev/md1: device already registered
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [ 284.532188] bcache: register_bcache() error opening /dev/md1: device already registered
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [ 284.563413] bcache: register_bcache() error opening /dev/md1: device already registered
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [ 284.642738] bcache: register_bcache() error opening /dev/md2: device already registered (emitting change event)
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [ 284.702291] bcache: register_bcache() error opening /dev/md2: device already registered (emitting change event)
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [ 284.748625] bcache: register_bcache() error opening /dev/md1: device already registered
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [ 284.772383] bcache: register_bcache() error opening /dev/md1: device already registered
Jun 29 14:42:42 ic-skbrat2-s40pxtg cloud-init[4053]: An error occured handling 'bcache0': RuntimeError - ('Unexpected old bcache device: %s', '/dev/md2')
Jun 29 14:42:42 ic-skbrat2-s40pxtg cloud-init[4053]: ('Unexpected old bcache device: %s', '/dev/md2')
Jun 29 14:42:42 ic-skbrat2-s40pxtg cloud-init[4053]: curtin: Installation failed with exception: Unexpected error while running command.
Jun 29 14:42:42 ic-skbrat2-s40pxtg cloud-init[4053]: Command: ['curtin', 'block-meta', 'custom']
[Bug 1778704] Re: redeployment of node with bcache fails
Can you provide the curtin install log?

** Changed in: curtin
       Status: New => Incomplete
[Bug 1778704] Re: redeployment of node with bcache fails
Unfortunately the issue still exists on my deployment.

** Changed in: curtin
       Status: Incomplete => New
[Bug 1778704] Re: redeployment of node with bcache fails
curtin: Installation started. (18.1-17-gae48e86f-0ubuntu1~16.04.1)
third party drivers not installed or necessary.
An error occured handling 'bcache0': RuntimeError - ('Unexpected old bcache device: %s', '/dev/md2')
('Unexpected old bcache device: %s', '/dev/md2')
curtin: Installation failed with exception: Unexpected error while running command.
Command: ['curtin', 'block-meta', 'custom']
Exit code: 3
Reason: -
Stdout: An error occured handling 'bcache0': RuntimeError - ('Unexpected old bcache device: %s', '/dev/md2')
        ('Unexpected old bcache device: %s', '/dev/md2')
Stderr: ''
[Bug 1778704] Re: redeployment of node with bcache fails
Hello,

I believe this issue is resolved in this commit[1]. There is a new version of curtin for Xenial which contains the fix. Can you update curtin to the latest?

curtin | 18.1-17-gae48e86f-0ubuntu1~16.04.1 | xenial-updates/universe | all
[Bug 1778704] Re: redeployment of node with bcache fails
The commit:
https://git.launchpad.net/curtin/commit/?id=4bf7750b6010f0483b22ff626eab58d6a85bb072

** Changed in: curtin
       Status: New => Incomplete
[Bug 1778704] Re: redeployment of node with bcache fails
Hi Gabor,

If the re-deployment is failing, this sounds like curtin is not correctly cleaning the disks or otherwise allowing the reinstallation to work, even though MAAS is sending the 'wipe: superblock' option. I'm marking this as invalid for MAAS, and opening a task for curtin.

** Also affects: curtin
   Importance: Undecided
       Status: New

** Also affects: curtin (Ubuntu)
   Importance: Undecided
       Status: New

** Changed in: maas
       Status: New => Invalid