[Bug 1778704] Re: redeployment of node with bcache fails

2018-07-05 Thread Gábor Mészáros
** Changed in: curtin
   Status: Incomplete => Invalid

** Changed in: curtin (Ubuntu)
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1778704

Title:
  redeployment of node with bcache fails

To manage notifications about this bug go to:
https://bugs.launchpad.net/curtin/+bug/1778704/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1778704] Re: redeployment of node with bcache fails

2018-07-02 Thread Ryan Harper
> I assume curtin only does the wipefs when the bcache devices are
> attached, correct?

Yes; curtin probes the disks that are present and activates them as
needed.  Bcache is activated by the kernel itself if the metadata header
is present on any block device.

If the disks are not attached to the endpoint when curtin is running, then
it cannot clear them.
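
For reference, a quick way to check whether an attached block device still
carries a bcache superblock is to ask blkid for its type.  This is only an
illustrative sketch (the device name is an example), not curtin's code:

    import subprocess

    def has_bcache_superblock(device):
        """Return True if blkid reports a bcache superblock on `device`."""
        try:
            out = subprocess.check_output(
                ['blkid', '-o', 'value', '-s', 'TYPE', device])
        except subprocess.CalledProcessError:
            # blkid exits non-zero when no recognizable signature is found
            return False
        return out.decode().strip() == 'bcache'

    print(has_bcache_superblock('/dev/md2'))  # example device from this report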

However, I'm not sure that's what we're seeing.

Unless the bcache disks were somehow not present during the initial portion
of curtin that wipes devices (for example, if a disk were delayed or
hotplugged), curtin should see that there is a bcache device on top of the
raided partitions, and it will stop and wipe the bcache device.

The error message you see happens after wiping is complete: curtin expects
that /dev/bcache0 should NOT already exist when it is attempting to create a
new bcache.  Curtin validates that /dev/bcacheN is composed of the devices in
its configuration, and when it encounters an existing bcache backing device
before it issues the creation command, it fails with a RuntimeError.
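
Conceptually, that validation amounts to something like the sketch below; the
sysfs check and function name are my own illustration, not curtin's actual
implementation (only the error message matches the log):

    import os

    def assert_no_old_bcache(backing_device):
        """Fail if the backing device is already part of an existing bcache."""
        # A backing device that is already registered with bcache exposes a
        # 'bcache' directory under its sysfs node, e.g.
        # /sys/class/block/md2/bcache
        sys_node = '/sys/class/block/%s' % os.path.basename(backing_device)
        if os.path.exists(os.path.join(sys_node, 'bcache')):
            raise RuntimeError('Unexpected old bcache device: %s',
                               backing_device)

    assert_no_old_bcache('/dev/md2')  # raises if md2 is still registered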

The only way I can think this would occur is if the devices are not yet
available at the time we start clearing the devices.

If the raid array were damaged and not yet complete, it's possible that the
bcache metadata had not yet been restored, so initially we would only have
the raid arrays; when they finish rebuilding, the bcache layer may get
activated, and at that point we would not have wiped the bcache metadata.

However, I do not think curtin would allow this, as it would have stopped the
raid arrays and wiped the contents of the composed devices (/dev/md0, etc.)
as well as the underlying devices.

[Bug 1778704] Re: redeployment of node with bcache fails

2018-07-02 Thread Gábor Mészáros
Ryan,

I did some more testing, and it seems the issue originates from how I do
storage preparation with the vendor's proprietary tool: during commissioning
I reconfigure all the RAID controllers, and while that happens the drives get
disconnected and re-plugged later.  As a result, curtin does not see the
bcache drives as attached, but the physical layout on the drives remains.

I assume curtin only does the wipefs when the bcache devices are attached,
correct?
What I did for now (maybe just a workaround) is to wipe the drives from the
drive-configuration commissioning script.
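
A rough sketch of such a commissioning-time wipe (the device list is only an
example; adapt it to the actual layout):

    import subprocess

    # Hypothetical devices for this layout: clear bcache/md signatures before
    # MAAS hands the node over to curtin for deployment.
    DEVICES = ['/dev/md2', '/dev/md1', '/dev/md0',
               '/dev/sda', '/dev/sdb', '/dev/sdc', '/dev/sdd']

    for dev in DEVICES:
        # wipefs -a removes all known filesystem/raid/bcache signatures
        subprocess.run(['wipefs', '--all', dev], check=False)
        # belt and braces: also zero any md superblock left on the device
        subprocess.run(['mdadm', '--zero-superblock', dev], check=False)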

Re: [Bug 1778704] Re: redeployment of node with bcache fails

2018-06-29 Thread Ryan Harper
On Fri, Jun 29, 2018 at 11:20 AM Gábor Mészáros wrote:
>
> I can provide that later, not right now.
> What I see now, though, is that the md arrays are not clean and a resync
> has started:

Yes; the updated curtin does a number of things to prevent this from
happening (a rough sketch of these steps follows the list):

1) we wipe the first and last 1M of the raid device itself
2) we fail each member of the array, forcing them out
3) we wipe the first and last 1M of each disk in the array
4) we stop and remove the md device
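
Illustratively (not curtin's actual code; the array and member names are
examples), those four steps amount to:

    import subprocess

    ONE_MB = 1024 * 1024

    def wipe_ends(dev):
        """Zero the first and last 1M of a block device."""
        sectors = int(open('/sys/class/block/%s/size'
                           % dev.split('/')[-1]).read())
        with open(dev, 'rb+') as f:
            f.write(b'\0' * ONE_MB)              # first 1M
            f.seek(sectors * 512 - ONE_MB)
            f.write(b'\0' * ONE_MB)              # last 1M

    md, members = '/dev/md0', ['/dev/vde2', '/dev/vdf2']    # example layout

    wipe_ends(md)                                            # step 1
    for member in members:                                   # step 2
        subprocess.run(['mdadm', '--manage', md, '--fail', member],
                       check=False)
        subprocess.run(['mdadm', '--manage', md, '--remove', member],
                       check=False)
        wipe_ends(member)                                    # step 3
    subprocess.run(['mdadm', '--stop', md], check=False)     # step 4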

I've seen this exact scenario before, when we updated curtin to handle this
case, so I'm keenly interested in seeing the install log to find out what
exactly happened.

I've taken the storage config you provided and attempted to reproduce, but I
have not been able to so far.  In the curtin install log, I'm specifically
looking for the device tree and the shutdown plan.

[  297.990094] cloud-init[1503]: Current device storage tree:
[  297.990508] cloud-init[1503]: vda
[  297.991362] cloud-init[1503]: `-- vda1
[  297.991922] cloud-init[1503]: `-- md2
[  297.992729] cloud-init[1503]: `-- bcache0
[  297.993098] cloud-init[1503]: vdb
[  297.993678] cloud-init[1503]: `-- md2
[  297.996101] cloud-init[1503]: `-- bcache0
[  297.996425] cloud-init[1503]: vdc
[  297.997004] cloud-init[1503]: `-- md2
[  297.997293] cloud-init[1503]: `-- bcache0
[  297.997694] cloud-init[1503]: vdd
[  297.998112] cloud-init[1503]: `-- md2
[  297.998511] cloud-init[1503]: `-- bcache0
[  297.998797] cloud-init[1503]: vde
[  297.999382] cloud-init[1503]: |-- vde1
[  297.999805] cloud-init[1503]: |-- vde2
[  298.000298] cloud-init[1503]: |   `-- md0
[  298.002438] cloud-init[1503]: `-- vde3
[  298.002718] cloud-init[1503]: `-- md1
[  298.003113] cloud-init[1503]: `-- bcache0
[  298.003760] cloud-init[1503]: vdf
[  298.004168] cloud-init[1503]: |-- vdf1
[  298.004576] cloud-init[1503]: |-- vdf2
[  298.007226] cloud-init[1503]: |   `-- md0
[  298.007661] cloud-init[1503]: `-- vdf3
[  298.008095] cloud-init[1503]: `-- md1
[  298.008493] cloud-init[1503]: `-- bcache0
[  298.009085] cloud-init[1503]: vdg
[  298.009516] cloud-init[1503]: Shutdown Plan:
[  298.011795] cloud-init[1503]: {'device':
'/sys/class/block/bcache0', 'level': 3, 'dev_type': 'bcache'}
[  298.012072] cloud-init[1503]: {'device': '/sys/class/block/md0',
'level': 2, 'dev_type': 'raid'}
[  298.012503] cloud-init[1503]: {'device': '/sys/class/block/md2',
'level': 2, 'dev_type': 'raid'}
[  298.012881] cloud-init[1503]: {'device': '/sys/class/block/md1',
'level': 2, 'dev_type': 'raid'}
[  298.016125] cloud-init[1503]: {'device':
'/sys/class/block/vdf/vdf1', 'level': 1, 'dev_type': 'partition'}
[  298.018322] cloud-init[1503]: {'device':
'/sys/class/block/vdf/vdf3', 'level': 1, 'dev_type': 'partition'}
[  298.020177] cloud-init[1503]: {'device':
'/sys/class/block/vde/vde1', 'level': 1, 'dev_type': 'partition'}
[  298.022210] cloud-init[1503]: {'device':
'/sys/class/block/vda/vda1', 'level': 1, 'dev_type': 'partition'}
[  298.024264] cloud-init[1503]: {'device':
'/sys/class/block/vde/vde3', 'level': 1, 'dev_type': 'partition'}
[  298.026705] cloud-init[1503]: {'device':
'/sys/class/block/vdf/vdf2', 'level': 1, 'dev_type': 'partition'}
[  298.028756] cloud-init[1503]: {'device':
'/sys/class/block/vde/vde2', 'level': 1, 'dev_type': 'partition'}
[  298.033613] cloud-init[1503]: {'device': '/sys/class/block/vdb',
'level': 0, 'dev_type': 'disk'}
[  298.033821] cloud-init[1503]: {'device': '/sys/class/block/vdg',
'level': 0, 'dev_type': 'disk'}
[  298.034262] cloud-init[1503]: {'device': '/sys/class/block/vda',
'level': 0, 'dev_type': 'disk'}
[  298.034838] cloud-init[1503]: {'device': '/sys/class/block/vdd',
'level': 0, 'dev_type': 'disk'}
[  298.035435] cloud-init[1503]: {'device': '/sys/class/block/vdf',
'level': 0, 'dev_type': 'disk'}
[  298.039619] cloud-init[1503]: {'device': '/sys/class/block/vde',
'level': 0, 'dev_type': 'disk'}
[  298.042112] cloud-init[1503]: {'device': '/sys/class/block/vdc',
'level': 0, 'dev_type': 'disk'}

We first stop the bcache device, then proceed to the raid devices,
then members of the raid, and then underlying disks.
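
The ordering follows the 'level' values in the plan above, highest level
first.  As a sketch (the handler functions here are made up for illustration,
they are not curtin's):

    # Entries shaped like the shutdown plan logged above (abbreviated).
    plan = [
        {'device': '/sys/class/block/bcache0', 'level': 3,
         'dev_type': 'bcache'},
        {'device': '/sys/class/block/md2', 'level': 2, 'dev_type': 'raid'},
        {'device': '/sys/class/block/vda/vda1', 'level': 1,
         'dev_type': 'partition'},
        {'device': '/sys/class/block/vda', 'level': 0, 'dev_type': 'disk'},
    ]

    # Hypothetical per-type shutdown handlers.
    handlers = {
        'bcache': lambda dev: print('stop bcache', dev),
        'raid': lambda dev: print('stop raid', dev),
        'partition': lambda dev: print('wipe partition', dev),
        'disk': lambda dev: print('wipe disk', dev),
    }

    # Highest level first: bcache before its raid devices, raid devices
    # before their member partitions, members before the underlying disks.
    for entry in sorted(plan, key=lambda e: e['level'], reverse=True):
        handlers[entry['dev_type']](entry['device'])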


>
> Jun 29 14:42:36 ic-skbrat2-s40pxtg mdadm[2746]: RebuildStarted event detected 
> on md device /dev/md2
> Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [  279.213933] md/raid1:md0: not 
> clean -- starting background reconstruction
> Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [  279.213936] md/raid1:md0: 
> active with 2 out of 2 mirrors
> Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [  279.213968] md0: detected 
> capacity change from 0 to 1995440128
> Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [  279.214033] md: resync of RAID 
> array md0
> Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [  279.214039] md: minimum 
> _guaranteed_  speed: 1000 KB/sec/disk.
> Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [  279.214041] md: using maximum 
> available idle IO bandwidth (but not more than 20 KB/sec) for resync.
> Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [  279.214051] md: 

[Bug 1778704] Re: redeployment of node with bcache fails

2018-06-29 Thread Gábor Mészáros
I can provide that later, not right now.
What I see now, though, is that the md arrays are not clean and a resync has
started:

Jun 29 14:42:36 ic-skbrat2-s40pxtg mdadm[2746]: RebuildStarted event detected 
on md device /dev/md2
Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [  279.213933] md/raid1:md0: not 
clean -- starting background reconstruction
Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [  279.213936] md/raid1:md0: active 
with 2 out of 2 mirrors
Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [  279.213968] md0: detected 
capacity change from 0 to 1995440128
Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [  279.214033] md: resync of RAID 
array md0
Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [  279.214039] md: minimum 
_guaranteed_  speed: 1000 KB/sec/disk.
Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [  279.214041] md: using maximum 
available idle IO bandwidth (but not more than 20 KB/sec) for resync.
Jun 29 14:42:36 ic-skbrat2-s40pxtg kernel: [  279.214051] md: using 128k 
window, over a total of 1948672k.
Jun 29 14:42:36 ic-skbrat2-s40pxtg mdadm[2746]: NewArray event detected on md 
device /dev/md0
Jun 29 14:42:36 ic-skbrat2-s40pxtg mdadm[2746]: RebuildStarted event detected 
on md device /dev/md0
Jun 29 14:42:42 ic-skbrat2-s40pxtg mdadm[2746]: Rebuild51 event detected on md 
device /dev/md0
Jun 29 14:42:42 ic-skbrat2-s40pxtg mdadm[2746]: NewArray event detected on md 
device /dev/md1
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [  284.312105] md: bind
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [  284.312238] md: bind
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [  284.313697] md/raid1:md1: not 
clean -- starting background reconstruction
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [  284.313701] md/raid1:md1: active 
with 2 out of 2 mirrors
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [  284.313774] created bitmap (2 
pages) for device md1
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [  284.314044] md1: bitmap 
initialized from disk: read 1 pages, set 3515 of 3515 bits
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [  284.314138] md1: detected 
capacity change from 0 to 235862491136
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [  284.314228] md: delaying resync 
of md1 until md0 has finished (they share one or more physical units)
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [  284.412570] bcache: 
bch_journal_replay() journal replay done, 0 keys in 2 entries, seq 78030
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [  284.437013] bcache: 
bch_cached_dev_attach() Caching md2 as bcache0 on set 
38d7614a-32f6-4e4f-a044-ab0f06434bf4
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [  284.437033] bcache: 
register_cache() registered cache device md1
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [  284.454171] bcache: 
register_bcache() error opening /dev/md1: device already registered
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [  284.532188] bcache: 
register_bcache() error opening /dev/md1: device already registered
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [  284.563413] bcache: 
register_bcache() error opening /dev/md1: device already registered
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [  284.642738] bcache: 
register_bcache() error opening /dev/md2: device already registered (emitting 
change event)
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [  284.702291] bcache: 
register_bcache() error opening /dev/md2: device already registered (emitting 
change event)
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [  284.748625] bcache: 
register_bcache() error opening /dev/md1: device already registered
Jun 29 14:42:42 ic-skbrat2-s40pxtg kernel: [  284.772383] bcache: 
register_bcache() error opening /dev/md1: device already registered
Jun 29 14:42:42 ic-skbrat2-s40pxtg cloud-init[4053]: An error occured handling 
'bcache0': RuntimeError - ('Unexpected old bcache device: %s', '/dev/md2')
Jun 29 14:42:42 ic-skbrat2-s40pxtg cloud-init[4053]: ('Unexpected old bcache 
device: %s', '/dev/md2')
Jun 29 14:42:42 ic-skbrat2-s40pxtg cloud-init[4053]: curtin: Installation 
failed with exception: Unexpected error while running command.
Jun 29 14:42:42 ic-skbrat2-s40pxtg cloud-init[4053]: Command: ['curtin', 
'block-meta', 'custom']

[Bug 1778704] Re: redeployment of node with bcache fails

2018-06-29 Thread Ryan Harper
Can you provide the curtin install log?

** Changed in: curtin
   Status: New => Incomplete

[Bug 1778704] Re: redeployment of node with bcache fails

2018-06-29 Thread Gábor Mészáros
Unfortunately, the issue still exists on my deployment.

** Changed in: curtin
   Status: Incomplete => New

[Bug 1778704] Re: redeployment of node with bcache fails

2018-06-29 Thread Gábor Mészáros
curtin: Installation started. (18.1-17-gae48e86f-0ubuntu1~16.04.1)
third party drivers not installed or necessary.
An error occured handling 'bcache0': RuntimeError - ('Unexpected old bcache device: %s', '/dev/md2')
('Unexpected old bcache device: %s', '/dev/md2')
curtin: Installation failed with exception: Unexpected error while running command.
Command: ['curtin', 'block-meta', 'custom']
Exit code: 3
Reason: -
Stdout: An error occured handling 'bcache0': RuntimeError - ('Unexpected old bcache device: %s', '/dev/md2')
('Unexpected old bcache device: %s', '/dev/md2')

Stderr: ''

[Bug 1778704] Re: redeployment of node with bcache fails

2018-06-26 Thread Ryan Harper
Hello,

I believe this issue is resolved in this commit[1].
There is a new version of curtin for Xenial which contains the fix.  Can you 
update curtin to the latest?

curtin | 18.1-17-gae48e86f-0ubuntu1~16.04.1 | xenial-updates/universe | all

[Bug 1778704] Re: redeployment of node with bcache fails

2018-06-26 Thread Ryan Harper
The commit:

https://git.launchpad.net/curtin/commit/?id=4bf7750b6010f0483b22ff626eab58d6a85bb072

** Changed in: curtin
   Status: New => Incomplete

[Bug 1778704] Re: redeployment of node with bcache fails

2018-06-26 Thread Andres Rodriguez
Hi Gabor,

This sounds like, if the re-deployment is failing, curtin is not correctly
"cleaning" the devices or allowing the reinstallation to work, even though
MAAS is sending the wipe: superblock option. I'm marking this as invalid for
MAAS and opening a task for curtin.
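
For context, that wipe: superblock option is part of the storage
configuration MAAS passes to curtin; each disk entry conveys roughly the
following (shown here as the parsed Python structure rather than YAML, with
invented id/path values):

    disk_entry = {
        'id': 'sda',                # invented identifier
        'type': 'disk',
        'path': '/dev/sda',         # invented device path
        'ptable': 'gpt',
        'wipe': 'superblock',       # ask curtin to clear existing signatures
    }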

** Also affects: curtin
   Importance: Undecided
   Status: New

** Also affects: curtin (Ubuntu)
   Importance: Undecided
   Status: New

** Changed in: maas
   Status: New => Invalid
