Re: problem with degraded boot and systemd

2014-05-21 Thread Duncan
On Tue, 20 May 2014 18:51:26 -0600
Chris Murphy li...@colorremedies.com wrote:

 
 On May 20, 2014, at 6:03 PM, Duncan 1i5t5.dun...@cox.net wrote:
  
  
  I'd actually argue that's functioning as it should, since I see
  forced manual intervention in order to mount degraded as a
  FEATURE, NOT A BUG.
 
 Manual intervention is OK for now, when it takes the form of dropping
 to a dracut shell, and only requires the user to pass mount -o
 degraded. To mount degraded automatically is worse because, without a
 notification API for user space, it will lead users to make bad
 choices resulting in data loss.
 
 But the needed sequence is fairly burdensome: force shutdown, boot
 again, use rd.break=pre-mount, then use mount -o degraded, and then
 exit a couple of times.

I haven't had the rootfs fail to mount due to that, but every time it
has failed for other reasons[2], I've been dropped to an emergency shell
prompt, from which I could run the mount manually, or do whatever else
I needed to do.  No force shutdown, boot again...  Just do the manual
mount or whatever, exit, and let the boot process continue from where
it errored out and dropped to the emergency shell.

But now that I think about it, I believe that automatic dropping to an
emergency shell when something goes wrong is a dracut option that I
must have enabled by default.  I can't imagine why anyone would want it
off, thus forcing the reboot and manually adding the rd.break=whatever,
but apparently some folks do, so it's an option.  And I guess if you're
having to do the reboot and add the rd.break manually, you must not
have that option on when you do your dracut initr* builds.

  [1] dracut: I use it here on gentoo as well, because my rootfs is a
  multi-device btrfs and a kernel rootflags=device= line won't parse
  correctly, apparently due to splitting at the wrong =, so I must
  use an initr* despite my preference for a direct initr*-less boot,
  and I use dracut to generate it.
 
 rootflags doesn't take a device argument, it only applies to the
 volume to be mounted at /sysroot, so only one = is needed.

You misunderstand.  To mount a multi-device btrfs, one of two things
must happen to let the kernel know about all the devices.

A) btrfs device scan.

That's userspace, so for a multi-device btrfs
rootfs, it requires an initr* with the btrfs command and something to
trigger it (with dracut it's a udev rule that triggers the btrfs device
scan), before the mount is attempted.

B) btrfs has the device= mount option.

This can be given several times, once for each device in the
multi-device filesystem.

Under normal conditions, the rootflags= kernel commandline option could
thus be used to pass the appropriate device= options for mounting the
rootfs, avoiding the need for an initr* with btrfs device scan
or the device= options passed to a userspace mount.
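
For illustration, the two approaches look something like this from
userspace (device names are examples only):

    # option A: scan first, then a plain mount
    btrfs device scan
    mount /dev/sda5 /mnt

    # option B: name every member device in the mount options
    mount -o device=/dev/sda5,device=/dev/sdb5 /dev/sda5 /mnt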

But trying rootflags=device=/dev/sda5,device=/dev/sdb5,... doesn't
work: the kernel will not mount the filesystem.

rootflags=degraded, by contrast, does work, but then activates the
filesystem with only the single device listed in root= (say
root=/dev/sda5), without the other devices.

So obviously rootflags= itself works, since rootflags=degraded works.
But rootflags=device= does NOT work.  The obvious difference, and
likely bug, is as I said the multiple equals signs, with the kernel
commandline parser apparently trying to parse a parameter called
rootflags=device, instead of a parameter called rootflags with
device= as part of its value.  And of course rootflags=device isn't
rootflags, so it doesn't do what it's supposed to do.
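
Spelled out, the suspected mis-parse would look like this (my reading
of the symptom, not verified against the parser source):

    commandline:  rootflags=device=/dev/sda5,device=/dev/sdb5
    intended:     param "rootflags"         value "device=/dev/sda5,device=/dev/sdb5"
    suspected:    param "rootflags=device"  value "/dev/sda5,device=/dev/sdb5"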

Make more sense now? =:^)

Since I realized the double-equals parsing must be the problem, I've
been meaning to file a kernel bug on it and hopefully get the kernel
commandline parser fixed.  But apparently I have yet to find an
appropriately rounded tuit, since I've not done so yet. =:^(

FWIW, the btrfs kernel devs are aware that using device= with
rootflags= is broken, as it was one of them that originally mentioned
it to me when I was still asking about things, before I had set up my
multi-device btrfs rootfs.  So it's a known issue.  But I'm not sure
they had looked into why; they just knew it didn't work.  And since it
only affects (1) btrfs users who (2) use a multi-device rootfs, *BUT*
(3) do NOT wish to use an initr*, I guess the priority simply hasn't
been high enough for them to investigate further.

So I really should file that bug[3] and get it to the right people.

---
[2] Back a few dracut versions ago, building the initr* with host-only
would tie the initr* to the UUID of the default rootfs.  As long as
that UUID could be found, the usual root= could be used on the kernel
commandline to boot any other rootfs if desired, and naturally, that's
how I tested my backup, with the main rootfs still there and thus its
UUID available.  But then I renewed my backup, tested again that I
could boot to it using root=, and did a fresh mkfs on the main rootfs,
thus of course killing the UUID 

Re: problem with degraded boot and systemd

2014-05-20 Thread Goffredo Baroncelli
On 05/19/2014 02:54 AM, Chris Murphy wrote:
 Summary:
 
 It's insufficient to pass rootflags=degraded to get the system root
 to mount when a device is missing. It looks like when a device is
 missing, udev doesn't create the dev-disk-by-uuid linkage that then
 causes systemd to change the device state from dead to plugged. Only
 once plugged will systemd attempt to mount the volume. This issue
 was brought up on systemd-devel under the subject "timed out waiting
 for device dev-disk-by\x2duuid" for those who want details.
 
[...]
 
 I think the key problem is either a limitation of udev, or a problem
 with the existing udev rule, that prevents the link creation for any
 remaining btrfs device. Or maybe it's intentional. But I'm not a udev
 expert. This is the current udev rule:
 
 # cat /usr/lib/udev/rules.d/64-btrfs.rules
 # do not edit this file, it will be overwritten on update

 SUBSYSTEM!="block", GOTO="btrfs_end"
 ACTION=="remove", GOTO="btrfs_end"
 ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"

 # let the kernel know about this btrfs filesystem, and check if it is
 # complete
 IMPORT{builtin}="btrfs ready $devnode"

 # mark the device as not ready to be used by the system
 ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

 LABEL="btrfs_end"


The key is the line

IMPORT{builtin}="btrfs ready $devnode"

This line sets ID_BTRFS_READY=0 if the filesystem is not ready;
otherwise it sets ID_BTRFS_READY=1 [1].
The next line

ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

sets SYSTEMD_READY=0 if the filesystem is not ready, so the plug event
is not raised to systemd.

This is my understanding.
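
For what it's worth, the same check can be run by hand with the btrfs
tool; as I understand the exit codes, 0 means the filesystem is
complete and non-zero means devices are still missing (device name is
an example):

    btrfs device ready /dev/sda3; echo $?
    # 0 -> all devices known to the kernel, SYSTEMD_READY left alone
    # 1 -> devices missing, SYSTEMD_READY=0, no plug event for systemd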

 
 
 How this works with raid:
 
 RAID assembly is separate from filesystem mount. The volume UUID
 isn't available until the RAID is successfully assembled.
 
 On at least Fedora (dracut) systems with the system root on an md
 device, the initramfs contains 30-parse-md.sh which includes a loop
 to check for the volume UUID. If it's not found, the script sleeps
 for 0.5 seconds, and then looks for it again, up to 240 times. If
 it's still not found at attempt 240, then the script executes mdadm
 -R to forcibly run the array with fewer than all devices present
 (degraded assembly). Now the volume UUID exists, udevd creates the
 linkage, systemd picks this up and changes device state from dead to
 plugged, and then executes a normal mount command.

 The approximate Btrfs equivalent down the road would be a similar
 initrd script, or maybe a user space daemon, that causes btrfs device
 ready to confirm/deny all devices are present. And after x number of
 failures, it'd issue an equivalent to mdadm -R, which right now
 we don't seem to have.

I suggest implementing a mount.btrfs command, which waits for all the
needed disks until a timeout expires. After this timeout it could try
a degraded mount until a second timeout expires. Only then does it fail.

Each time a device appears, the system may start mount.btrfs. Each
invocation has to test whether there is another instance of mount.btrfs
related to the same filesystem; if so it exits, otherwise it follows
the above behavior.
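
A rough sketch of what I have in mind (the names, timeouts and lock
path are invented for illustration, and the lock is per-device rather
than per-filesystem to keep the sketch short):

    #!/bin/sh
    # mount.btrfs DEVICE DIR [OPTIONS] -- sketch only
    dev=$1; dir=$2; opts=${3:-defaults}

    # allow only one instance per device
    lock="/run/lock/mount.btrfs.$(basename "$dev")"
    exec 9>"$lock"
    flock -n 9 || exit 0

    # first timeout: wait up to ~120s for the filesystem to be complete
    i=0
    while [ $i -lt 240 ]; do
        btrfs device ready "$dev" && exec mount -o "$opts" "$dev" "$dir"
        sleep 0.5
        i=$((i + 1))
    done

    # second timeout: retry a degraded mount for up to ~60s more
    i=0
    while [ $i -lt 120 ]; do
        mount -o "$opts,degraded" "$dev" "$dir" && exit 0
        sleep 0.5
        i=$((i + 1))
    done
    exit 1    # only now give up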


 
 That equivalent might be a decoupling of degraded as a mount option,
 such that the user space tool deals with degradedness. And the mount
[...]
 
 Chris Murphy
G.Baroncelli

[1] 
http://lists.freedesktop.org/archives/systemd-commits/2012-September/002503.html

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it)
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: problem with degraded boot and systemd

2014-05-20 Thread Hugo Mills
On Wed, May 21, 2014 at 12:00:24AM +0200, Goffredo Baroncelli wrote:
 On 05/19/2014 02:54 AM, Chris Murphy wrote:
  Summary:
  
  It's insufficient to pass rootflags=degraded to get the system root
  to mount when a device is missing. It looks like when a device is
  missing, udev doesn't create the dev-disk-by-uuid linkage that then
  causes systemd to change the device state from dead to plugged. Only
  once plugged will systemd attempt to mount the volume. This issue
  was brought up on systemd-devel under the subject "timed out waiting
  for device dev-disk-by\x2duuid" for those who want details.
  
 [...]
  
  I think the key problem is either a limitation of udev, or a problem
  with the existing udev rule, that prevents the link creation for any
  remaining btrfs device. Or maybe it's intentional. But I'm not a udev
  expert. This is the current udev rule:
  
  # cat /usr/lib/udev/rules.d/64-btrfs.rules
  # do not edit this file, it will be overwritten on update

  SUBSYSTEM!="block", GOTO="btrfs_end"
  ACTION=="remove", GOTO="btrfs_end"
  ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"

  # let the kernel know about this btrfs filesystem, and check if it is
  # complete
  IMPORT{builtin}="btrfs ready $devnode"

  # mark the device as not ready to be used by the system
  ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

  LABEL="btrfs_end"
 
 
 The key is the line

   IMPORT{builtin}="btrfs ready $devnode"

 This line sets ID_BTRFS_READY=0 if the filesystem is not ready;
 otherwise it sets ID_BTRFS_READY=1 [1].
 The next line

   ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

 sets SYSTEMD_READY=0 if the filesystem is not ready, so the plug event
 is not raised to systemd.

 This is my understanding.
 
  
  
  How this works with raid:
  
  RAID assembly is separate from filesystem mount. The volume UUID
  isn't available until the RAID is successfully assembled.
  
  On at least Fedora (dracut) systems with the system root on an md
  device, the initramfs contains 30-parse-md.sh which includes a loop
  to check for the volume UUID. If it's not found, the script sleeps
  for 0.5 seconds, and then looks for it again, up to 240 times. If
  it's still not found at attempt 240, then the script executes mdadm
  -R to forcibly run the array with fewer than all devices present
  (degraded assembly). Now the volume UUID exists, udevd creates the
  linkage, systemd picks this up and changes device state from dead to
  plugged, and then executes a normal mount command.
 
  The approximate Btrfs equivalent down the road would be a similar
  initrd script, or maybe a user space daemon, that causes btrfs device
  ready to confirm/deny all devices are present. And after x number of
  failures, it'd issue an equivalent to mdadm -R, which right now
  we don't seem to have.
 
 I suggest implementing a mount.btrfs command, which waits for all the
 needed disks until a timeout expires. After this timeout it could try
 a degraded mount until a second timeout expires. Only then does it fail.

 Each time a device appears, the system may start mount.btrfs. Each
 invocation has to test whether there is another instance of mount.btrfs
 related to the same filesystem; if so it exits, otherwise it follows
 the above behavior.

   Don't we already have something approaching this functionality with
btrfs device ready? (i.e. this is exactly what it was designed for).

   Hugo.

  That equivalent might be a decoupling of degraded as a mount option,
  such that the user space tool deals with degradedness. And the mount
 [...]
  
  Chris Murphy
 G.Baroncelli
 
 [1] 
 http://lists.freedesktop.org/archives/systemd-commits/2012-September/002503.html
 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Putting U back in Honor,  Valor, and Trth ---




Re: problem with degraded boot and systemd

2014-05-20 Thread Duncan
Hugo Mills posted on Tue, 20 May 2014 23:26:09 +0100 as excerpted:

 On Wed, May 21, 2014 at 12:00:24AM +0200, Goffredo Baroncelli wrote:
 On 05/19/2014 02:54 AM, Chris Murphy wrote:
 
 It's insufficient to pass rootflags=degraded to get the system root
 to mount when a device is missing. It looks like when a device is
 missing, udev doesn't [...]
 
 This is the current udev rule:
 
 # cat /usr/lib/udev/rules.d/64-btrfs.rules
 # do not edit this file, it will be overwritten on update

 SUBSYSTEM!="block", GOTO="btrfs_end"
 ACTION=="remove", GOTO="btrfs_end"
 ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"

 # let the kernel know about this btrfs filesystem, and check if it is
 # complete
 IMPORT{builtin}="btrfs ready $devnode"

 # mark the device as not ready to be used by the system
 ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

 LABEL="btrfs_end"
 
 The key is the line

  IMPORT{builtin}="btrfs ready $devnode"

 This line sets ID_BTRFS_READY=0 if the filesystem is not ready;
 otherwise it sets ID_BTRFS_READY=1 [1].
 The next line

  ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

 sets SYSTEMD_READY=0 if the filesystem is not ready, so the plug event
 is not raised to systemd.

 This is my understanding.

Looks correct to me. =:^)

 How this works with raid:
 
 RAID assembly is separate from filesystem mount. The volume UUID
 isn't available until the RAID is successfully assembled.
 
 On at least Fedora (dracut) systems with the system root on an md
 device, the initramfs contains 30-parse-md.sh [with a sleep loop and 
 a timeout]
 
 The approximate Btrfs equivalent down the road would be a similar
 initrd script, or maybe a user space daemon, that causes btrfs device
 ready to confirm/deny all devices are present. And after x number of
 failures, it'd issue an equivalent to mdadm -R, which right now
 we don't seem to have.
 
 I suggest implementing a mount.btrfs command, which waits for all the
 needed disks until a timeout expires. After this timeout it could try a
 degraded mount until a second timeout expires. Only then does it fail.

 Each time a device appears, the system may start mount.btrfs. Each
 invocation has to test whether there is another instance of mount.btrfs
 related to the same filesystem; if so it exits, otherwise it follows the
 above behavior.
 
 Don't we already have something approaching this functionality with
 btrfs device ready? (i.e. this is exactly what it was designed for).

Well, sort of.

btrfs device ready is used directly in the udev rule quoted above.  And 
in the non-degraded case it works as intended, checking if the filesystem 
is complete and only letting the udev plug event complete when all 
devices are available.

But this thread is about a degraded state mount, with devices missing.  
In that case, the missing devices never appear so the plug event never 
happens, so systemd will never mount the device, despite the fact that 
degraded was specifically passed as an option, indicating that the admin 
wants the mount to happen anyway.

In dracut[1] (on gentoo), the result is an eventual timeout on rootfs 
appearing and a kick to the initr* rescue shell prompt, where an admin 
can manually mount using the degraded option and continue from there.

I'd actually argue that's functioning as it should, since I see forced 
manual intervention in order to mount degraded as a FEATURE, NOT A BUG.

But nevertheless, being able to effectively pass degraded either as 
part of rootflags or in the fstab that dracut (and systemd in dracut) 
use, such that a degraded mount could still be automated, could I 
suppose be seen as a feature, to some.

To do that would require a script with a countdown and timeout, first 
waiting for the undegraded ready state (and thus a normal mount), then, 
if all devices don't appear, bypassing the ready test and plugging the 
device anyway, letting mount try it if the degraded option was passed, 
and only if THAT fails falling back to the emergency shell prompt.

Note that such a script wouldn't have to actually check for degraded in 
the mount options, only fall back to plugging without all devices if the 
complete timeout triggered, since mount would then take care of success/
failure on its own based on whether the degraded option was passed, just 
as it does if a mount is attempted on an incomplete btrfs at other times.
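
In initr* terms, such a script might look something like this (the
variable names, the 240 x 0.5s countdown borrowed from the md example,
and the emergency_shell fallback are all illustrative):

    # wait for the filesystem to become complete...
    count=0
    while ! btrfs device ready "$root_dev"; do
        count=$((count + 1))
        [ $count -ge 240 ] && break    # complete-timeout triggered
        sleep 0.5
    done

    # ...then attempt the mount regardless; mount itself succeeds or
    # fails based on whether rootflags contained "degraded"
    mount -o "$rootflags" "$root_dev" /sysroot || emergency_shell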

---
[1] dracut: I use it here on gentoo as well, because my rootfs is a multi-
device btrfs and a kernel rootflags=device= line won't parse correctly, 
apparently due to splitting at the wrong =, so I must use an initr* 
despite my preference for a direct initr*-less boot, and I use dracut to 
generate it.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: problem with degraded boot and systemd

2014-05-20 Thread Chris Murphy

On May 20, 2014, at 6:03 PM, Duncan 1i5t5.dun...@cox.net wrote:
 
 
 I'd actually argue that's functioning as it should, since I see forced 
 manual intervention in order to mount degraded as a FEATURE, NOT A BUG.

Manual intervention is OK for now, when it takes the form of dropping to a 
dracut shell, and only requires the user to pass mount -o degraded. To mount 
degraded automatically is worse because, without a notification API for user 
space, it will lead users to make bad choices resulting in data loss.

But the needed sequence is fairly burdensome: force shutdown, boot again, use 
rd.break=pre-mount, then use mount -o degraded, and then exit a couple of times.


 [1] dracut: I use it here on gentoo as well, because my rootfs is a multi-
 device btrfs and a kernel rootflags=device= line won't parse correctly, 
 apparently due to splitting at the wrong =, so I must use an initr* 
 despite my preference for a direct initr*-less boot, and I use dracut to 
 generate it.

rootflags doesn't take a device argument, it only applies to the volume to be 
mounted at /sysroot, so only one = is needed.


Chris Murphy


problem with degraded boot and systemd

2014-05-18 Thread Chris Murphy
Summary:

It's insufficient to pass rootflags=degraded to get the system root to mount 
when a device is missing. It looks like when a device is missing, udev doesn't 
create the dev-disk-by-uuid linkage that then causes systemd to change the 
device state from dead to plugged. Only once plugged will systemd attempt to 
mount the volume. This issue was brought up on systemd-devel under the subject 
"timed out waiting for device dev-disk-by\x2duuid" for those who want details.

Workaround:

I tested systemd 208-16.fc20, and 212-4.fc21. Both will wait indefinitely for 
dev-disk-by\x2duuid, and fail to drop to a dracut shell for a manual recovery 
attempt. That seems like a bug to me so I filed that here:
https://bugzilla.redhat.com/show_bug.cgi?id=1096910

Therefore, first the system must be forced to shutdown, rebooted with boot 
param rd.break=pre-mount to get to a dracut shell before the wait for root by 
uuid begins. Then:

# mount -o subvol=root,ro,degraded <device> /sysroot
# exit
# exit

And then it boots normally. Fortunately btrfs fi show works so you can mount 
with -U or with a non-missing /dev device.
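
In other words, something along these lines (the UUID placeholder is mine):

# btrfs fi show
# mount -o subvol=root,ro,degraded -U <volume-uuid> /sysroot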


What's going on:

Example of a 2 device Btrfs raid1 volume, using sda3 and sdb3.

Since boot parameter root=UUID= is used, systemd is expecting to issue the 
mount command referencing that particular volume UUID. When all devices are 
available, systemd-udevd produces entries like this for each device:

[2.168697] localhost.localdomain systemd-udevd[109]: creating link 
'/dev/disk/by-uuid/9ff63135-ce42-4447-a6de-d7c9b4fb6d66' to '/dev/sda3'
[2.170232] localhost.localdomain systemd-udevd[135]: creating link 
'/dev/disk/by-uuid/9ff63135-ce42-4447-a6de-d7c9b4fb6d66' to '/dev/sdb3'

But when just one device is missing, neither link is created by udev, and 
that's the show stopper. 

When all devices are present, the links are created, and systemd changes the 
dev-disk-by-uuid device from dead to plugged like this:

[2.176280] localhost.localdomain systemd[1]: 
dev-disk-by\x2duuid-9ff63135\x2dce42\x2d4447\x2da6de\x2dd7c9b4fb6d66.device 
changed dead -> plugged

And then systemd will initiate the command to mount it.

[2.177501] localhost.localdomain systemd[1]: Job 
dev-disk-by\x2duuid-9ff63135\x2dce42\x2d4447\x2da6de\x2dd7c9b4fb6d66.device/start
 finished, result=done
[2.586488] localhost.localdomain systemd[152]: Executing: /bin/mount 
/dev/disk/by-uuid/9ff63135-ce42-4447-a6de-d7c9b4fb6d66 /sysroot -t auto -o 
ro,ro,subvol=root

I think the key problem is either a limitation of udev, or a problem with the 
existing udev rule, that prevents the link creation for any remaining btrfs 
device. Or maybe it's intentional. But I'm not a udev expert. This is the 
current udev rule:

# cat /usr/lib/udev/rules.d/64-btrfs.rules
# do not edit this file, it will be overwritten on update

SUBSYSTEM!="block", GOTO="btrfs_end"
ACTION=="remove", GOTO="btrfs_end"
ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"

# let the kernel know about this btrfs filesystem, and check if it is complete
IMPORT{builtin}="btrfs ready $devnode"

# mark the device as not ready to be used by the system
ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

LABEL="btrfs_end"


How this works with raid:

RAID assembly is separate from filesystem mount. The volume UUID isn't 
available until the RAID is successfully assembled. 

On at least Fedora (dracut) systems with the system root on an md device, the 
initramfs contains 30-parse-md.sh which includes a loop to check for the volume 
UUID. If it's not found, the script sleeps for 0.5 seconds, and then looks for 
it again, up to 240 times. If it's still not found at attempt 240, then the 
script executes mdadm -R to forcibly run the array with fewer than all devices 
present (degraded assembly). Now the volume UUID exists, udevd creates the 
linkage, systemd picks this up and changes device state from dead to plugged, 
and then executes a normal mount command.
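
The shape of that loop, roughly (paraphrased from memory, not the
verbatim dracut script; the md device name is a placeholder):

    n=0
    while [ ! -e "/dev/disk/by-uuid/$UUID" ] && [ $n -lt 240 ]; do
        sleep 0.5
        n=$((n + 1))
    done
    # still missing after ~120 seconds: force-run the array degraded
    [ -e "/dev/disk/by-uuid/$UUID" ] || mdadm -R /dev/mdX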

The approximate Btrfs equivalent down the road would be a similar initrd 
script, or maybe a user space daemon, that causes btrfs device ready to 
confirm/deny all devices are present. And after x number of failures, it'd 
issue an equivalent to mdadm -R, which right now we don't seem to have. 

That equivalent might be a decoupling of degraded as a mount option, such that 
the user space tool deals with degradedness, and the mount command remains a 
normal mount command (without the degraded option). For example, something like 
btrfs filesystem allowdegraded -U uuid would cause some logic to 
confirm/deny that degraded mounting is even possible, such as having the 
minimum number of devices available. If it succeeds, then btrfs device ready 
will report all devices are in fact present, enabling udevd to create the links 
by volume uuid, which then allows systemd to trigger a normal mount command. 
Further, the btrfs allowdegraded command would set appropriate metadata on the 
file system such that a normal mount command will succeed.

Or something
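
To illustrate the flow I'm imagining (allowdegraded is hypothetical and
doesn't exist today; the UUID is from the example above):

# btrfs filesystem allowdegraded -U 9ff63135-ce42-4447-a6de-d7c9b4fb6d66
#   checks the minimum device count, marks the fs as degraded-mountable
# btrfs device ready /dev/sda3
#   now reports ready, so udevd creates the by-uuid links
# mount /dev/disk/by-uuid/9ff63135-ce42-4447-a6de-d7c9b4fb6d66 /sysroot
#   a plain mount command, no degraded option needed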