Bug#791794: RAID device not active during boot
When all disks are available during boot the system starts without problems:

  ls -l /dev/disk/by-uuid/
  total 0
  lrwxrwxrwx 1 root root 10 Jul 13 18:15 2138f67e-7b9e-4960-80d3-2ac2ce31d882 -> ../../sdc2
  lrwxrwxrwx 1 root root 10 Jul 13 18:15 21a660eb-729d-48fe-b9e3-140ae0ee79f4 -> ../../sdd2
  lrwxrwxrwx 1 root root  9 Jul 13 18:15 c4263f89-eb0c-4372-90ae-ce1a1545613e -> ../../md0
  lrwxrwxrwx 1 root root 10 Jul 13 18:15 cbeaebcb-2c55-48c0-b6bd-d5e8a5c4ac06 -> ../../sdb2
  lrwxrwxrwx 1 root root 10 Jul 13 18:15 ff2bae51-c5b8-41e3-855b-68ee57b61c0c -> ../../sda2

When starting the system with only two (instead of four) disks I'm dropped
into the emergency shell with the following error message:

  ALERT! /dev/disk/by-uuid/c4263f89-eb0c-4372-90ae-ce1a1545613e does not exist.
  Dropping to a shell!

... which seems to be consistent with the fact that the UUID for /dev/md0 is
not available ...

  (initramfs) ls -l /dev/disk/by-uuid/
  total 0
  lrwxrwxrwx 1 0 0 10 Jul 13 15:20 cbeaebcb-2c55-48c0-b6bd-d5e8a5c4ac06 -> ../../sdb2
  lrwxrwxrwx 1 0 0 10 Jul 13 15:20 ff2bae51-c5b8-41e3-855b-68ee57b61c0c -> ../../sda2

... which in turn is caused by the RAID device itself being inactive at that time:

  (initramfs) cat /proc/mdstat
  Personalities :
  md0 : inactive sdb1[5](S) sda1[6](S)
        39028736 blocks super 1.2

  unused devices: <none>

In order to re-activate /dev/md0 I use the following commands:

  (initramfs) mdadm --stop /dev/md0
  [  178.719551] md: md0 stopped.
  [  178.722463] md: unbind<sdb1>
  [  178.725386] md: export_rdev(sdb1)
  [  178.728804] md: unbind<sda1>
  [  178.731711] md: export_rdev(sda1)
  mdadm: stopped /dev/md0

  (initramfs) mdadm --assemble /dev/md0
  [  214.171191] md: md0 stopped.
  [  214.184471] md: bind<sda1>
  [  214.195838] md: bind<sdb1>
  [  214.218253] md: raid1 personality registered for level 1
  [  214.226156] md/raid1:md0: active with 1 out of 3 mirrors
  [  214.231651] md0: detected capacity change from 0 to 19982581760
  [  214.247893]  md0: unknown partition table
  mdadm: /dev/md0 has been started with 1 drive (out of 3) and 1 spare.

  (initramfs) cat /proc/mdstat
  Personalities : [raid1]
  md0 : active (auto-read-only) raid1 sdb1[5] sda1[6](S)
        19514240 blocks super 1.2 [3/1] [U__]

  unused devices: <none>

... which makes the RAID device available in /dev/disk/by-uuid/:

  (initramfs) ls -l /dev/disk/by-uuid/
  total 0
  lrwxrwxrwx 1 0 0  9 Jul 13 15:24 c4263f89-eb0c-4372-90ae-ce1a1545613e -> ../../md0
  lrwxrwxrwx 1 0 0 10 Jul 13 15:20 cbeaebcb-2c55-48c0-b6bd-d5e8a5c4ac06 -> ../../sdb2
  lrwxrwxrwx 1 0 0 10 Jul 13 15:20 ff2bae51-c5b8-41e3-855b-68ee57b61c0c -> ../../sda2

Now, if I exit the emergency shell the system is able to boot without problems.

In bug report #784070 it is mentioned that with the version of mdadm shipping
with Debian Jessie, the --run parameter seems to be ignored when used in
conjunction with --scan. According to the man page it is supposed to activate
all arrays even if they are degraded. Instead, any arrays that are degraded are
marked as 'inactive', and if the root filesystem is on one of those inactive
arrays, the boot process is halted.

As suggested in the bug report (see message #109) I have changed the file
/usr/share/initramfs-tools/scripts/local-top/mdadm and used the command
update-initramfs -u in order to update /boot/initrd.img-3.16.* (you might first
want to make a copy of this file before updating it).

After reboot the system is able to start even if some disks (out of the RAID
device) are missing (see the boot log from the serial console below):

  ...
  Begin: Running /scripts/init-premount ... done.
  Begin: Mounting root file system ...
  Begin: Running /scripts/local-top ... Begin: Assembling all MD arrays ... [   24.799665] random: nonblocking pool is initialized
  Failure: failed to assemble all arrays.
  done.
  Begin: Assembling all MD arrays ... Warning: failed to assemble all arrays...attempting individual starts
  Begin: attempting mdadm --run md0 ... [   24.883069] md: raid1 personality registered for level 1
  [   24.889111] md/raid1:md0: active with 2 out of 3 mirrors
  [   24.894598] md0: detected capacity change from 0 to 19982581760
  mdadm: started array /dev/md/0
  [   24.908255]  md0: unknown partition table
  Success: started md0
  done.
  done.
  Begin: Running /scripts/local-premount ... done.
  Begin: Checking root file system ... fsck from util-linux 2.25.2
  /dev/md0: clean, 36905/1220608 files, 398026/4878560 blocks
  done.
  ...

Problem solved ...

... and many thanks to Phil.

PS: There is still one thing I do not understand:
The file etc/mdadm/mdadm.conf (within initrd.img.*) contains a UUID (see below) ...

  ARRAY /dev/md/0 metadata=1.2 UUID=92da2301:37626555:6e73a527:3ccc045f name=debian:0 spares=1

... which seems to be different from the output of ls -l /dev/disk/by-uuid:

  lrwxrwxrwx 1 root root 9 Jul 14 11:27 c4263f89-eb0c-4372-90ae-ce1a1545613e -> ../../md0
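The change referred to above amounts to retrying each array individually with
mdadm --run when the scan-based assembly fails, which matches the "attempting
individual starts" lines in the boot log. A rough sketch of that kind of
fallback follows; it is not the exact patch from #784070, message #109, and
the log_*_msg helper names are assumed from initramfs-tools conventions:

  # Sketch only -- the real change is the one posted in #784070, message #109.
  if ! mdadm --assemble --scan --run --auto=yes; then
      log_warning_msg "failed to assemble all arrays...attempting individual starts"
      # walk over the md devices listed in /proc/mdstat and try each one
      for md in $(grep '^md' /proc/mdstat | cut -d ' ' -f 1); do
          log_begin_msg "attempting mdadm --run $md"
          if mdadm --run "/dev/$md"; then
              log_success_msg "started $md"
          else
              log_failure_msg "failed to start $md"
          fi
      done
  fi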
Bug#791794: RAID device not active during boot
On Tue, 2015-07-14 at 13:52 +0200, Peter Nagel wrote:
[...]
> As suggested in the bug report (see message #109)

For others reading along, this is in #784070, not #791794.

> Problem solved ...
> ... and many thanks to Phil.

Huzzah!

> PS: There is still one thing I do not understand:
> The file etc/mdadm/mdadm.conf (within initrd.img.*) contains a UUID (see below) ...
>
>   ARRAY /dev/md/0 metadata=1.2 UUID=92da2301:37626555:6e73a527:3ccc045f name=debian:0 spares=1
>
> ... which seems to be different from the output of ls -l /dev/disk/by-uuid:
>
>   lrwxrwxrwx 1 root root 9 Jul 14 11:27 c4263f89-eb0c-4372-90ae-ce1a1545613e -> ../../md0

I _think_ this is because the former is the UUID of the RAID device, while the
latter is the UUID of the filesystem contained within it.

Ian.
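A quick way to see both identifiers side by side, assuming the array is
assembled (standard mdadm and blkid invocations, shown without their output):

  mdadm --detail /dev/md0 | grep -i uuid   # array UUID, as used in mdadm.conf's ARRAY line
  blkid /dev/md0                           # filesystem UUID, as used in /dev/disk/by-uuid and root=UUID=...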
Bug#791794: RAID device not active during boot
On 11.07.2015 18:40, Philip Hands wrote:
> ... which is what suggests to me that it's been broken by other means -- the
> fact that one can apparently start it by hand tells you that it's basically
> working, so I'd think the described symptoms point strongly towards duff
> mdadm.conf in the initramfs.
>
> N.B. I've not had very much to do with systemd, so am in no sense an expert
> about that, but I've been using software RAID and initrds since almost as
> soon as they were available, and the idea that this would be down to systemd
> does not ring true.

Thanks for pointing this out. Hopefully, someone is able to solve this problem.
Bug#791794: RAID device not active during boot
Peter Nagel <peter.na...@kit.edu> writes:
> On 11.07.2015 18:40, Philip Hands wrote:
>> ... which is what suggests to me that it's been broken by other means -- the
>> fact that one can apparently start it by hand tells you that it's basically
>> working, so I'd think the described symptoms point strongly towards duff
>> mdadm.conf in the initramfs.
>>
>> N.B. I've not had very much to do with systemd, so am in no sense an expert
>> about that, but I've been using software RAID and initrds since almost as
>> soon as they were available, and the idea that this would be down to systemd
>> does not ring true.
>
> Thanks for pointing this out.
> Hopefully, someone is able to solve this problem.

Well, yes -- _you_ can hopefully.

0) (just in case you've not already done so, check all the bits suggested in
   the warning that you quoted initially, about the contents of /proc/... etc.)

1) on the system when booted up, check the current state of your
   /etc/mdadm/mdadm.conf

   Compare it with the output of:

     mdadm --examine --scan

   If there are significant differences (other than the missing disk), then fix them.

2) have a look at your initrd, thus:

     mkdir /tmp/initrd ; cd /tmp/initrd ; zcat /boot/initrd.img-* | cpio -iv --no-absolute-filenames

   (of course, being an ARM thing, you probably have some sort of uInitrd thing
   as well, so I guess it's possible to break things between the initrd.img and
   that, but someone who knows about such things would need to tell you about
   that).

   Anyway, you should have something like this:

     /tmp/initrd$ find . -name mdadm\*
     ./scripts/local-top/mdadm
     ./etc/mdadm
     ./etc/mdadm/mdadm.conf
     ./etc/modprobe.d/mdadm.conf
     ./conf/mdadm
     ./sbin/mdadm

   so, take a look at that lot to see if you can spot what's up.

   As an example, this is what I see on a little amd64 RAID box with Jessie,
   which I have to hand:

     root@linhost-th:/tmp/initrd# cat conf/mdadm
     MD_HOMEHOST='linhost-th'
     MD_DEVS=all

     root@linhost-th:/tmp/initrd# cat etc/mdadm/mdadm.conf
     HOMEHOST <system>
     ARRAY /dev/md/2 metadata=1.2 UUID=00e84ce1:d96de981:375caa64:dac234f9 name=grml:2
     ARRAY /dev/md/3 metadata=1.2 UUID=c9871cb8:46a3dd98:d9505965:5bd7dfe2 name=grml:3

   (I tend to number my md's to match the partitions they sit on, hence the 2 & 3)

3) save a copy of your old initrd.img somewhere, then run:

     update-initramfs -u

   and try a reboot -- if it works, unpack both initrd's in adjacent
   directories, and use diff -ur to spot what changed, and report back here.

4) If it didn't work, once in the emergency shell, try running:

     sh -x /scripts/local-top/mdadm

   and see if you can see why it's not working when starting things by hand does.

5) If that fails to be diagnostic, is there anything hiding in your u-boot
   configuration that might be causing this? (assuming this box has u-boot)

HTH

Cheers, Phil.

P.S. While you have the initrd unpacked, you might want to note that:

  root@linhost-th:/tmp/initrd# grep -r systemd .
  ./init:# Mount /usr only if init is systemd (after reading symlink)
  ./init:if [ ${checktarget##*/} = systemd ] && read_fstab_entry /usr; then
  ./scripts/init-top/udev:/lib/systemd/systemd-udevd --daemon --resolve-names=never
  ./etc/lvm/lvm.conf:# systemd's socket-based service activation or run as an initscripts service
  ./lib/udev/rules.d/63-md-raid-arrays.rules:# Tell systemd to run mdmon for our container, if we need it.
  Binary file ./lib/systemd/systemd-udevd matches
  Binary file ./lib/x86_64-linux-gnu/libselinux.so.1 matches
  Binary file ./bin/kmod matches
  Binary file ./bin/udevadm matches

while the scripts on the initrd image are systemd-aware, its init is actually a
shell script -- so you're running busybox as your init at this point.

Also:

  root@linhost-th:/tmp/initrd# grep -r 'Gave up waiting for' .
  ./scripts/local:    echo Gave up waiting for $2 device. Common problems:

this is the script that's dropping you into the emergency shell.

The thing that starts the shell is the panic() function from scripts/functions
-- I can see that that will do a timed reboot if you've got panic=... on the
kernel command line, but otherwise not. Would you have something like that on
your command line? (as mentioned in the warning you quoted, /proc/cmdline
tells you)

If not, do you perhaps have a hardware watchdog, or some such?

-- 
|)|  Philip Hands  [+44 (0)20 8530 9560]  HANDS.COM Ltd.
|-|  http://www.hands.com/    http://ftp.uk.debian.org/
|(|  Hugo-Klemm-Strasse 34,   21075 Hamburg,   GERMANY
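One possible way to do the before/after comparison from step 3) above; the
image names below are only examples (use the running kernel's initrd and
wherever the backup copy was actually saved):

  mkdir /tmp/initrd-old /tmp/initrd-new
  # unpack the saved copy and the freshly regenerated image side by side
  ( cd /tmp/initrd-old && zcat /boot/initrd.img-$(uname -r).bak | cpio -i --no-absolute-filenames )
  ( cd /tmp/initrd-new && zcat /boot/initrd.img-$(uname -r)     | cpio -i --no-absolute-filenames )
  diff -ur /tmp/initrd-old /tmp/initrd-new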
Bug#791794: RAID device not active during boot
The problem might be related to
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=789152. However, in my case
everything seems to be fine as long as all hard disks (within the RAID) are
working. The problem appears only if, during boot, one (or more) disk(s) out
of the RAID device have a problem.

The problem might be related to the fact that jessie comes with a new init
system which has a stricter handling of failing auto mounts during boot. If it
fails to mount an auto mount, systemd will drop to an emergency shell rather
than continuing the boot - see the release notes (section 5.6.1):
https://www.debian.org/releases/stable/amd64/release-notes/ch-information.en.html#systemd-upgrade-default-init-system

For example: If you have installed your system to a RAID1 device and the
system is faced with a power failure which (perhaps at the same time) damages
one of the hard disks of this RAID1 device, your system will (during boot)
drop to an emergency shell rather than boot from the remaining hard disk(s).

I found that during boot (for some reason) the RAID device is not active
anymore and therefore not available within /dev/disk/by-uuid (which causes the
drop to the emergency shell). A quick fix (to boot the system) would be to
re-activate the RAID device (e.g. /dev/md0) from the emergency shell ...

  mdadm --stop /dev/md0
  mdadm --assemble /dev/md0

... and to exit the shell. Nevertheless, it would be nice if the system would
boot automatically (as it is known to happen under wheezy) in order to be able
to use e.g. a spare disk for data synchronization.
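For anyone else landing in the same emergency shell, the array state can be
checked from the (initramfs) prompt before and after applying the quick fix
described above; the device and partition names here are the ones from this
report and will differ on other systems:

  (initramfs) cat /proc/mdstat            # md0 listed as 'inactive', members marked (S)
  (initramfs) mdadm --examine /dev/sda1   # per-member superblock: array UUID, level, device role
  (initramfs) mdadm --stop /dev/md0
  (initramfs) mdadm --assemble /dev/md0
  (initramfs) cat /proc/mdstat            # md0 should now be active (possibly degraded)
  (initramfs) exit                        # leave the shell and continue the normal boot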
Bug#791794: RAID device not active during boot
On Sat, Jul 11, 2015 at 10:16:15AM +0200, Nagel, Peter (IFP) wrote:
> The problem might be related to
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=789152. However, in my case
> everything seems to be fine as long as all hard disks (within the RAID) are
> working. The problem appears only if, during boot, one (or more) disk(s) out
> of the RAID device have a problem.
>
> The problem might be related to the fact that jessie comes with a new init
> system which has a stricter handling of failing auto mounts during boot. If
> it fails to mount an auto mount, systemd will drop to an emergency shell
> rather than continuing the boot - see the release notes (section 5.6.1):
> https://www.debian.org/releases/stable/amd64/release-notes/ch-information.en.html#systemd-upgrade-default-init-system

Would a temporary work-around be to use another init system?

> For example: If you have installed your system to a RAID1 device and the
> system is faced with a power failure which (perhaps at the same time) damages
> one of the hard disks of this RAID1 device, your system will (during boot)
> drop to an emergency shell rather than boot from the remaining hard disk(s).
>
> I found that during boot (for some reason) the RAID device is not active
> anymore and therefore not available within /dev/disk/by-uuid (which causes
> the drop to the emergency shell). A quick fix (to boot the system) would be
> to re-activate the RAID device (e.g. /dev/md0) from the emergency shell ...
>
>   mdadm --stop /dev/md0
>   mdadm --assemble /dev/md0
>
> ... and to exit the shell. Nevertheless, it would be nice if the system would
> boot automatically (as it is known to happen under wheezy) in order to be
> able to use e.g. a spare disk for data synchronization.

After all, isn't it the whole point of a RAID1 that it can keep going when one
of its hard drives fails?

I currently have this situation on a wheezy system, and it will continue until
I have the replacement physical drive prepared for installation. The RAID1 is
running fine with just one physical drive. It would be seriously inconvenient
to be unable to boot in a straightforward manner.

It's not as if it's being quiet about the matter -- I keep getting emails
telling me that one of the drives is missing.

-- hendrik
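Those warning mails come from mdadm's monitor mode rather than from the boot
scripts; roughly, the relevant pieces look like this (the mail address is a
placeholder, and on Debian the monitor is normally started by the mdadm
service rather than by hand):

  # /etc/mdadm/mdadm.conf -- where the degraded/missing-disk mails are sent
  MAILADDR admin@example.org

  # one-off check that notification mail works (sends a test message per array)
  mdadm --monitor --scan --oneshot --test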
Bug#791794: RAID device not active during boot
Hendrik Boom <hend...@topoi.pooq.com> writes:
> On Sat, Jul 11, 2015 at 10:16:15AM +0200, Nagel, Peter (IFP) wrote:
>> The problem might be related to
>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=789152. However, in my
>> case everything seems to be fine as long as all hard disks (within the RAID)
>> are working. The problem appears only if, during boot, one (or more) disk(s)
>> out of the RAID device have a problem.
>>
>> The problem might be related to the fact that jessie comes with a new init
>> system which has a stricter handling of failing auto mounts during boot. If
>> it fails to mount an auto mount, systemd will drop to an emergency shell
>> rather than continuing the boot - see the release notes (section 5.6.1):
>> https://www.debian.org/releases/stable/amd64/release-notes/ch-information.en.html#systemd-upgrade-default-init-system
>
> Would a temporary work-around be to use another init system?

It seems very unlikely to me that the init system has anything to do with this.

If switching away from systemd fixes it, then that would suggest that the file
system in question is not actually needed for boot, so should be marked as
such in fstab (with nofail or noauto).

If getting past that point (by using sysvinit, which ignores the failure)
results in an operational RAID (presumably running in degraded mode), then
that would suggest that it's only being started by the /etc/init.d/mdadm
script, which would seem to suggest that the scripts in the initramfs are not
doing it, which would normally be a consequence of having something wrong with
/etc/mdadm/mdadm.conf when the initramfs was built.

The underlying problem is the failure to bring up the RAID in the initramfs,
which is before systemd gets involved.

>> For example: If you have installed your system to a RAID1 device and the
>> system is faced with a power failure which (perhaps at the same time)
>> damages one of the hard disks of this RAID1 device, your system will (during
>> boot) drop to an emergency shell rather than boot from the remaining hard
>> disk(s).
>>
>> I found that during boot (for some reason) the RAID device is not active
>> anymore and therefore not available within /dev/disk/by-uuid (which causes
>> the drop to the emergency shell). A quick fix (to boot the system) would be
>> to re-activate the RAID device (e.g. /dev/md0) from the emergency shell ...
>>
>>   mdadm --stop /dev/md0
>>   mdadm --assemble /dev/md0
>>
>> ... and to exit the shell. Nevertheless, it would be nice if the system
>> would boot automatically (as it is known to happen under wheezy) in order to
>> be able to use e.g. a spare disk for data synchronization.
>
> After all, isn't it the whole point of a RAID1 that it can keep going when
> one of its hard drives fails?

Exactly, which is what suggests to me that it's been broken by other means --
the fact that one can apparently start it by hand tells you that it's
basically working, so I'd think the described symptoms point strongly towards
duff mdadm.conf in the initramfs.

N.B. I've not had very much to do with systemd, so am in no sense an expert
about that, but I've been using software RAID and initrds since almost as soon
as they were available, and the idea that this would be down to systemd does
not ring true.

Cheers, Phil.
-- 
|)|  Philip Hands  [+44 (0)20 8530 9560]  HANDS.COM Ltd.
|-|  http://www.hands.com/    http://ftp.uk.debian.org/
|(|  Hugo-Klemm-Strasse 34,   21075 Hamburg,   GERMANY
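To illustrate the fstab marking mentioned above, here is a hypothetical
non-root array entry (device, mount point and filesystem type are made up):

  # /etc/fstab -- 'nofail' keeps systemd from dropping to the emergency shell
  # when this device is missing at boot; 'noauto' would skip mounting it
  # during boot altogether.
  /dev/md1  /srv  ext4  defaults,nofail  0  2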