Bug#791794: RAID device not active during boot

2015-07-14 Thread Peter Nagel


When all disks are available during boot, the system starts without 
problems:


 ls -l /dev/disk/by-uuid/
   total 0
   lrwxrwxrwx 1 root root 10 Jul 13 18:15 2138f67e-7b9e-4960-80d3-2ac2ce31d882 -> ../../sdc2
   lrwxrwxrwx 1 root root 10 Jul 13 18:15 21a660eb-729d-48fe-b9e3-140ae0ee79f4 -> ../../sdd2
   lrwxrwxrwx 1 root root  9 Jul 13 18:15 *c4263f89-eb0c-4372-90ae-ce1a1545613e* -> ../../*md0*
   lrwxrwxrwx 1 root root 10 Jul 13 18:15 cbeaebcb-2c55-48c0-b6bd-d5e8a5c4ac06 -> ../../sdb2
   lrwxrwxrwx 1 root root 10 Jul 13 18:15 ff2bae51-c5b8-41e3-855b-68ee57b61c0c -> ../../sda2



When starting the system with only two (instead of four) disks, I'm 
dropped into the emergency shell with the following error message:


   ALERT!  /dev/disk/by-uuid/*c4263f89-eb0c-4372-90ae-ce1a1545613e* 
does not exist.  Dropping to a shell!


... which seems to be consistent with the fact that the UUID for 
/dev/md0  is not available ...


   (initramfs)  ls -l /dev/disk/by-uuid/
   total 0
   lrwxrwxrwx 1 0 0 10 Jul 13 15:20 cbeaebcb-2c55-48c0-b6bd-d5e8a5c4ac06 -> ../../sdb2
   lrwxrwxrwx 1 0 0 10 Jul 13 15:20 ff2bae51-c5b8-41e3-855b-68ee57b61c0c -> ../../sda2



... which in turn is caused by the RAID device itself being inactive at 
that time:


   (initramfs)  cat /proc/mdstat
   Personalities :
   md0 : *inactive* sdb1[5](S) sda1[6](S)
 39028736 blocks super 1.2

   unused devices: <none>


In order to re-activate  /dev/md0  I use the following commands:

   (initramfs)  mdadm --stop /dev/md0
   [  178.719551] md: md0 stopped.
   [  178.722463] md: unbind<sdb1>
   [  178.725386] md: export_rdev(sdb1)
   [  178.728804] md: unbind<sda1>
   [  178.731711] md: export_rdev(sda1)
   mdadm: stopped /dev/md0

   (initramfs)  mdadm --assemble /dev/md0
   [  214.171191] md: md0 stopped.
   [  214.184471] md: bind<sda1>
   [  214.195838] md: bind<sdb1>
   [  214.218253] md: raid1 personality registered for level 1
   [  214.226156] md/raid1:md0: active with 1 out of 3 mirrors
   [  214.231651] md0: detected capacity change from 0 to 19982581760
   [  214.247893]  md0: unknown partition table
   mdadm: /dev/md0 has been started with 1 drive (out of 3) and 1 spare.

   (initramfs)  cat /proc/mdstat
   Personalities : [raid1]
   md0 : *active* (auto-read-only) raid1 sdb1[5] sda1[6](S)
 19514240 blocks super 1.2 [3/1] [U__]

   unused devices: <none>


... which will make the RAID device available in /dev/disk/by-uuid/

   (initramfs)  ls -l /dev/disk/by-uuid/
   total 0
   lrwxrwxrwx 1 0 0  9 Jul 13 15:24 *c4263f89-eb0c-4372-90ae-ce1a1545613e* -> ../../md0
   lrwxrwxrwx 1 0 0 10 Jul 13 15:20 cbeaebcb-2c55-48c0-b6bd-d5e8a5c4ac06 -> ../../sdb2
   lrwxrwxrwx 1 0 0 10 Jul 13 15:20 ff2bae51-c5b8-41e3-855b-68ee57b61c0c -> ../../sda2



Now, if I  exit  the emergency shell, the system is able to boot without 
problems.


In *bug* report *#784070* it is mentioned that with the version of 
mdadm shipping with Debian Jessie, the --run parameter seems to be 
ignored when used in conjunction with --scan. According to the man page 
it is supposed to activate all arrays even if they are degraded. But 
instead, any arrays that are degraded are marked as 'inactive'. If the 
root filesystem is on one of those inactive arrays, the boot process is 
halted.
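
To illustrate the difference described above (using md0 as in the examples 
here; the exact flags the initramfs script passes may differ from this sketch):

   # assemble and start everything found by scanning; on the Jessie mdadm
   # the --run appears to be ignored for degraded arrays
   mdadm --assemble --scan --run

   # starting a single (assembled but inactive) array explicitly does work
   mdadm --run /dev/md0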


As suggested in the bug report (see message #109), I have changed the 
file  /usr/share/initramfs-tools/scripts/local-top/mdadm  and used the 
command  update-initramfs -u  in order to update /boot/initrd.img-3.16.* 
(you might first want to make a copy of this file before updating).
After reboot the system is able to start even if some disks (out of the 
RAID device) are missing (see bootlog from the serial console below; a 
rough sketch of the script change follows it):


   ...
   Begin: Running /scripts/init-premount ... done.
   Begin: Mounting root file system ... Begin: Running /scripts/local-top ... Begin: Assembling all MD arrays ... [   24.799665] random: nonblocking pool is initialized
   Failure: failed to assemble all arrays.
   done.
   Begin: Assembling all MD arrays ... *Warning: failed to assemble all arrays...attempting individual starts*
   Begin: attempting mdadm --run md0 ... [   24.883069] md: raid1 personality registered for level 1
   [   24.889111] md/raid1:md0: active with 2 out of 3 mirrors
   [   24.894598] md0: detected capacity change from 0 to 19982581760
   mdadm: started array /dev/md/0
   [   24.908255]  md0: unknown partition table
   *Success: started md0*
   done.
   done.
   Begin: Running /scripts/local-premount ... done.
   Begin: Checking root file system ... fsck from util-linux 2.25.2
   /dev/md0: clean, 36905/1220608 files, 398026/4878560 blocks
   done.
   ...
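
Judging from the messages above ("attempting individual starts", "attempting 
mdadm --run md0"), the modified local-top/mdadm script falls back to starting 
each array individually when the scan-based assembly fails. A rough sketch of 
that idea, reconstructed from the log output only (not the actual patch from 
#784070, message #109):

   # sketch only -- the real script, its flags and messages may differ
   if ! mdadm --assemble --scan --run; then
       echo "Warning: failed to assemble all arrays...attempting individual starts"
       for md in /dev/md?*; do
           echo "Begin: attempting mdadm --run $md ..."
           mdadm --run "$md" && echo "Success: started $md"
       done
   fi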


Problem solved ...
... and many thanks to Phil.



PS:
There is still one thing I do not understand:
The file  etc/mdadm/mdadm.conf  (within initrd.img.*) contains a UUID 
(see below) ...


   ARRAY /dev/md/0 metadata=1.2 UUID=92da2301:37626555:6e73a527:3ccc045f name=debian:0
      spares=1

... which seems to be different from the output of  ls -l /dev/disk/by-uuid:

   lrwxrwxrwx 1 root root  9 Jul 14 11:27 c4263f89-eb0c-4372-90ae-ce1a1545613e -> ../../md0

Bug#791794: RAID device not active during boot

2015-07-14 Thread Ian Campbell
On Tue, 2015-07-14 at 13:52 +0200, Peter Nagel wrote:
[...]
 As suggested in the bug report (see message#109)

For others reading along, this is in #784070, not #791794.

 Problem solved ...
 ... and many thanks to Phil.

Huzzah!

 PS:
 There is still one thing I do not understand:
 The file  etc/mdadm/mdadm.conf  (within initrd.img.*) contains a UUID
 (see below) ...
 
ARRAY /dev/md/0 metadata=1.2 UUID=92da2301:37626555:6e73a527:3ccc045f 
 name=debian:0
   spares=1
 
 ... which seems to be different from the output of  ls -l /dev/disk/by-uuid:
 
lrwxrwxrwx 1 root root  9 Jul 14 11:27 c4263f89-eb0c-4372-90ae-ce1a1545613e -> ../../md0

I _think_ this is because the former is the UUID of the RAID device,
while the latter is the UUID of the filesystem contained within it.
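
A quick way to see both UUIDs side by side on the running system (just a 
suggestion, not something from the thread):

   mdadm --detail /dev/md0 | grep -i uuid   # array UUID, as listed in mdadm.conf
   blkid /dev/md0                           # filesystem UUID, as used in /dev/disk/by-uuid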

Ian.





Bug#791794: RAID device not active during boot

2015-07-12 Thread Peter Nagel

On 11.07.2015 18:40, Philip Hands wrote:


... which is what suggests to me that it's been broken by other
means -- the fact that one can apparently start it by hand tells you
that it's basically working, so I'd think the described symptoms point
strongly towards duff mdadm.conf in the initramfs.

N.B. I've not had very much to do with systemd, so am in no sense an
expert about that, but I've been using software raid and initrd's since
almost as soon as they were available, and the idea that this would be
down to systemd does not ring true.


Thanks for pointing this out.
Hopefully, someone is able to solve this problem.







Bug#791794: RAID device not active during boot

2015-07-12 Thread Philip Hands
Peter Nagel peter.na...@kit.edu writes:

 On 11.07.2015 18:40, Philip Hands wrote:

 ... which is what suggests to me that it's been broken by other
 means -- the fact that one can apparently start it by hand tells you
 that it's basically working, so I'd think the described symptoms point
 strongly towards duff mdadm.conf in the initramfs.

 N.B. I've not had very much to do with systemd, so am in no sense an
 expert about that, but I've been using software raid and initrd's since
 almost as soon as they were available, and the idea that this would be
 down to systemd does not ring true.

 Thanks for pointing this out.
 Hopefully, someone is able to solve this problem.

Well, yes -- _you_ can hopefully.

 0) (just in case you've not already done so, check all the bits
 suggested in the warning that you quoted initially, about the
 contents of /proc/... etc.)

 1) on the system when booted up, check the current state of your
/etc/mdadm/mdadm.conf
 
Compare it with the output of:

  mdadm --examine --scan

If there are significant differences (other than the missing disk),
then fix them.
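
   (One way to do that comparison -- just a suggestion; the process
   substitution needs bash:)

   diff <(grep '^ARRAY' /etc/mdadm/mdadm.conf) <(mdadm --examine --scan)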

  2) have a look at your initrd, thus:

mkdir /tmp/initrd ; cd /tmp/initrd ; zcat /boot/initrd.img-* | cpio -iv 
--no-absolute-filenames

(of course, being an ARM thing, you probably have some sort of
uInitrd thing as well, so I guess it's possible to break things
between the initrd.img and that, but someone who knows about such
things would need to tell you about that).

Anyway, you should have something like this:

  /tmp/initrd$ find . -name mdadm\*
  ./scripts/local-top/mdadm
  ./etc/mdadm
  ./etc/mdadm/mdadm.conf
  ./etc/modprobe.d/mdadm.conf
  ./conf/mdadm
  ./sbin/mdadm

so, take a look at that lot to see if you can spot what's up.

As an example, this is what I see on a little amd64 RAID box with
Jessie, which I have to hand:

  root@linhost-th:/tmp/initrd# cat conf/mdadm 
  MD_HOMEHOST='linhost-th'
  MD_DEVS=all
  root@linhost-th:/tmp/initrd# cat etc/mdadm/mdadm.conf 
  HOMEHOST <system>
  ARRAY /dev/md/2  metadata=1.2 UUID=00e84ce1:d96de981:375caa64:dac234f9 
name=grml:2
  ARRAY /dev/md/3  metadata=1.2 UUID=c9871cb8:46a3dd98:d9505965:5bd7dfe2 
name=grml:3

  (I tend to number my md's to match the partitions they sit on,
   hence the 2 & 3)

  3) save a copy of your old initrd.img somewhere, then run: 

 update-initramfs -u

and try a reboot -- if it works, unpack both initrd's in adjacent
directories, and use diff -ur to spot what changed, and report back
here.
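
(Something along these lines, for example -- the path of the saved copy is
just a placeholder:)

   mkdir /tmp/initrd.old /tmp/initrd.new
   (cd /tmp/initrd.old && zcat /root/initrd.img.saved | cpio -i --no-absolute-filenames)
   (cd /tmp/initrd.new && zcat /boot/initrd.img-3.16.* | cpio -i --no-absolute-filenames)
   diff -ur /tmp/initrd.old /tmp/initrd.new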

  4) If it didn't work, once in the emergency shell, try running:

sh -x /scripts/local-top/mdadm

   and see if you can see why it's not working when starting things by
   hand does.

  5) If that fails to be diagnostic, is there anything hiding in your
 uboot configuration that might be causing this? (assuming this box
 has u-boot)

HTH

Cheers, Phil.

P.S. While you have the initrd unpacked, you might want to note that:

  root@linhost-th:/tmp/initrd# grep -r systemd .
  ./init:# Mount /usr only if init is systemd (after reading symlink)
  ./init:if [ ${checktarget##*/} = systemd ] && read_fstab_entry /usr; then
  ./scripts/init-top/udev:/lib/systemd/systemd-udevd --daemon 
--resolve-names=never
  ./etc/lvm/lvm.conf:# systemd's socket-based service activation or run 
as an initscripts service
  ./lib/udev/rules.d/63-md-raid-arrays.rules:# Tell systemd to run mdmon 
for our container, if we need it.
  Binary file ./lib/systemd/systemd-udevd matches
  Binary file ./lib/x86_64-linux-gnu/libselinux.so.1 matches
  Binary file ./bin/kmod matches
  Binary file ./bin/udevadm matches

while the scripts on the initrd image are systemd-aware, its init
is actually a shell script -- so you're running busybox as your init
at this point.
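
(You can confirm that from the unpacked initrd -- the first line of its init
is a shell shebang, e.g.:)

   root@linhost-th:/tmp/initrd# head -n1 ./init
   #!/bin/sh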

Also:

 root@linhost-th:/tmp/initrd# grep -r 'Gave up waiting for' .
 ./scripts/local:   echo "Gave up waiting for $2 device.  Common problems:"

this is the script that's dropping you into the emergency shell.

The thing that starts the shell is the panic() function from
scripts/functions -- I can see that that will do a timed reboot if
you've got panic=... on the kernel command line, but otherwise not.

Would you have something like that on your command line?  (as
mentioned in the warning you quoted, /proc/cmdline tells you)
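
(For example, on the booted system or from the emergency shell:)

   cat /proc/cmdline
   grep -o 'panic=[^ ]*' /proc/cmdline   # prints nothing if no panic= is set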

If not, do you perhaps have a hardware watchdog, or some such?
-- 
|)|  Philip Hands  [+44 (0)20 8530 9560]  HANDS.COM Ltd.
|-|  http://www.hands.com/    http://ftp.uk.debian.org/
|(|  Hugo-Klemm-Strasse 34,   21075 Hamburg,GERMANY




Bug#791794: RAID device not active during boot

2015-07-11 Thread Nagel, Peter (IFP)
The problem might be related to 
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=789152.
However, in my case everything seems to be fine as long as all harddisks 
(within the RAID) are working.
The problem appears only if, during boot, one (or more) disk(s) out of 
the RAID device have a problem.


The problem might be related to the fact that jessie comes with a new 
init system which has a stricter handling of failing auto mounts 
during boot. If it fails to mount an auto mount, systemd will drop to 
an emergency shell rather than continuing the boot - see release-notes 
(section 5.6.1):
https://www.debian.org/releases/stable/amd64/release-notes/ch-information.en.html#systemd-upgrade-default-init-system 



For example:
If you have installed your system to a RAID1 device and the system is 
faced with a power failure which (perhaps at the same time) damages one 
of your harddisks (out of this RAID1 device), your system will (during 
boot) drop to an emergency shell rather than boot from the remaining 
harddisk(s).
I found that during boot (for some reason) the RAID device is not active 
anymore and therefore not available within /dev/disk/by-uuid (which 
causes the drop to the emergency shell).


A quick fix (to boot the system) would be to re-activate the RAID device 
(e.g. /dev/md0) from the emergency shell ...


mdadm --stop /dev/md0
mdadm --assemble /dev/md0

... and to exit the shell.

Nevertheless, it would be nice if the system would boot automatically 
(as it is known to happen under wheezy) in order to be able to use e.g. 
a spare disk for data synchronization.






Bug#791794: RAID device not active during boot

2015-07-11 Thread Hendrik Boom
On Sat, Jul 11, 2015 at 10:16:15AM +0200, Nagel, Peter (IFP) wrote:
 The problem might be related to
 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=789152.
 However, in my case everything seems to be fine as long as all
 harddisks (within the RAID) are working.
 The Problem appears only if during boot one (or more) disk(s) out of
 the RAID device have a problem.
 
 The problem might be related to the fact that jessie comes with a
 new init system which has a stricter handling of failing auto
 mounts during boot. If it fails to mount an auto mount, systemd
 will drop to an emergency shell rather than continuing the boot -
 see release-notes (section 5.6.1):
 https://www.debian.org/releases/stable/amd64/release-notes/ch-information.en.html#systemd-upgrade-default-init-system

Would a temporary work-around be to use another init system?

 
 For example:
 If you have installed your system to a RAID1 device and the system
 is faced with a power failure which (might at the same time) causes
 a damage to one of your harddisks (out of this RAID1 device) your
 system will (during boot) drop to an emergency shell rather than
 boot from the remaining harddisk(s).
 I found that during boot (for some reason) the RAID device is not
 active anymore and therefore not available within /dev/disk/by-uuid
 (what causes the drop to the emergency shell).
 
 A quickfix (to boot the system) would be, to re-activate the RAID
 device (e.g. /dev/md0) from the emergency shell ...
 
 mdadm --stop /dev/md0
 mdadm --assemble /dev/md0
 
 ... and to exit the shell.
 
 Nevertheless, it would be nice if the system would boot
 automatically (as it is known to happen under wheezy) in order to
 be able to use e.g. a spare disk for data synchronization.

After all, isn't it the whole point of a RAID1 that it can keep going when 
one of its hard drives fails?

I currently have this situation on a wheezy system, and it will continue
until I have the replacement physical drive prepared for installation.  The
RAID1 is running fine with just one physical drive.  It would be 
seriously inconvenient to be unable to boot in a straightforward manner.
It's not as if it's being quiet about the matter -- I keep getting 
emails telling me that one of the drives is missing.

-- hendrik





Bug#791794: RAID device not active during boot

2015-07-11 Thread Philip Hands
Hendrik Boom hend...@topoi.pooq.com writes:

 On Sat, Jul 11, 2015 at 10:16:15AM +0200, Nagel, Peter (IFP) wrote:
 The problem might be related to
 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=789152.
 However, in my case everything seems to be fine as long as all
 harddisks (within the RAID) are working.
 The Problem appears only if during boot one (or more) disk(s) out of
 the RAID device have a problem.
 
 The problem might be related to the fact that jessie comes with a
 new init system which has a stricter handling of failing auto
 mounts during boot. If it fails to mount an auto mount, systemd
 will drop to an emergency shell rather than continuing the boot -
 see release-notes (section 5.6.1):
 https://www.debian.org/releases/stable/amd64/release-notes/ch-information.en.html#systemd-upgrade-default-init-system

 Would a temporary work-around be to use another init system?

It seems very unlikely to me that the init system has anything to do
with this.

If switching away from systemd fixes it, then that would
suggest that the file system in question is not actually needed for
boot, so should be marked as such in fstab (with nofail or noauto).
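
For example, an fstab entry for a filesystem that should not block booting
when it is absent (device and mount point are just placeholders):

   /dev/md1  /data  ext4  defaults,nofail  0  2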

If getting past that point (by using sysvinit, which ignores the
failure) results in an operational RAID (presumably running in degraded
mode), then that would suggest that it's only being started by the
/etc/init.d/mdadm script, which would seem to suggest that the scripts
in the initramfs are not doing it, which would normally be a consequence
of having something wrong with /etc/mdadm/mdadm.conf when the initramfs
was built.

The underlying problem is the failure to bring up the raid in the
initramfs, which is before systemd gets involved.
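
(One quick check, without unpacking the whole image, is to compare the copy of
mdadm.conf inside the initramfs with the one on disk -- a sketch only:)

   zcat /boot/initrd.img-$(uname -r) | cpio -i --quiet --to-stdout '*etc/mdadm/mdadm.conf' \
       | diff -u - /etc/mdadm/mdadm.conf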


 
 For example:
 If you have installed your system to a RAID1 device and the system
 is faced with a power failure which (might at the same time) causes
 a damage to one of your harddisks (out of this RAID1 device) your
 system will (during boot) drop to an emergency shell rather than
 boot from the remaining harddisk(s).
 I found that during boot (for some reason) the RAID device is not
 active anymore and therefore not available within /dev/disk/by-uuid
 (what causes the drop to the emergency shell).
 
 A quickfix (to boot the system) would be, to re-activate the RAID
 device (e.g. /dev/md0) from the emergency shell ...
 
 mdadm --stop /dev/md0
 mdadm --assemble /dev/md0
 
 ... and to exit the shell.
 
 Nevertheless, it would be nice if the system would boot
 automatically (as it is known to happen under wheezy) in order to
 be able to use e.g. a spare disk for data synchronization.

 After all, isn't it the whole point of a RAID1 that it can keep going when 
 one of its hard drives fails?

Exactly, which is what suggests to me that it's been broken by other
means -- the fact that one can apparently start it by hand tells you
that it's basically working, so I'd think the described symptoms point
strongly towards duff mdadm.conf in the initramfs.

N.B. I've not had very much to do with systemd, so am in no sense an
expert about that, but I've been using software raid and initrd's since
almost as soon as they were available, and the idea that this would be
down to systemd does not ring true.

Cheers, Phil.
-- 
|)|  Philip Hands  [+44 (0)20 8530 9560]  HANDS.COM Ltd.
|-|  http://www.hands.com/    http://ftp.uk.debian.org/
|(|  Hugo-Klemm-Strasse 34,   21075 Hamburg,GERMANY

