I spent the day recovering from a Gentoo upgrade, and thought I'd document
the experience in case it helps someone else.

I'm running a custom kernel 2.6.25-gentoo-r7 on amd64, though I don't think
the rarer hardware is relevant.

I tend to put off upgrading my Gentoo box because anytime I do, something
breaks.  I'm afraid I haven't changed my opinion about that.  Anyway, I did
"emerge --update --deep world" and plugged my ears. Some 600-odd packages
(and a few simpler problems) later, the system seemed to be doing okay.  So
I thought I'd see if it could survive a reboot.  No, it couldn't.

On boot it failed checking the root file system and dropped into the repair
shell.  The reason the fsck failed is that the root device file, /dev/md0,
didn't exist.  The root file system was actually fine, though.
Inside the repair shell, I could see all the files from my root, but there
wasn't much in /dev.  (I have the md stuff compiled in to the kernel, and
don't use an initrd, so it wasn't an initrd problem.)

*Short Solution*

The problem was with udev, the facility which automatically populates the
/dev directory.  During the upgrade, emerge noted that my kernel version was
a bit early, but acceptable.  What was missing, apparently, was the signalfd
syscall, which that kernel version either doesn't have or I hadn't
configured.  Apparently, udev has only started using signalfd recently, so
the solution was to downgrade to an older version of udev (udev-141 to be
precise).
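
For reference, the downgrade can be sketched roughly as follows.  This
assumes the package atom is sys-fs/udev, that udev-141 is still available in
your tree, and that /etc/portage/package.mask is a plain file rather than a
directory.  The helper name mask_newer_udev is made up for this example.

```shell
# Append a mask for every udev newer than 141, unless it is already there.
# The mask file is an argument so the function can be tried on a scratch
# file first; the default path is an assumption about your setup.
mask_newer_udev() {
    mask_file="${1:-/etc/portage/package.mask}"
    atom=">sys-fs/udev-141"
    grep -qxF -- "$atom" "$mask_file" 2>/dev/null || echo "$atom" >> "$mask_file"
}

# Then, as root:
#   mask_newer_udev
#   emerge --oneshot "=sys-fs/udev-141"
```

Masking first keeps the next world update from pulling the newer udev
straight back in.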

*What I Actually Did To Get There*

Of course, I didn't know that at first.  Just had a fun unbootable system.
I might have been able to simply emerge the downgrade from the repair shell
(the network did come up), but I didn't know to try that yet.  I figured I
wanted to find some way to make the system boot.  Since the failing file
check is done from /etc/init.d/checkroot, I added a mknod command to create
the device node before trying to run the file check.  At the start of the
start() method:

        if [ ! -e /dev/md0 ] ; then
           mknod -m 0660 /dev/md0 b 9 0
        fi

It's a hack, not a solution, but it did make the system boot, albeit into a
rather crippled state.  Since there were a lot of devices missing, a lot of
services wouldn't start.  (If you're using a more boring root partition, it
might be something like "mknod -m 0660 /dev/sda1 b 8 1".)
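
If you don't know the major and minor numbers for your root device, the
kernel lists them in /proc/partitions.  Here is a small sketch that turns
that table into the matching mknod command.  It assumes /proc is mounted and
awk is usable in the repair shell; mknod_cmd_for is a name invented for this
example.

```shell
# Print the mknod command for a block device listed in a
# /proc/partitions-style table (columns: major minor #blocks name).
# The table path is an argument so a saved copy can be used instead.
mknod_cmd_for() {
    name="$1"
    table="${2:-/proc/partitions}"
    awk -v n="$name" \
        '$4 == n { printf "mknod -m 0660 /dev/%s b %s %s\n", n, $1, $2 }' \
        "$table"
}

# Example: mknod_cmd_for md0
```

It only prints the command, so you can look it over before running it.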

So I had managed by now to gather that udev wasn't working, but I didn't
know why.  My first thought was to try "/etc/init.d/udev start", to see if
it would start.  But it told me that the script is written for baselayout-2,
and I shouldn't use it on baselayout-1.  Following a bit of googling about
what the heck a baselayout is, I gathered that I was using baselayout-1, and
so the service wasn't supposed to be started that way.  So it wasn't a bug
that it wouldn't start that way.  Another page suggested trying to run it
directly, with "/sbin/udevd --daemon", which gave the message "error getting
signalfd".  That told me why it didn't start. This message was also in the
logs, but for some reason I didn't look there until later.
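
One way to test the "not configured" theory is to look for the signalfd
option in the kernel configuration.  The sketch below assumes the option is
called CONFIG_SIGNALFD and that the configuration of the running kernel is in
/usr/src/linux/.config; both are assumptions about your setup, and
has_signalfd is a made-up helper name.

```shell
# Succeed if the given kernel config file enables signalfd support.
# The default path assumes /usr/src/linux points at the running kernel's
# source tree.
has_signalfd() {
    config="${1:-/usr/src/linux/.config}"
    grep -q '^CONFIG_SIGNALFD=y' "$config"
}

# Example:
#   has_signalfd && echo "signalfd enabled" || echo "signalfd missing"
```

If the option is missing, rebuilding the kernel with it turned on may be an
alternative to downgrading udev, though I haven't tried that.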

So back to Google, and I found a message on a Debian board noting that udev
had started using signalfd recently.  This suggested an old version might do
the trick.  I tried one, and it did.
