The return of devfs

By Jake Edge
May 6, 2009

The drive for faster boot times has led to a number of changes in the kernel. Some, like the parallelization of USB initialization we looked at last week, have caused disruptions for some users. But others, like the recently proposed devtmpfs, have a different set of challenges. While it may provide a good solution to reducing boot times, devtmpfs faces some fairly stiff resistance, at least partially because it reminds some folks of a feature previously excised from the kernel, namely devfs.

The basic idea is to create a tmpfs early in the kernel initialization before the driver core has initialized. Then, as each device registers with the driver core, its major and minor numbers and device name can be used to create an entry in that filesystem. Eventually, the root filesystem will be mounted and the populated tmpfs can be mounted at /dev.
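As a rough illustration of that flow, here is a toy user-space model in Python. It is not the kernel implementation: the class and method names (EarlyDevFs, register, mount) are hypothetical, and the "filesystem" is just a dictionary. It only shows the ordering described above: nodes accumulate as devices register, and the populated tree becomes visible once it is mounted at /dev.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One device node: the kind of device plus its major/minor numbers."""
    kind: str   # "char" or "block"
    major: int
    minor: int


class EarlyDevFs:
    """Toy stand-in for the early tmpfs; the name is hypothetical."""

    def __init__(self):
        self.nodes = {}          # device name -> Node
        self.mountpoint = None   # nothing is visible until mount() is called

    def register(self, name, kind, major, minor):
        # Called as each device registers with the (simulated) driver core.
        self.nodes[name] = Node(kind, major, minor)

    def mount(self, where):
        # Once the root filesystem exists, expose the populated tree there.
        self.mountpoint = where
        return {f"{where}/{name}": node for name, node in sorted(self.nodes.items())}


fs = EarlyDevFs()
fs.register("null", "char", 1, 3)    # /dev/null is character device 1:3
fs.register("sda", "block", 8, 0)    # first SCSI disk is block device 8:0
dev = fs.mount("/dev")
for path, node in dev.items():
    print(path, node.kind, f"{node.major}:{node.minor}")
```

The point of the sketch is that nothing in it depends on a user-space daemon: the table is complete by the time anything could look at /dev.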

This has a number of benefits, all of which derive from the fact that no user-space support is required to have a working /dev directory. With the current udev-based approach, there is a need for a reasonably functional user-space environment for udev to operate in. For simplified booting scenarios—like rescue tools or using the init=/bin/sh kernel boot parameter—a functional /dev directory is needed, in particular because of dynamic device numbers. It would also be useful for embedded devices that do not need or want a full-featured user space.

Andrew Morton's immediate reaction was amusement: "Lol, devfs." Greg Kroah-Hartman, who authored the patch along with Kay Sievers and Jan Blunck, admitted that it was a kind of devfs: "Well, devfs 'done right' with hopefully none of the vfs problems the last devfs had. :)" But Morton is somewhat concerned that "devfs2", as he calls it, is just going over old ground:

I think Adam Richter's devfs rewrite (which, iirc, was tmpfs-based) would have fixed up these things. But it was never quite completed and came when minds were already made up.

I don't understand why we need devfs2, really. What problems are people having with [the] existing design?

Though the other advantages are important, Kroah-Hartman replied with the crux of the argument for devtmpfs:

Boot speed, boot speed, boot speed.

Oh, and reduction in complexity in init scripts, and saving embedded systems a lot of effort to implement a dynamic /dev properly (have you _seen_ what Android does to keep from having to ship udev? It's horrible...)

But Alan Cox is not so sure. His argument is that moving this functionality (back) into the kernel just papers over a user-space problem, while increasing the use of kernel memory, which is not pageable. Others think that the kernel should just buffer uevents (the messages the kernel generates to tell udev about device state changes) until udevd is started. But that doesn't solve the synchronization problem: user space must still wait for a populated /dev hierarchy.

A problem with the current scheme is that it essentially does the device enumeration twice—once in the kernel as devices are registered and once in user space by udevd, when it gets started. The device information that was gathered by the kernel is lost. When udevd initializes, it walks the /sys directory to find devices, then creates device nodes for them. That can take 1-2 seconds on a complex system—on the order of twice the kernel boot time—but worse still, no other user-space processes can start until this "coldplug" pass has completed. Using devtmpfs, there will be a working /dev that other user-space code can use, so that the udev coldplug pass can be done in parallel.
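The coldplug pass can be pictured with a small Python sketch. It is not udev's code; it only assumes the sysfs convention that a device directory carries a "dev" file containing "major:minor". The function name coldplug_scan is hypothetical, and the demonstration runs against a throwaway directory rather than the real /sys.

```python
import os
import tempfile


def coldplug_scan(sys_root):
    """Walk a sysfs-like tree; return (path, major, minor) for each 'dev' file."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(sys_root):
        if "dev" not in filenames:
            continue
        with open(os.path.join(dirpath, "dev")) as f:
            major, minor = f.read().strip().split(":")
        found.append((os.path.relpath(dirpath, sys_root), int(major), int(minor)))
    return sorted(found)


# Build a tiny fake tree so the sketch runs anywhere, without touching /sys.
with tempfile.TemporaryDirectory() as root:
    for name, numbers in {"sda": "8:0", "null": "1:3"}.items():
        os.makedirs(os.path.join(root, name))
        with open(os.path.join(root, name, "dev"), "w") as f:
            f.write(numbers + "\n")
    devices = coldplug_scan(root)

print(devices)
```

Everything this walk rediscovers was already known to the kernel at registration time, which is the duplication the paragraph above describes.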

Several alternate methods of solving the problem were proposed in the thread, but, by and large, Sievers was able to show why they didn't actually solve the problem. In some cases, the behavior of devfs is being incorrectly attributed to devtmpfs, but the two are quite different. The new scheme would create root-owned device nodes, with fixed 0600 permissions, for each device. It would avoid much of the complexity of devfs. As Sievers puts it:
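To make "device node with fixed 0600 permissions" concrete, the following Python sketch computes the arguments that a mknod(2) call would need. The helper name node_spec is hypothetical, and this is a user-space illustration rather than what the kernel patch does internally. Actually creating a node requires root privileges, so that call is left commented out.

```python
import os
import stat


def node_spec(major, minor, block=False):
    """Return the (mode, device) pair mknod(2) would need for a 0600 node."""
    file_type = stat.S_IFBLK if block else stat.S_IFCHR
    return file_type | 0o600, os.makedev(major, minor)


mode, dev = node_spec(8, 0, block=True)
print(stat.filemode(mode), os.major(dev), os.minor(dev))

# Creating the node needs root, so it is not executed here.
# The path below is only an example:
# os.mknod("/tmp/sda-demo", mode, dev)
```

A node created by root this way is owned by root; any other ownership, group, or permission policy is what udev would apply afterward.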

We are not implementing anything crazy here like devfs did, including the later versions - there is no modprobe behind your back, no lookup hooks, no stupid new naming scheme, no new filesystem type to register.

Christoph Hellwig objected to the proposal as well. Part of his complaint is how quickly devtmpfs was added to the linux-next tree, but he also sees it as adding devfs back into the kernel:

It basically does re-introduce devfs under a different name, and from looking at the implementation it might not be quite as bad a Gooch's original, but it's certainly worse than Adam Richters rewrite the we never ended up merging.

Now we might want to revisit the decision to leave all the device name handling to a userspace daemon, because it [proved] to be quite fragile under certain circumstances, and you apparently see performance issues.

Sievers outlines the differences between devtmpfs and Adam Richter's proposal from 2003. It mostly boils down to complexity; devtmpfs is a much simpler scheme, which really adds very little to the kernel. The implementation is around 300 lines of code, in comparison to roughly 3600 for devfs and 600 for an early version of Richter's mini-devfs.

Anticipating the next complaint, Sievers also points out that the device naming policy is already in the kernel, but that udev can override the kernel-supplied values if need be. From his perspective this has already occurred, making that an invalid argument against devtmpfs:

The kernel carries the policy today for 98% of the devices, if you change any driver given name, it will no longer show up in /dev with the current name. That's the reality since years, and will not be different anytime soon, there is no real naming policy besides the current kernel supplied names.

It is clear that the devtmpfs developers have put a fair amount of thought into just what was needed, and how it could work with existing code—both inside and outside the kernel. It is also clear that there is some resistance to returning to anything even remotely reminiscent of devfs. Because devtmpfs is really quite different, and has a nice effect on boot speed, one would think that it is likely to find its way into the mainline sooner or later. If no further objections are raised, and the linux-next trials go well, 2.6.31 may very well be the release that sees the inclusion of devtmpfs.

The return of devfs

Posted May 7, 2009 1:36 UTC (Thu) by arjan (subscriber, #36785) [Link]

This devtmpfs is not needed to boot fast. Really.

This is basically a workaround for a certain distro's crappy mkinitrd, and nothing more; if you do the initrd correctly (or if you don't use an initrd at all), you don't need this "solution" and you'll even boot faster....

The return of devfs

Posted May 7, 2009 2:29 UTC (Thu) by foom (subscriber, #14868) [Link]

So, how come nobody has simply done a port of Debian's initramfs-tools to Fedora and SuSE so we
can be done with this whole mess of bad initramfs implementations? I've not played with any of
them extensively, but from what I've read it sounds like everyone else's initramfs implementations
are pretty terrible, and Debian's is flexible enough to be used by any distro already.

If that's all true, it ought not be much work to port it and demonstrate its superiority and people
can stop wasting time on this stuff, no?

The return of devfs

Posted May 7, 2009 5:42 UTC (Thu) by niner (subscriber, #26151) [Link]

Now I'm curious: what does Debian do right, while others do it wrong?

The return of devfs

Posted May 8, 2009 0:57 UTC (Fri) by drag (subscriber, #31333) [Link]

The only thing that Debian does right that I haven't seen in other distros is that Debian's
initramfs is easier for end users to hack with.

A) It's documented

B) Tools to do things like rebuild the initramfs are easily accessible and documented.

C) The initrd scripts are init-like. They are modular, commented, and occupy a directory
structure in /usr/share/initramfs/ that makes sense according to their purpose and what part
of the boot process they are executed in.

D) They have examples on scripts you can make on your own.

E) If you install packages that may impact boot-up procedures, like maybe some hardware,
like LVM, then hooks are added to that directory structure. If you do not have packages for
LVM then support is not in the initramfs regardless.

F) It's very easy to add busybox support so that you can do things like shove a 'sh' into the
boot procedure and get a shell to troubleshoot your scripts.

G) It's documented

H) The scripts have examples and decent comments.

It's certainly not perfect and there are better ways to do stuff. But most of the time dealing
with initrd scripts they are just monolithic and very unfriendly.

With an evening's worth of effort I've done things like devise a means to reliably network boot
the system using iSCSI software initiator. I've created layered root file systems using
compressed read-only file systems and things like AUFS, which I used on my EEEPC for a
number of months.

I did all this using my own hooks and add-on scripts that didn't affect the existing Debian
scripts and didn't break when the system was updated.

Sure it ends up creating a little 'mini-linux-distro' that gets loaded into RAM and that ends up
making stuff slow as shit, during the initial boot process, but is actually something that an
experienced administrator can usefully use.

Every other distribution I've used always had the most horrid hacks and special-purpose
scripts that supported whatever configurations the installer supported, but were a pain to deal
with and customize.

Oh, and they were all poorly documented.

The return of devfs

Posted May 8, 2009 0:58 UTC (Fri) by drag (subscriber, #31333) [Link]

Oh. And Midori mangles newlines when you try to post comments. Hateful. Hateful. Hateful.

The return of devfs

Posted May 8, 2009 15:15 UTC (Fri) by MarkWilliamson (guest, #30166) [Link]

I'm posting from a KHTML embedded in Akregator and it newline mangles my posts
as well... Yet posting from KHTML in a full Konqueror instance does not. I think
Akregator has some JavaScript restrictions but otherwise it's using the same engine.
Curious.

Got any interesting JavaScript settings in Midori?

The return of devfs

Posted May 8, 2009 7:11 UTC (Fri) by niner (subscriber, #26151) [Link]

FWIW that sounds pretty much like mkinitrd on openSUSE. A directory structure
in /lib/mkinitrd/ containing startup scripts that may be put there by installed packages
(e.g. /lib/mkinitrd/scripts/boot-lvm2.sh is owned by the lvm2 package). Adding busybox
is just adding "busybox" to the feature list parameter of the mkinitrd call. The scripts
(including mkinitrd) are commented (which already saved me once) and the mkinitrd
manpage even contains instructions how to mount the root partition from a rescue
system with necessary bind mounts to be able to execute mkinitrd.

Seems like such a structure is the result of natural evolution :)

The return of devfs

Posted May 7, 2009 9:53 UTC (Thu) by vrfy (subscriber, #13362) [Link]

That's not quite right, distros don't do (needless) crack here. Some use braindead tools, which should not even exist in the first place, but that's a totally different story, which is not touched at all by devtmpfs.

Having static device nodes would be a very dangerous hack for a general purpose distro in the light of dynamic device numbers. You access a /dev name but you can't be sure you are talking to the right device. That's a problem you need to avoid for correctness, not for speed reasons. And we have many subsystems which have dynamic minors only. Even sd* disk nodes can be already, and likely will be dynamic for some systems pretty soon.

/dev needs to be on tmpfs these days for security reasons, because tools mess around here, adding user access control lists to device nodes and creating tons of symlinks, which are only meaningful during the lifetime of a specific device.

Besides simplicity and reliability devtmpfs covers the transition time from the empty mounted tmpfs to the populated /dev. During this time, you can't do much else, but devtmpfs does not have that requirement at all, because /dev always reflects all currently known devices.

The return of devfs

Posted May 7, 2009 10:39 UTC (Thu) by michaeljt (subscriber, #39183) [Link]

mkinitrd seems to be a rather nasty, complex thing though (even if Debian have apparently got it right). Perhaps 600 lines of kernel code is a price worth paying for simplifying things a bit. After all, the idea of an initramfs was (IIRC) to move things out of the kernel that could be done better in userspace. If this can be done more easily in the kernel, why not do it there?

The return of devfs

Posted May 7, 2009 8:00 UTC (Thu) by michaeljt (subscriber, #39183) [Link]

I must admit that I wondered at the time that devfs was deprecated what was so important about device node naming policy that it had to be in the hands of the sysadmin. (Yes, I am not an "old *nix hand" as you will have realised). I would have thought that device-related policy is more useful at a higher level. And while I've always admired microkernels, Linux is not one, and devices are handled in the kernel, so it would make sense to have 600 lines of kernel code for handling the device nodes, rather than all the complex user land stuff there is now. (udev basically duplicates information in sysfs anyway - perhaps this code could be made even smaller by just having a /sys/nodes and linking /dev to that...?) I'm not saying that udev wasn't the right answer at the time, but it might be time to re-examine it.

The return of devfs

Posted May 7, 2009 8:59 UTC (Thu) by nix (subscriber, #2304) [Link]

Device node naming policy is a curious beast.

In some respects it must be invariant because it is hardwired into
binaries: some names are even in POSIX (/dev/null, /dev/full).
Without /dev/zero nothing will work, because the dynamic linker uses it.
This may as well go into the kernel, because nobody can ever change the
name without breaking things. Changing this sort of name is why devfs's
new /dev layout was so hard to adapt to.

In some respects it must be tunable by the local admin: only the admin
knows what groups and permissions she wants on any device, and only the
admin knows what she wants to name the USB flash disks that people plug
in. (These names are generally provided to automounters, or mounted by
hand, so it doesn't matter that their names are unpredictably set by
humans.)

In some respects, all that matters is that the name is *consistent*: e.g.
local fixed disks, which are often referenced in files such as /etc/fstab.
Much of the policy for that may as well go into the kernel, because nobody
really cares *what* the name is as long as it doesn't change.

The return of devfs

Posted May 7, 2009 9:13 UTC (Thu) by michaeljt (subscriber, #39183) [Link]

I see the bit about the permissions. But what difference does it make what the device node for a flash drive is called, as long as the name fulfils certain criteria (i.e. well known, reasonably unique to the drive, etc). Policy can be applied to the mount point name :)

The return of devfs

Posted May 7, 2009 9:29 UTC (Thu) by nix (subscriber, #2304) [Link]

Yes indeed: that was covered by my second case. But *not all devices are
like that*.

The return of devfs

Posted May 7, 2009 9:35 UTC (Thu) by michaeljt (subscriber, #39183) [Link]

But my point is that these days the user of the system will normally not be interacting with the device nodes and their names unless they are doing something relatively close to the hardware, in which case control over name policy may be nice cosmetics but probably not more. In other cases, something will be layered between the device node and the user, and while that something needs to know the name, it doesn't care more than that what it actually is.

The return of devfs

Posted May 7, 2009 10:43 UTC (Thu) by hppnq (subscriber, #14462) [Link]

> But my point is that these days the user of the system will normally not be interacting with the device nodes and their names unless they are doing something relatively close to the hardware

If it's just your own system, then indeed, nobody cares. But think of managing hundreds or thousands of systems and you may see the value of being able to standardize device naming. The precise naming policy is not so interesting, but the fact that there is one that is consistent is really rather important.

The return of devfs

Posted May 7, 2009 12:21 UTC (Thu) by michaeljt (subscriber, #39183) [Link]

Quite agree, which more or less falls under my second point. Again, that doesn't mean that you have to be able to choose the names yourself, only that you have to be sure they have been well and predictably chosen. Which is a problem that doesn't necessarily have to be solved over again by every administrator.

The return of devfs

Posted May 7, 2009 9:53 UTC (Thu) by epa (subscriber, #39769) [Link]

> In some respects, all that matters is that the name is *consistent*: e.g. local fixed disks, which are often referenced in files such as /etc/fstab.

Often nowadays fstab refers to devices by their disk label, because the device names (sda1, sdb1, etc) might jump around depending on what's plugged in where. Which makes you wonder why the disk label names have to be checked in some magic way, rather than just being exported by the kernel under /dev/disklabel/xxx.

The return of devfs

Posted May 7, 2009 22:38 UTC (Thu) by nix (subscriber, #2304) [Link]

Having the kernel check disk labels at the time it probes for the devices
might work, but wouldn't work if the filesystem on the device was modular
and the module wasn't yet loaded.

(Also it's yet *more* nonswappable kernel code for a job very easily done
by userspace.)

The return of devfs

Posted Oct 5, 2009 7:47 UTC (Mon) by cmccabe (subscriber, #60281) [Link]

> Which makes you wonder why the disk label names have to be checked in some
> magic way, rather than just being exported by the kernel under
> /dev/disklabel/xxx.

If you're running a non-ancient system with udev, you can find all the disks by label under:

/dev/disk/by-label/

Actually, come to think of it, I'm not sure why the LABEL=foo hack was ever implemented. There must be a reason but I don't know it.

Colin
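For readers who want to see what that directory provides, here is a small Python sketch. The entries under /dev/disk/by-label/ are symbolic links named after the label and pointing at the real device node, so listing them is enough to build a label-to-device map. The function name disks_by_label is hypothetical, and the demonstration uses a temporary directory with a fake "sda1" file so it runs on any system.

```python
import os
import tempfile


def disks_by_label(by_label_dir="/dev/disk/by-label"):
    """Map each label (symlink name) to the device node it points at."""
    if not os.path.isdir(by_label_dir):
        return {}   # no udev-managed directory on this system
    return {
        name: os.path.realpath(os.path.join(by_label_dir, name))
        for name in sorted(os.listdir(by_label_dir))
    }


# Demonstration against a fake layout: by-label/root -> ../sda1
with tempfile.TemporaryDirectory() as tmp:
    tmp = os.path.realpath(tmp)
    target = os.path.join(tmp, "sda1")
    open(target, "w").close()
    os.mkdir(os.path.join(tmp, "by-label"))
    os.symlink("../sda1", os.path.join(tmp, "by-label", "root"))
    demo = disks_by_label(os.path.join(tmp, "by-label"))
    expected = {"root": target}

print(demo == expected)
```

On a real system, calling disks_by_label() with no argument reads the udev-managed directory directly.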

The return of devfs

Posted Oct 6, 2009 0:07 UTC (Tue) by nix (subscriber, #2304) [Link]

Firstly, LABEL= predates udev. Secondly, LABEL= works even before udev has
initialized. Thirdly, it works on systems without udev. (udev and mount
don't yet use the same probing library, but this is changing, I
understand.)

The return of devfs

Posted Oct 18, 2009 22:43 UTC (Sun) by cmccabe (subscriber, #60281) [Link]

Ah, that makes sense. Thanks.

The return of devfs

Posted Apr 2, 2010 11:15 UTC (Fri) by skitching (subscriber, #36856) [Link]

Hi nix,

You mention in your comment that "root=LABEL=" works even when there is no udev.

Does this also apply to "root=UUID="?

And would you be able to give me a hint about where to find that code? All the relevant rootfs mounting bits appear to be in init/do_mounts.c, but I can't find anything related to handling LABEL or UUID. Function name_to_dev_t maps things like "root=08:05", "root=0x0805", "root=/dev/sda5" to a (major,minor) but I can't see UUID/LABEL handling anywhere.

Thanks, Simon

The return of devfs

Posted Apr 2, 2010 12:09 UTC (Fri) by hppnq (subscriber, #14462) [Link]

I'm sure nix will have more useful information, but you could take a look at libblkid in the meantime. This is used by mount to find devices identified by label or uuid (check the util-linux source).

The return of devfs

Posted Apr 3, 2010 0:50 UTC (Sat) by nix (subscriber, #2304) [Link]

I never mentioned root=. For everything *other* than root filesystem
mounting, the code to look at is libblkid, which supports both LABEL= and
UUID=. That's what mount(8) uses.

Raw non-initramfs/initrd kernel root= parameterwise, all that is supported
is /dev/{fake device name}, root={major:minor}, and
root={major*256+minor}. This is of course yet another reason to use an
initramfs, which can easily use normal mount and/or blkid directly to
support whatever permutations of LABEL=, UUID= or whatever you prefer.
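The two numeric forms can be sketched in a few lines of Python. This is only an illustration of the encoding described in this thread, not the kernel's name_to_dev_t: the function name parse_root is hypothetical, the single-number form is assumed to be hexadecimal (as in the "0x0805" example earlier in the thread), and the major*256+minor split assumes the old 8-bit minor layout.

```python
def parse_root(value):
    """Turn a numeric root= value into (major, minor); None for anything else."""
    try:
        if ":" in value:
            # root={major:minor}, decimal on both sides
            major, minor = (int(part) for part in value.split(":"))
            return major, minor
        # root={major*256+minor}, assumed to be written in hexadecimal
        number = int(value, 16)
    except ValueError:
        return None   # names and LABEL=/UUID= forms are not handled here
    return number // 256, number % 256


for example in ("8:5", "0x0805", "/dev/sda5", "LABEL=root"):
    print(example, "->", parse_root(example))
```

Anything that comes back as None in this sketch is the kind of value that an initramfs would resolve with mount or blkid instead.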

The return of devfs

Posted May 7, 2009 9:50 UTC (Thu) by epa (subscriber, #39769) [Link]

Me too. Why is it considered 'policy' (bad! bad!) for the kernel to specify the name of the device file, but perfectly okay (in classical UNIX terms) to have fixed major and minor device numbers? Surely the major and minor numbers are just as much policy as the name. If users really want their own names they can easily say 'ln -s /dev/kernel_name /dev/my_weird_name', just as easily as 'mknod /dev/my_weird_name b major minor'.

The return of devfs

Posted May 8, 2009 16:14 UTC (Fri) by rmini (subscriber, #4991) [Link]

You can't set ownership or permissions with symlinks, where you can with device nodes. That's probably the more important part of policy than symlinks. Additionally, some devices are difficult to name persistently.

The return of devfs

Posted May 7, 2009 11:36 UTC (Thu) by Quazatron (subscriber, #4368) [Link]

Looks like a classic case of a nice piece of software with a badly chosen name.

The return of devfs

Posted May 7, 2009 16:20 UTC (Thu) by iabervon (subscriber, #722) [Link]

In particular, it seems odd to me to call something "*fs" when it is not, in fact, a filesystem. This seems to be the kernel mounting a filesystem and creating some device nodes on it. Surely this is better than the hack currently used to parse the "root=/dev/sda1" option, which takes something that looks like a path and comes up with a dev_t without anything resembling a filesystem lookup, and therefore has nothing to do with anything that could be seen by userspace.

The return of devfs

Posted May 8, 2009 11:21 UTC (Fri) by meyert (subscriber, #32097) [Link]

"Boot speed, boot speed, boot speed."

I don't understand this. Can embedded systems not just use a static /dev tree? Isn't this true for the Android platform, which is intended for embedded use?

The return of devfs

Posted May 8, 2009 11:24 UTC (Fri) by pli (subscriber, #45060) [Link]

The day devfs was removed from mainline was a sad day. Yes, the devfs implementation had its problems, some major ones, but those could have been fixed (and were fixed by e.g. mini-devfs, which was never merged). The whole thing about letting the kernel dynamically manage /dev is a sound, elegant and proven strategy. The udev debacle has been a mess from day one and now they want to save their crappy idea by re-introducing a semi-half-devfs that is in fact a udev-helper-devfs, with some terribly odd and confusing device node life-cycle handling. Part of me is happy though, because the udev people realize that this should be done in the kernel, but I'm worried that this is just a continuation of the udev madness.

I hope the submission of devtmpfs can restart the general devfs-discussion and lead to the implementation of a real and proper devfs (e.g. let's start from mini-devfs) so that we once and for all can leave udev behind us.

The return of devfs

Posted May 8, 2009 13:35 UTC (Fri) by incase (subscriber, #37115) [Link]

Oh well, sure udev has quite a few problems, but in my experience, it provides a much cleaner interface to device node (or symlink to device node) creation than hotplug ever did.
And I have quite a number of places where I need it:
Give serial USB devices consistent names, even when plugged in later, in another order or whatever.
Give different usb thumb drives consistent device names (as they mount at different places and are access restricted to specific users) etc...

I never managed to achieve this consistently with the old devfs/hotplug combo, especially since with that combo it made a difference when a device was connected (before or after hotplug was first run).

However, this basically concerns only the hotplug functions of udev, not the coldplug functions in their current full extent. This basically means I'm all for devtmpfs as a helper to make the device nodes that udev coldplug would create available before it has finished running, so that other userspace processes could already start before udev coldplug has finished.

regards,
Sven

The return of devfs

Posted May 8, 2009 14:21 UTC (Fri) by nix (subscriber, #2304) [Link]

Please no. devfs was an abomination.

The lifetime rules of objects in devtmpfs are simple: the nodes are
created by the kernel, and that's all. It's just a bunch of automatic
mknod()s. Everything else is handled by udev, as now.

The return of devfs

Posted May 11, 2009 4:01 UTC (Mon) by Kamilion (subscriber, #42576) [Link]

Sounds like a reasonable idea to me.
Mount a tmpfs on /dev, populate it with a few of the most meaningful device
nodes to get up and running enough to be able to load udev from somewhere or
just fall into a basic busybox shell.

I'm familiar with this fallback behavior in ubuntu, when their initramfs has
a heart attack on some esoteric hardware I have.

Most of the other major unixes that are still left have a kernel managed
dynamic device filesystem.

It seems like a good enough idea to me to have everything necessary to
bootstrap the most basic twenty or so system device nodes that nobody ever
renames strictly from 600 lines of code added to a kernel image.
