There are a number of infelicities in the way we currently handle the I/O plumbing for devices in the kernel. These include:

  - cloning devices exist but as currently implemented violate layering abstractions;
  - every file system needs to have cutpaste vnode ops tables for device vnodes;
  - the split between block and character devices was never particularly well anchored in reality (e.g. tapes) but has aged poorly since as new classes of devices have appeared;
  - because we don't distinguish between different classes of devices nearly all device ops travel through the system as ioctls;
  - because of the ensuing complexity, dispatching of ioctls is a mess; there are many cases where ioctl handlers match some ioctls specifically and pass anything else on, which introduces opportunities for various kinds of bugs;
  - adding a new device-level operation to cdevsw or bdevsw requires touching every driver, including those it is completely irrelevant to;
  - if you have multiple sets of device nodes on a system (e.g. in chroots) operations that affect the device nodes themselves can behave differently or strangely depending on which copy you touch (we have had multiple generations of hacks to mitigate this, and none have been completely satisfactory);
  - because we don't distinguish device classes in any modern sense of the term they cannot be addressed or reasoned about in system config or things like kauth policies;
  - and probably other things I haven't thought of.

I've been mumbling on and off for a long time about various parts of this problem, and I think it's time to propose a unified architecture for a solution. Note that the changes required are nontrivial and this is not going to happen all at once (or anytime soon); the goal of blathering about it is to try to reach agreement on a place that we want to get to eventually... and also, to smoke out any places where the proposed architecture won't actually work or is inconsistent with what happens on the ground.

There are four major interconnected sets of changes I have in mind to address these problems.

(1) Create explicit device classes. This would be adding a layer of indirection between struct cdevsw/bdevsw and drivers; so e.g. a mouse driver would, instead of declaring a struct cdevsw, declare a struct mouse_dev containing operations on mice, and the cdevsw entry would point at this. For disks, which for historical reasons live in both cdevsw and bdevsw, both entries would point at the same disk_dev.

(2) Abolish ioctl inside the kernel, or at least within the device tree. Given separate device classes, each operation needed can be made its own operation on that device class (that is, a function pointer in struct foo_dev) with the ensuing large increase in clarity about what the operations are and where they need to be implemented. Plus this way we get type safety for the arguments.

(3) Rearrange the way operations dispatch to devices. The traditional model is that opening a device gives you a device vnode, and device ops are dispatched to the device driver by looking up the major number and indirecting through either the cdevsw or bdevsw table, and passing the minor number as an argument. (This throws away all information about how or when the device was opened, which is why cloners needed to do something different.) The proposed method is that device vnodes resolve the identity of the device when first loaded, and at runtime point to the e.g.
struct disk_dev rather than remembering the major number; and for cloners they point to a device instance structure created when the device is opened, which holds the per-instance data the cloner needs. (It's not clear to me right now if only cloners should get instance structures, or if for uniformity it makes sense to allocate them for every driver, or if it should be a property of certain device classes -- the overhead of allocating a handful of extra small structures isn't important, so it's mostly a code complexity question.)

Operations on devices go to the device vnode and are then sent on to the driver directly; the cdevsw and bdevsw tables are used only when devices are first looked up. All ioctls are turned into explicit device-level operations in the device vnode's ioctl op.

(3a) Further down the line it might make sense to make devices _not_ vnodes but instead make them different instances of struct file (either one for all devices or even one for each device class) -- this would move ioctl dispatching up a layer, which would be an improvement. But I don't think this needs to be part of the initial plan.

Plus:

(4) Make device vnodes fs-independent and rearrange how looking them up works. Get rid of the extra ops table for devices that every fs has to have (and also the one for fifos); instead, make fs-level special file vnodes mostly-inert objects that don't support anything much besides getattr and setattr. Then, when namei produces a special file vnode, look up the driver that it references, produce a device vnode for that driver, and return that, with the FS's special file vnode hanging off to the side so it can be used for stat and chown/chmod operations.

Note that in this model the device vnode is itself a mostly-stateless wrapper; it points at a device instance and at a special file vnode and dispatches operations to them, but doesn't do much of anything itself.
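
To make changes (1) through (3) a bit more concrete, here is a minimal userland C sketch of the shape being described. It is an illustration under assumptions, not a proposed kernel interface: only the idea of a struct disk_dev reached via the devsw table comes from the text above. The names devsw_entry, devvnode, dd_getsize, dd_flushcache, the XDISK_* codes, and the toydisk driver are all hypothetical, and instance structures for cloners and the attached special file vnode are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* (1) A device class: typed operations on disks (all op names hypothetical). */
struct disk_dev {
	const char *dd_name;
	int (*dd_getsize)(void *softc, uint64_t *nsectors);
	int (*dd_flushcache)(void *softc);
	void *dd_softc;			/* driver-private state */
};

enum devclass { DEVCLASS_DISK, DEVCLASS_MOUSE };

/* Hypothetical devsw entry: names the class and points at its ops. */
struct devsw_entry {
	enum devclass de_class;
	const void *de_ops;		/* e.g. a struct disk_dev */
};

/* (3) The device vnode remembers the resolved driver, not the major number. */
struct devvnode {
	enum devclass dv_class;
	const struct disk_dev *dv_disk;	/* valid when dv_class == DEVCLASS_DISK */
	int dv_minor;
};

/* Hypothetical ioctl codes standing in for the userland-visible ones. */
#define XDISK_GETSIZE	1
#define XDISK_FLUSH	2

/* Lookup time: the only place the devsw table is consulted. */
int
devvnode_resolve(struct devvnode *dv, const struct devsw_entry *tab,
    size_t ntab, int major, int minor)
{
	if (major < 0 || (size_t)major >= ntab || tab[major].de_ops == NULL)
		return ENXIO;
	if (tab[major].de_class != DEVCLASS_DISK)
		return ENODEV;		/* this sketch only handles disks */
	dv->dv_class = DEVCLASS_DISK;
	dv->dv_disk = tab[major].de_ops;
	dv->dv_minor = minor;
	return 0;
}

/* (2) ioctls become explicit typed operations in the device vnode's ioctl op. */
int
devvnode_ioctl(const struct devvnode *dv, unsigned long cmd, void *data)
{
	assert(dv->dv_class == DEVCLASS_DISK);
	switch (cmd) {
	case XDISK_GETSIZE:
		return dv->dv_disk->dd_getsize(dv->dv_disk->dd_softc, data);
	case XDISK_FLUSH:
		return dv->dv_disk->dd_flushcache(dv->dv_disk->dd_softc);
	default:
		return ENOTTY;		/* unknown codes never reach the driver */
	}
}

/* A toy driver: it declares a struct disk_dev instead of a cdevsw. */
struct toydisk_softc { uint64_t sc_nsectors; int sc_flushes; };

static int
toydisk_getsize(void *v, uint64_t *nsectors)
{
	struct toydisk_softc *sc = v;

	*nsectors = sc->sc_nsectors;
	return 0;
}

static int
toydisk_flushcache(void *v)
{
	struct toydisk_softc *sc = v;

	sc->sc_flushes++;
	return 0;
}

static struct toydisk_softc toydisk_sc = { 2048, 0 };
static const struct disk_dev toydisk_dev = {
	"toydisk", toydisk_getsize, toydisk_flushcache, &toydisk_sc
};

/* Major 0 is the toy disk in this sketch. */
const struct devsw_entry devsw_tab[] = {
	{ DEVCLASS_DISK, &toydisk_dev },
};
```

In this sketch a caller would run devvnode_resolve() once when the vnode is set up and then call devvnode_ioctl() for each request. The driver itself only ever sees typed calls, and the catch-all ENOTTY case lives in exactly one place instead of in every handler.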
This makes multiple special files for the same driver work as expected: all driver state is shared between all opens, but each open is associated with a specific special file and e.g. chmod on it won't affect others.

Note that this set of changes also enables something else I've been talking about occasionally: storing driver names rather than major numbers in device special file inodes. In this world driver lookup happens only at namei time, rather than on every operation, so it's no longer necessary for it to be especially fast and it becomes ok to do it by string search. However, this is a separate matter and isn't part of the plan (and may not even be a good idea) so let's not bikeshed it just yet.

One question is: what device classes does it make sense to materialize? ISTM that anything readily identifiable that there's more than one of is a reasonable candidate (disks, ttys, audio, framebuffers, mice, scanners, etc.) but I think the underlying criterion should be something slightly different. This paper:

   https://www.usenix.org/conference/osdi-04/recovering-device-drivers

observed that manifesting device classes lets you write recovery logic such that if you need to shoot a driver, reset the hw, and restart it you can then restore the driver to the state the rest of the system expects it to be in. We ought to have an implementation of that :-) so I think that should be the basis for thinking about device classes.

Another question is: how do minor numbers work in this world? I suspect that for most drivers the path of least resistance is to remember the minor number in the device vnode and pass it to the device ops. But it's also reasonable to create a device instance for each valid minor number, look that up at open time, and then dispatch via that instance afterwards. It may depend on the device class...
it isn't clear to me right now whether cloners exist in all classes (meaning that all device classes will need machinery for handling explicit instances) or are specific to some classes and not others. Right now because cloners are messy they're probably not used in all the places they might potentially make sense.

A third question: how does this affect interfaces? The answer is: hopefully as little as possible. Interfaces are their own mess :-|

Anyhow, I think this architecture addresses all the problems cited. The critical question is: what have I overlooked? There are probably some issues I've thought about but failed to remember to discuss above; there are also probably some issues I've not thought about or am completely unaware of.

If you are aware of any details anywhere that would explode all this please post. Also, if it seems unclear or vague on some particular point, please post too; reactions of that form sometimes just mean I didn't write clearly and should try again, but sometimes also reflect real problems or issues that have been overlooked. The ways in which the different sets of changes interact aren't necessarily obvious and might be wrong in places.

And if you think it's all a terrible idea or that the problems at the beginning are nonissues, that's important to know too.

Hopefully though we can reach some kind of conclusion about the direction to aim in. (How to get there without exploding the world on the way is then the next question...)

Note that what's in this message is a summary of things I've been contemplating the past few years, and probably a fair number of people have heard parts of it before, but I think this is the first time I've tried to really roll it all together.

--
David A. Holland
dholl...@netbsd.org