Date: Mon, 17 Aug 2015 23:20:01 +0000 (UTC)
From: mlel...@serpens.de (Michael van Elst)
Message-ID: <20150817232001.13113a6...@mollari.netbsd.org>

| The following reply was made to PR bin/50108; it has been noted by GNATS.

The quotes are from a message Michael van Elst made in reply to PR
bin/50108 back on 17th August (2015 in case it isn't implied).  The
full message should be available in the PR.

As people who have been following the (meandering) thread related to
"beating a dead horse" on the netbsd-users list will know, I have been
looking at supporting drives with 4K sector sizes properly in NetBSD.

Currently what we have is a mess.  Really, despite this ...

| At some point, about when the SCSI subsystem was integrated into
| the kernel, the model was changed, the kernel now uses the fixed
| DEV_BSIZE=512 coordinates and the driver translates that into
| physical blocks.

being kind of close, it is just not true.

Now that PR, and Michael's message, were mostly on the topic of
kernel/user interactions, and noted that userland is expected to use
sector size units, not DEV_BSIZE, and all is fine most of the time,
except for code that's shared (shared WAPBL code was the issue of the
PR).  So, not getting all the in-kernel details precisely correct may
be excused.

For what I have been looking at however, that doesn't work, and that
"the driver translates" is a problem.  That's because it simply isn't
true that the kernel uses DEV_BSIZE units (let's call those "blocks"
for the purpose of this e-mail, and "sectors" will be things measured
in the relevant native sector size) everywhere.  For stuff related to
low level properties of devices (like reading and writing labels, etc)
the kernel uses sectors, and sector numbers, not blocks and block
numbers.  But they all end up going through the same driver interfaces
to actually perform the I/O.

Now at the minute, things are carefully arranged so that in the normal
case (eg: a ffs on a drive) it all just works, the translations happen
when they should and don't happen when they shouldn't.  It is almost
magic.

Unfortunately, once we add the "stacked" devices (cgd, and ccd for
sure, perhaps lvm and raidframe, I haven't looked at those enough to
know) the model breaks down, and we get incorrect conversions.
Currently at least cgd and (I believe) lvm just pretend that sectors
are blocks, and that makes stuff work ... when sectors are blocks
(ie: when sector size == DEV_BSIZE, which has been almost always) that
just works (obviously), and when sectors are bigger than blocks, it
also "works", you just get less available space (for 4K sector drives,
cgd and lvm both give you 1/8 the space that you should have had on
the device.)
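
Just to make the shape of the problem concrete, here is a sketch of
the kind of thing involved (the names and code are invented for this
e-mail, not taken from the actual driver or cgd sources, and it
assumes the sector size is a multiple of DEV_BSIZE):

#include <sys/types.h>	/* daddr_t */
#include <stdint.h>

#define DEV_BSIZE	512	/* the real definition lives in <sys/param.h> */

/*
 * A transfer counted in DEV_BSIZE blocks (eg: from a filesystem) has
 * to be rescaled by the driver into native sectors.  A transfer
 * generated by label handling code is already counted in sectors and
 * must not be rescaled -- but both arrive through the same interface,
 * in the same b_blkno field, with nothing recording which unit was
 * meant.
 */
daddr_t
blk_to_sector(daddr_t blkno, unsigned secsize)
{
	return blkno / (daddr_t)(secsize / DEV_BSIZE);	/* divide by 8 for 4K */
}

/* The "pretend sectors are blocks" shortcut, and what it costs. */
uint64_t
capacity_pretend(uint64_t nsectors)
{
	return nsectors * DEV_BSIZE;	/* a 4K drive shows 1/8 of its size */
}

uint64_t
capacity_real(uint64_t nsectors, unsigned secsize)
{
	return nsectors * secsize;
}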
Sector sizes smaller than blocks are rare, and becoming rarer; we
mostly just ignore that case (there are fragments of code in the
kernel that pretend to allow it, but most of the code just assumes it
cannot happen.)  ccd (especially if combining a 4K byte sector device
with a 512 byte sector device) is simply a mess - perhaps almost a
candidate for extermination.  (Or maybe it can be resuscitated, who
knows right now?)  raidframe I haven't thought about, or investigated,
at all.

I currently see two (wildly different) approaches that could be used
to fix the problems...

One is to convert the kernel to use byte offsets absolutely
everywhere.  Convert to/from byte offsets when dealing with hardware
(like disks that want an LBA, for some size of B) and with formats
that store units other than bytes (like labels on disks, etc).  But
internally, everything would be counted in bytes, always (zero
exceptions allowed.)

The other is to carry explicit unit designators along with every
blk/sec number, everywhere (so struct buf, which has b_blkno, would
also need a b_blkunit field added for example - no endorsement implied
for that name.)
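
For the second of those, the shape I have in mind is roughly this
(again purely a sketch invented for this e-mail, nothing here is real
code, and as said above the field name is only a placeholder):

#include <sys/types.h>	/* daddr_t */

/* Hypothetical: record which unit a block/sector number is counted in. */
enum blkunit {
	BLKUNIT_DEV_BSIZE,	/* counted in DEV_BSIZE (512 byte) units */
	BLKUNIT_SECTOR		/* counted in the device's native sectors */
};

struct buf {
	/* ... all of the existing fields ... */
	daddr_t		b_blkno;	/* the number, just as now */
	enum blkunit	b_blkunit;	/* which unit b_blkno is counted in */
};

/*
 * A driver (or a stacked device like cgd) would then convert exactly
 * when the tag says a conversion is needed, instead of relying on the
 * current "it happens to work out" arrangement.
 */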
Either of those would remove the ambiguity that we currently face (and
without needing to special case the code to deal with the possibility
that sectors might be bigger, or smaller, than blocks).

The first of those has the advantage that it more closely models the
kernel/user generic interface for most i/o sys calls.  Read/write get
told how many bytes to transfer, not how many blocks, or sectors, even
when transferring to/from a device that requires a fixed number of
sectors in order to work.  Similarly, lseek is always given a byte
offset, never a block or sector number.

It has two disadvantages that I can see at the minute.  One is that it
would require "large int" fields (at least 64 bits) everywhere, which
would infect even small old systems (sun2, vax, ...) which are
probably never going to see a device with anything but 512 byte
sectors, nor anything big enough to need more than 32 bits as a sector
number ... and which tend not to have native 64 bit arithmetic
available in hardware (and are already slow.)  Second, it makes some
things that are currently constants become variable.  For example, a
GPT primary label goes in sector 1.  That's at byte offset 512 when
that is the sector size, or byte offset 4096 for 4K drives.  How much
of a problem this would be I haven't really investigated yet, but I
suspect it is likely to infect far more code than we'd like to hope.

The second approach requires changes to data structs and func
signatures all over the place; it would be a major shake-up of the
internals of the system.

Currently I have no real preference (a slight leaning towards the 2nd)
so I'm seeking opinions.  Is one of these approaches better than the
other (and if so why), or is there some other way I haven't considered
yet?

kre