Re: assumption about a device's maxphys
On Tue, Oct 09, 2012 at 06:16:44PM -0700, Chuck Silvers wrote: hi thor, your tls-maxphys branch appears to assume that a given device's maxphys is constant, but consider the case of a software device like a logical volume from LVM. one very nice feature of a volume manager that I used to work with (veritas vxvm) is that it can migrate a volume to new storage on the fly, while the data in it is being read and written. with such a virtual device, I assume that intention would be that the maxphys of the volume would be the same as the maxphys of the underlying physical device, but the maxphys of the underlying physical device could be different before and after a volume migration, so that doesn't fit very well with the assumption that a device's maxphys doesn't change. caching the device maxphys in a file system's struct mount is also problematic since a file system may directly access multiple underlying devices, as ZFS does, and those devices may each have a different maxphys. could you think about how to enhance the new maxphys design to accomodate these cases? There's another case, which I think is worse: a raidframe volume, with underlying disks with different maxphys. If it's a raid-1 you can't predict from which disk a read will come from, so you don't know the maxphys. I wouldn't expect this case (a volume composed of multiple disks with different maxphys) to be that common, so I'm not sure we should optimise for this. The volume's maxphys would be the lower of all the devices's maxphys. A tricky case is when the new maxphys would be smaller. You would need to suspend filesystemm operations before changing it, but I'm not sure all filesystems support this. Maybe support for splitting the request in VOP_STRATEGY() or another appropriate place would be better ? -- Manuel Bouyer bou...@antioche.eu.org NetBSD: 26 ans d'experience feront toujours la difference --
Re: assumption about a device's maxphys
On Wed, Oct 10, 2012 at 12:02:21PM +0200, Manuel Bouyer wrote: There's another case, which I think is worse: a raidframe volume, with underlying disks with different maxphys. If it's a raid-1 you can't predict from which disk a read will come from, so you don't know the maxphys. Sure you do. The maxphys computation for RAIDframe is simple and consistent: it is the smallest maxphys of any component disk, multiplied by the number of data (not parity) elements in the set. It works the same for RAID1 as for any other RAID level: a RAID1 set has 1 data element (so does a RAID0 set) so it is simply the smallest maxphys of any disk in the set. I wouldn't expect this case (a volume composed of multiple disks with different maxphys) to be that common, so I'm not sure we should optimise for this. The volume's maxphys would be the lower of all the devices's maxphys. A tricky case is when the new maxphys would be smaller. You would need to suspend filesystemm operations before changing it, but I'm not sure all filesystems support this. Maybe support for splitting the request in VOP_STRATEGY() or another appropriate place would be better ? That is my thinking: a device driver whose maxphys can change should split (and potentially even combine, though this is a matter of performance not correctness -- xbdback can already do this, though) requests as necessary, since there is no real atomicity guarantee for a request larger than a single sector anyhow. I really would like to avoid reaching out through multiple layers of indirect reference (or via several stacked up function pointers, defeating branch prediction etc.) every time we schedule an I/O or even consider which pages to push or pull on a single I/O. Thor
Re: assumption about a device's maxphys
On Tue, Oct 09, 2012 at 06:16:44PM -0700, Chuck Silvers wrote: caching the device maxphys in a file system's struct mount is also problematic since a file system may directly access multiple underlying devices, as ZFS does, and those devices may each have a different maxphys. I did think about this at some length. But our kernel is shot through with assumptions that a filesystem lives on a single device, not just where I/O is dispatched but in the mount path and elsewhere. The only short path forward would seem to be virtual devices that homogenize the view of such things -- and isn't this, actually, what our ZFS implementation does? A zpool is a single device, not many, no? Doesn't even Solaris do this? I also do not believe it is appropriate for the upper layers (the physio code, the pager code, etc.) to have to reach down into the filesystem code to understand what size I/O they may do on a per-vnode basis -- or, worse, on a per-vnode, per-offset basis. Per-vnode we could accomplish by moving the maxphys value from the mount entry to the vnode, perhaps defaulting it from the mount entry at vnode creation time if the filesystem does not want to do something more sophisticated, but what if half of a file is on device X and half on device Y? I believe the lower layers should hide these things and that at some point a single, consistent value should be presented over a larger collection of objects. I believe the filesystem is the most convenient such collection. Thor
Re: assumption about a device's maxphys
On Wed, Oct 10, 2012 at 08:22:05AM -0400, Thor Lancelot Simon wrote: On Wed, Oct 10, 2012 at 12:02:21PM +0200, Manuel Bouyer wrote: There's another case, which I think is worse: a raidframe volume, with underlying disks with different maxphys. If it's a raid-1 you can't predict from which disk a read will come from, so you don't know the maxphys. Sure you do. The maxphys computation for RAIDframe is simple and consistent: it is the smallest maxphys of any component disk, multiplied by the number of data (not parity) elements in the set. It works the same for RAID1 as for any other RAID level: a RAID1 set has 1 data element (so does a RAID0 set) so it is simply the smallest maxphys of any disk in the set. But this assumes you knows the disks at config time. The problem is that you can add or remove drives from a raid volume, which may change the maxphys. I wouldn't expect this case (a volume composed of multiple disks with different maxphys) to be that common, so I'm not sure we should optimise for this. The volume's maxphys would be the lower of all the devices's maxphys. A tricky case is when the new maxphys would be smaller. You would need to suspend filesystemm operations before changing it, but I'm not sure all filesystems support this. Maybe support for splitting the request in VOP_STRATEGY() or another appropriate place would be better ? That is my thinking: a device driver whose maxphys can change should split (and potentially even combine, though this is a matter of performance not correctness -- xbdback can already do this, though) requests as necessary, since there is no real atomicity guarantee for a request larger than a single sector anyhow. This is one solution, but I think it should be centralised. No need to replicate this in every drivers which needs this. I really would like to avoid reaching out through multiple layers of indirect reference (or via several stacked up function pointers, defeating branch prediction etc.) every time we schedule an I/O or even consider which pages to push or pull on a single I/O. I agree. -- Manuel Bouyer bou...@antioche.eu.org NetBSD: 26 ans d'experience feront toujours la difference --
Re: assumption about a device's maxphys
On Wed, Oct 10, 2012 at 04:04:02PM +0200, Manuel Bouyer wrote: On Wed, Oct 10, 2012 at 08:22:05AM -0400, Thor Lancelot Simon wrote: On Wed, Oct 10, 2012 at 12:02:21PM +0200, Manuel Bouyer wrote: There's another case, which I think is worse: a raidframe volume, with underlying disks with different maxphys. If it's a raid-1 you can't predict from which disk a read will come from, so you don't know the maxphys. Sure you do. The maxphys computation for RAIDframe is simple and consistent: it is the smallest maxphys of any component disk, multiplied by the number of data (not parity) elements in the set. It works the same for RAID1 as for any other RAID level: a RAID1 set has 1 data element (so does a RAID0 set) so it is simply the smallest maxphys of any disk in the set. But this assumes you knows the disks at config time. The problem is that you can add or remove drives from a raid volume, which may change the maxphys. In the short term, I think RAIDframe should simply disallow such operations. Teaching RAIDframe to split I/O is not going to be pleasant and, in practice, adding devices with smaller maxphys to an existing RAID set while it's running should be very uncommon. RAIDframe could, also, of course, impose the old MAXPHYS default on a per-component basis, ensuring we cannot ever make the problem any worse than it was -- only better. We could allow overriding this on a per-set basis. That is my thinking: a device driver whose maxphys can change should split (and potentially even combine, though this is a matter of performance not correctness -- xbdback can already do this, though) requests as necessary, since there is no real atomicity guarantee for a request larger than a single sector anyhow. This is one solution, but I think it should be centralised. No need to replicate this in every drivers which needs this. There are very, very few drivers that would ever need this -- only drivers that can hide multiple disks beneath a single virtual device. We could provide a common utility function, perhaps (I notice since the last time I looked at implementing this in disksort(), it may have become easier, because we grew something called nestiobuf, though I am not sure it is right for this purpose) but I don't think we should force all I/O through a layer that does this; it is almost never wanted. Thor
Re: call for testing: completing the device/softc split
On Wed, Oct 10, 2012 at 03:57:30AM +0900, Izumi Tsutsui wrote: chs@ wrote: I put together a patch to complete the device/softc split for the remaining drivers in the tree. the patch is at: http://ftp.netbsd.org/pub/NetBSD/misc/chs/diff.devsoftc.8 +++ sys/arch/amiga/amiga/autoconf.c 9 Oct 2012 02:53:15 - @@ -156,34 +156,52 @@ matchname(const char *fp, const char *sp * always tell the difference betwean the real and console init * by checking for NULL. */ +struct qq { + int q; + int c; +} qq; Debug leftover? oops, yes, thanks for catching that. +++ sys/dev/pci/if_devar.h 26 Sep 2012 23:44:22 - @@ -487,7 +487,7 @@ struct _tulip_softc_t { #endif /* _BSDI_VERSION 199401 */ #endif /* __bsdi__ */ #if defined(__NetBSD__) -struct device tulip_dev; /* base device */ +device_t tulip_dev;/* base device */ Some more macro? #define tulip_unit tulip_dev.dv_unit Or it's time to obsolete de driver? last I heard, there were still some tulips that worked with de and not with tlp. I'll update the de driver, though apparently no kernel config in the tree uses it... I built every kernel config for every platform to make sure things compiled. though I wonder how I missed that? I grep'd for struct device, I must have been confused by the ifdef mess in there. ... and apparently my grep didn't handle tabs between the words, I see I missed some other instances with varying whitespace. It would be nice to split the patch into two parts, cosmetic only changes (struct device * - device_t, device_xname() macro etc) and actual split (CFATTACH_DECL - CFATTACH_DECL_NEW with softc) that could have many pitfalls. (conversion between device_t and softc via (void *) casts/pointers) yea, that would have been a better way to go about it, but at this point the changes are pretty hopelessly intertwined. separating them by hand now just isn't practical. most of the device_xname() parts could probably be isolated with some text processing of the diff, and perhaps quite a bit of the struct device * - device_t parts as well. I can try doing that if more people actually want to look at the code changes. but I didn't expect that anyone would want to wade through even the interesting changes, I was really only hoping for people to test. (though actually you've caught some botches in x68k ;-) the adapt_dev thing? yea, that was one of the useful things that I found to grep for. I also noticed that this patch fixes a problem on sparc, booting a -current sparc GENERIC + QUEUEDEBUG hits an assertion about the alldevs list being corrupted, whereas with this patch it's fine. it wasn't obvious exactly what I fixed though. -Chuck
Re: call for testing: completing the device/softc split
It would be nice to split the patch into two parts, cosmetic only changes (struct device * - device_t, device_xname() macro etc) and actual split (CFATTACH_DECL - CFATTACH_DECL_NEW with softc) that could have many pitfalls. (conversion between device_t and softc via (void *) casts/pointers) yea, that would have been a better way to go about it, but at this point the changes are pretty hopelessly intertwined. separating them by hand now just isn't practical. No problem, but I'd like to commit the following x68k ones - sys/arch/x68k/dev/ite.c - sys/arch/x68k/dev/itevar.h - sys/arch/x68k/dev/mha.c (which would be fatal) separately so that I can use one commit log to pullup request to netbsd-6. I also noticed that this patch fixes a problem on sparc, booting a -current sparc GENERIC + QUEUEDEBUG hits an assertion about the alldevs list being corrupted, whereas with this patch it's fine. it wasn't obvious exactly what I fixed though. Some device_t/softc conversions in sys/arch/sparc/dev/fd.c and sys/arch/sparc/sparc/memecc.c seem fatal. Good catch :-) --- Izumi Tsutsui
Re: call for testing: completing the device/softc split
On Thu, Oct 11, 2012 at 12:47:56AM +0900, Izumi Tsutsui wrote: It would be nice to split the patch into two parts, cosmetic only changes (struct device * - device_t, device_xname() macro etc) and actual split (CFATTACH_DECL - CFATTACH_DECL_NEW with softc) that could have many pitfalls. (conversion between device_t and softc via (void *) casts/pointers) yea, that would have been a better way to go about it, but at this point the changes are pretty hopelessly intertwined. separating them by hand now just isn't practical. No problem, but I'd like to commit the following x68k ones - sys/arch/x68k/dev/ite.c - sys/arch/x68k/dev/itevar.h - sys/arch/x68k/dev/mha.c (which would be fatal) separately so that I can use one commit log to pullup request to netbsd-6. sure, go right ahead. everyone please feel free to commit any part of this stuff separately. I also noticed that this patch fixes a problem on sparc, booting a -current sparc GENERIC + QUEUEDEBUG hits an assertion about the alldevs list being corrupted, whereas with this patch it's fine. it wasn't obvious exactly what I fixed though. Some device_t/softc conversions in sys/arch/sparc/dev/fd.c and sys/arch/sparc/sparc/memecc.c seem fatal. Good catch :-) cool, I'll split those out and commit them separately too. or you can do it if you'd like. -Chuck
Re: call for testing: completing the device/softc split
No problem, but I'd like to commit the following x68k ones - sys/arch/x68k/dev/ite.c - sys/arch/x68k/dev/itevar.h - sys/arch/x68k/dev/mha.c (which would be fatal) separately so that I can use one commit log to pullup request to netbsd-6. sure, go right ahead. mha.c is done. ite chnages seem to cause silent hang at early bootstrap at least on XM6i, so I'll check it on real X68030 with serial console later. (IIRC atari's ite also had annoying quirks around console probe code..) I also noticed that this patch fixes a problem on sparc, booting a -current sparc GENERIC + QUEUEDEBUG hits an assertion about the alldevs list being corrupted, whereas with this patch it's fine. it wasn't obvious exactly what I fixed though. Some device_t/softc conversions in sys/arch/sparc/dev/fd.c and sys/arch/sparc/sparc/memecc.c seem fatal. Good catch :-) cool, I'll split those out and commit them separately too. or you can do it if you'd like. These are also committed. (at least no crash on qemu, though it doesn't have these devices) Thanks! --- Izumi Tsutsui
Re: call for testing: completing the device/softc split
I wrote: ite chnages seem to cause silent hang at early bootstrap at least on XM6i, so I'll check it on real X68030 with serial console later. (IIRC atari's ite also had annoying quirks around console probe code..) I've also fixed and committed x68k ite. It has a static struct softc for early console, so we should also prepare a static struct device after it's split out from softc. Oh well. (IIRC there were several such devices that required extra struct device and softc, like sys/dev/sbus/esp_sbus.c) --- Izumi Tsutsui
Re: call for testing: completing the device/softc split
On Oct 10, 2012, at 10:55 AM, Izumi Tsutsui wrote: I wrote: ite chnages seem to cause silent hang at early bootstrap at least on XM6i, so I'll check it on real X68030 with serial console later. (IIRC atari's ite also had annoying quirks around console probe code..) I've also fixed and committed x68k ite. It has a static struct softc for early console, so we should also prepare a static struct device after it's split out from softc. Oh well. No. Just use a softc size of 0 and set self-dv_private to the static softc in the attach and sc-sc_dev to self as well (we probably should have a device_set_private(device_t, void *) function) (IIRC there were several such devices that required extra struct device and softc, like sys/dev/sbus/esp_sbus.c)
Re: assumption about a device's maxphys
On Oct,Wednesday 10 2012, at 2:46 PM, Thor Lancelot Simon t...@panix.com wrote: On Tue, Oct 09, 2012 at 06:16:44PM -0700, Chuck Silvers wrote: caching the device maxphys in a file system's struct mount is also problematic since a file system may directly access multiple underlying devices, as ZFS does, and those devices may each have a different maxphys. I did think about this at some length. But our kernel is shot through with assumptions that a filesystem lives on a single device, not just where I/O is dispatched but in the mount path and elsewhere. The only short path forward would seem to be virtual devices that homogenize the view of such things -- and isn't this, actually, what our ZFS implementation does? A zpool is a single device, not many, no? Doesn't even Solaris do this? Situation like this is quite common even for LVM where after resize added chunk of disk space may resides on other disk which might have different maxphys. I also do not believe it is appropriate for the upper layers (the physio code, the pager code, etc.) to have to reach down into the filesystem code to understand what size I/O they may do on a per-vnode basis -- or, worse, on a per-vnode, per-offset basis. Per-vnode we could accomplish by moving the maxphys value from the mount entry to the vnode, perhaps defaulting it from the mount entry at vnode creation time if the filesystem does not want to do something more sophisticated, but what if half of a file is on device X and half on device Y? I believe the lower layers should hide these things and that at some point a single, consistent value should be presented over a larger collection of objects. I believe the filesystem is the most convenient such collection. Thor Regards Adam.