Re: assumption about a device's maxphys

2012-10-10 Thread Manuel Bouyer
On Tue, Oct 09, 2012 at 06:16:44PM -0700, Chuck Silvers wrote:
 hi thor,
 
 your tls-maxphys branch appears to assume that a given device's maxphys is 
 constant,
 but consider the case of a software device like a logical volume from LVM.
 one very nice feature of a volume manager that I used to work with (veritas 
 vxvm)
 is that it can migrate a volume to new storage on the fly, while the data in 
 it
 is being read and written.  with such a virtual device, I assume that 
 intention
 would be that the maxphys of the volume would be the same as the maxphys of 
 the
 underlying physical device, but the maxphys of the underlying physical device
 could be different before and after a volume migration, so that doesn't fit
 very well with the assumption that a device's maxphys doesn't change.
 
 caching the device maxphys in a file system's struct mount is also 
 problematic
 since a file system may directly access multiple underlying devices, as ZFS 
 does,
 and those devices may each have a different maxphys.
 
 could you think about how to enhance the new maxphys design to accomodate 
 these cases?

There's another case, which I think is worse: a raidframe volume, with
underlying disks with different maxphys. If it's a raid-1 you can't predict
from which disk a read will come from, so you don't know the maxphys.

I wouldn't expect this case (a volume composed of multiple disks with
different maxphys) to be that common, so I'm not sure we should optimise
for this. The volume's maxphys would be the lower of all the devices's
maxphys.

A tricky case is when the new maxphys would be smaller. You would need
to suspend filesystemm operations before changing it, but I'm not sure
all filesystems support this. Maybe support for splitting the request in
VOP_STRATEGY() or another appropriate place would be better ?

-- 
Manuel Bouyer bou...@antioche.eu.org
 NetBSD: 26 ans d'experience feront toujours la difference
--


Re: assumption about a device's maxphys

2012-10-10 Thread Thor Lancelot Simon
On Wed, Oct 10, 2012 at 12:02:21PM +0200, Manuel Bouyer wrote:
 
 There's another case, which I think is worse: a raidframe volume, with
 underlying disks with different maxphys. If it's a raid-1 you can't predict
 from which disk a read will come from, so you don't know the maxphys.

Sure you do.  The maxphys computation for RAIDframe is simple and
consistent: it is the smallest maxphys of any component disk, multiplied
by the number of data (not parity) elements in the set.

It works the same for RAID1 as for any other RAID level: a RAID1 set has
1 data element (so does a RAID0 set) so it is simply the smallest maxphys
of any disk in the set.

 I wouldn't expect this case (a volume composed of multiple disks with
 different maxphys) to be that common, so I'm not sure we should optimise
 for this. The volume's maxphys would be the lower of all the devices's
 maxphys.
 
 A tricky case is when the new maxphys would be smaller. You would need
 to suspend filesystemm operations before changing it, but I'm not sure
 all filesystems support this. Maybe support for splitting the request in
 VOP_STRATEGY() or another appropriate place would be better ?

That is my thinking: a device driver whose maxphys can change should split
(and potentially even combine, though this is a matter of performance not
correctness -- xbdback can already do this, though) requests as
necessary, since there is no real atomicity guarantee for a request
larger than a single sector anyhow.

I really would like to avoid reaching out through multiple layers of
indirect reference (or via several stacked up function pointers,
defeating branch prediction etc.) every time we schedule an I/O or
even consider which pages to push or pull on a single I/O.

Thor


Re: assumption about a device's maxphys

2012-10-10 Thread Thor Lancelot Simon
On Tue, Oct 09, 2012 at 06:16:44PM -0700, Chuck Silvers wrote:
 
 caching the device maxphys in a file system's struct mount is also
 problematic since a file system may directly access multiple underlying
 devices, as ZFS does, and those devices may each have a different maxphys.

I did think about this at some length.  But our kernel is shot through with
assumptions that a filesystem lives on a single device, not just where I/O
is dispatched but in the mount path and elsewhere.  The only short path
forward would seem to be virtual devices that homogenize the view of such
things -- and isn't this, actually, what our ZFS implementation does?  A
zpool is a single device, not many, no?  Doesn't even Solaris do this?

I also do not believe it is appropriate for the upper layers (the physio
code, the pager code, etc.) to have to reach down into the filesystem code
to understand what size I/O they may do on a per-vnode basis -- or, worse,
on a per-vnode, per-offset basis.  Per-vnode we could accomplish by moving
the maxphys value from the mount entry to the vnode, perhaps defaulting it
from the mount entry at vnode creation time if the filesystem does not want
to do something more sophisticated, but what if half of a file is on device
X and half on device Y?  I believe the lower layers should hide these things
and that at some point a single, consistent value should be presented over
a larger collection of objects.  I believe the filesystem is the most
convenient such collection.

Thor


Re: assumption about a device's maxphys

2012-10-10 Thread Manuel Bouyer
On Wed, Oct 10, 2012 at 08:22:05AM -0400, Thor Lancelot Simon wrote:
 On Wed, Oct 10, 2012 at 12:02:21PM +0200, Manuel Bouyer wrote:
  
  There's another case, which I think is worse: a raidframe volume, with
  underlying disks with different maxphys. If it's a raid-1 you can't predict
  from which disk a read will come from, so you don't know the maxphys.
 
 Sure you do.  The maxphys computation for RAIDframe is simple and
 consistent: it is the smallest maxphys of any component disk, multiplied
 by the number of data (not parity) elements in the set.
 
 It works the same for RAID1 as for any other RAID level: a RAID1 set has
 1 data element (so does a RAID0 set) so it is simply the smallest maxphys
 of any disk in the set.

But this assumes you knows the disks at config time. The problem is that
you can add or remove drives from a raid volume, which may change the
maxphys. 

 
  I wouldn't expect this case (a volume composed of multiple disks with
  different maxphys) to be that common, so I'm not sure we should optimise
  for this. The volume's maxphys would be the lower of all the devices's
  maxphys.
  
  A tricky case is when the new maxphys would be smaller. You would need
  to suspend filesystemm operations before changing it, but I'm not sure
  all filesystems support this. Maybe support for splitting the request in
  VOP_STRATEGY() or another appropriate place would be better ?
 
 That is my thinking: a device driver whose maxphys can change should split
 (and potentially even combine, though this is a matter of performance not
 correctness -- xbdback can already do this, though) requests as
 necessary, since there is no real atomicity guarantee for a request
 larger than a single sector anyhow.

This is one solution, but I think it should be centralised. No need to
replicate this in every drivers which needs this.

 
 I really would like to avoid reaching out through multiple layers of
 indirect reference (or via several stacked up function pointers,
 defeating branch prediction etc.) every time we schedule an I/O or
 even consider which pages to push or pull on a single I/O.

I agree.

-- 
Manuel Bouyer bou...@antioche.eu.org
 NetBSD: 26 ans d'experience feront toujours la difference
--


Re: assumption about a device's maxphys

2012-10-10 Thread Thor Lancelot Simon
On Wed, Oct 10, 2012 at 04:04:02PM +0200, Manuel Bouyer wrote:
 On Wed, Oct 10, 2012 at 08:22:05AM -0400, Thor Lancelot Simon wrote:
  On Wed, Oct 10, 2012 at 12:02:21PM +0200, Manuel Bouyer wrote:
   
   There's another case, which I think is worse: a raidframe volume, with
   underlying disks with different maxphys. If it's a raid-1 you can't 
   predict
   from which disk a read will come from, so you don't know the maxphys.
  
  Sure you do.  The maxphys computation for RAIDframe is simple and
  consistent: it is the smallest maxphys of any component disk, multiplied
  by the number of data (not parity) elements in the set.
  
  It works the same for RAID1 as for any other RAID level: a RAID1 set has
  1 data element (so does a RAID0 set) so it is simply the smallest maxphys
  of any disk in the set.
 
 But this assumes you knows the disks at config time. The problem is that
 you can add or remove drives from a raid volume, which may change the
 maxphys. 

In the short term, I think RAIDframe should simply disallow such
operations.  Teaching RAIDframe to split I/O is not going to be pleasant
and, in practice, adding devices with smaller maxphys to an existing
RAID set while it's running should be very uncommon.

RAIDframe could, also, of course, impose the old MAXPHYS default on a
per-component basis, ensuring we cannot ever make the problem any worse
than it was -- only better.  We could allow overriding this on a per-set
basis.

  That is my thinking: a device driver whose maxphys can change should split
  (and potentially even combine, though this is a matter of performance not
  correctness -- xbdback can already do this, though) requests as
  necessary, since there is no real atomicity guarantee for a request
  larger than a single sector anyhow.
 
 This is one solution, but I think it should be centralised. No need to
 replicate this in every drivers which needs this.

There are very, very few drivers that would ever need this -- only drivers
that can hide multiple disks beneath a single virtual device.  We could
provide a common utility function, perhaps (I notice since the last time
I looked at implementing this in disksort(), it may have become easier,
because we grew something called nestiobuf, though I am not sure it is
right for this purpose) but I don't think we should force all I/O through
a layer that does this; it is almost never wanted.

Thor


Re: call for testing: completing the device/softc split

2012-10-10 Thread Chuck Silvers
On Wed, Oct 10, 2012 at 03:57:30AM +0900, Izumi Tsutsui wrote:
 chs@ wrote:
 
  I put together a patch to complete the device/softc split
  for the remaining drivers in the tree.  the patch is at:
  http://ftp.netbsd.org/pub/NetBSD/misc/chs/diff.devsoftc.8
 
  +++ sys/arch/amiga/amiga/autoconf.c 9 Oct 2012 02:53:15 -
  @@ -156,34 +156,52 @@ matchname(const char *fp, const char *sp
* always tell the difference betwean the real and console init
* by checking for NULL.
*/
  +struct qq {
  +  int q;
  +  int c;
  +} qq;
 
 Debug leftover?

oops, yes, thanks for catching that.


  +++ sys/dev/pci/if_devar.h  26 Sep 2012 23:44:22 -
  @@ -487,7 +487,7 @@ struct _tulip_softc_t {
   #endif /* _BSDI_VERSION  199401 */
   #endif /* __bsdi__ */
   #if defined(__NetBSD__)
  -struct device tulip_dev;   /* base device */
  +device_t tulip_dev;/* base device */
 
 Some more macro?
  #define tulip_unit tulip_dev.dv_unit
 
 Or it's time to obsolete de driver?

last I heard, there were still some tulips that worked with de and not with tlp.
I'll update the de driver, though apparently no kernel config in the tree uses 
it...
I built every kernel config for every platform to make sure things compiled.

though I wonder how I missed that?  I grep'd for struct device,
I must have been confused by the ifdef mess in there.
... and apparently my grep didn't handle tabs between the words,
I see I missed some other instances with varying whitespace.


 It would be nice to split the patch into two parts,
 cosmetic only changes (struct device * - device_t,
 device_xname() macro etc) and actual split
 (CFATTACH_DECL - CFATTACH_DECL_NEW with softc)
 that could have many pitfalls. (conversion between
 device_t and softc via (void *) casts/pointers)

yea, that would have been a better way to go about it,
but at this point the changes are pretty hopelessly intertwined.
separating them by hand now just isn't practical.
most of the device_xname() parts could probably be isolated with some
text processing of the diff, and perhaps quite a bit of the
struct device * - device_t parts as well.
I can try doing that if more people actually want to look at
the code changes.

but I didn't expect that anyone would want to wade through even the
interesting changes, I was really only hoping for people to test.


 (though actually you've caught some botches in x68k ;-)

the adapt_dev thing?  yea, that was one of the useful things
that I found to grep for.

I also noticed that this patch fixes a problem on sparc,
booting a -current sparc GENERIC + QUEUEDEBUG hits an assertion
about the alldevs list being corrupted, whereas with this patch
it's fine.  it wasn't obvious exactly what I fixed though.

-Chuck


Re: call for testing: completing the device/softc split

2012-10-10 Thread Izumi Tsutsui
  It would be nice to split the patch into two parts,
  cosmetic only changes (struct device * - device_t,
  device_xname() macro etc) and actual split
  (CFATTACH_DECL - CFATTACH_DECL_NEW with softc)
  that could have many pitfalls. (conversion between
  device_t and softc via (void *) casts/pointers)
 
 yea, that would have been a better way to go about it,
 but at this point the changes are pretty hopelessly intertwined.
 separating them by hand now just isn't practical.

No problem, but I'd like to commit the following x68k ones
 - sys/arch/x68k/dev/ite.c
 - sys/arch/x68k/dev/itevar.h
 - sys/arch/x68k/dev/mha.c
(which would be fatal) separately so that I can use one commit log
to pullup request to netbsd-6.

 I also noticed that this patch fixes a problem on sparc,
 booting a -current sparc GENERIC + QUEUEDEBUG hits an assertion
 about the alldevs list being corrupted, whereas with this patch
 it's fine.  it wasn't obvious exactly what I fixed though.

Some device_t/softc conversions in sys/arch/sparc/dev/fd.c and
sys/arch/sparc/sparc/memecc.c seem fatal.
Good catch :-)

---
Izumi Tsutsui


Re: call for testing: completing the device/softc split

2012-10-10 Thread Chuck Silvers
On Thu, Oct 11, 2012 at 12:47:56AM +0900, Izumi Tsutsui wrote:
   It would be nice to split the patch into two parts,
   cosmetic only changes (struct device * - device_t,
   device_xname() macro etc) and actual split
   (CFATTACH_DECL - CFATTACH_DECL_NEW with softc)
   that could have many pitfalls. (conversion between
   device_t and softc via (void *) casts/pointers)
  
  yea, that would have been a better way to go about it,
  but at this point the changes are pretty hopelessly intertwined.
  separating them by hand now just isn't practical.
 
 No problem, but I'd like to commit the following x68k ones
  - sys/arch/x68k/dev/ite.c
  - sys/arch/x68k/dev/itevar.h
  - sys/arch/x68k/dev/mha.c
 (which would be fatal) separately so that I can use one commit log
 to pullup request to netbsd-6.

sure, go right ahead.

everyone please feel free to commit any part of this stuff separately.


  I also noticed that this patch fixes a problem on sparc,
  booting a -current sparc GENERIC + QUEUEDEBUG hits an assertion
  about the alldevs list being corrupted, whereas with this patch
  it's fine.  it wasn't obvious exactly what I fixed though.
 
 Some device_t/softc conversions in sys/arch/sparc/dev/fd.c and
 sys/arch/sparc/sparc/memecc.c seem fatal.
 Good catch :-)

cool, I'll split those out and commit them separately too.
or you can do it if you'd like.

-Chuck


Re: call for testing: completing the device/softc split

2012-10-10 Thread Izumi Tsutsui
  No problem, but I'd like to commit the following x68k ones
   - sys/arch/x68k/dev/ite.c
   - sys/arch/x68k/dev/itevar.h
   - sys/arch/x68k/dev/mha.c
  (which would be fatal) separately so that I can use one commit log
  to pullup request to netbsd-6.
 
 sure, go right ahead.

mha.c is done.

ite chnages seem to cause silent hang at early bootstrap at least on XM6i,
so I'll check it on real X68030 with serial console later.
(IIRC atari's ite also had annoying quirks around console probe code..)

   I also noticed that this patch fixes a problem on sparc,
   booting a -current sparc GENERIC + QUEUEDEBUG hits an assertion
   about the alldevs list being corrupted, whereas with this patch
   it's fine.  it wasn't obvious exactly what I fixed though.
  
  Some device_t/softc conversions in sys/arch/sparc/dev/fd.c and
  sys/arch/sparc/sparc/memecc.c seem fatal.
  Good catch :-)
 
 cool, I'll split those out and commit them separately too.
 or you can do it if you'd like.

These are also committed.
(at least no crash on qemu, though it doesn't have these devices)

Thanks!
---
Izumi Tsutsui


Re: call for testing: completing the device/softc split

2012-10-10 Thread Izumi Tsutsui
I wrote:

 ite chnages seem to cause silent hang at early bootstrap at least on XM6i,
 so I'll check it on real X68030 with serial console later.
 (IIRC atari's ite also had annoying quirks around console probe code..)

I've also fixed and committed x68k ite.

It has a static struct softc for early console,
so we should also prepare a static struct device
after it's split out from softc.  Oh well.

(IIRC there were several such devices that required extra struct device
 and softc, like sys/dev/sbus/esp_sbus.c)

---
Izumi Tsutsui 


Re: call for testing: completing the device/softc split

2012-10-10 Thread Matt Thomas

On Oct 10, 2012, at 10:55 AM, Izumi Tsutsui wrote:

 I wrote:
 
 ite chnages seem to cause silent hang at early bootstrap at least on XM6i,
 so I'll check it on real X68030 with serial console later.
 (IIRC atari's ite also had annoying quirks around console probe code..)
 
 I've also fixed and committed x68k ite.
 
 It has a static struct softc for early console,
 so we should also prepare a static struct device
 after it's split out from softc.  Oh well.

No.  Just use a softc size of 0 and set self-dv_private to the static softc in 
the attach and sc-sc_dev to self as well

(we probably should have a device_set_private(device_t, void *) function)

 (IIRC there were several such devices that required extra struct device
 and softc, like sys/dev/sbus/esp_sbus.c)



Re: assumption about a device's maxphys

2012-10-10 Thread Adam Hamsik

On Oct,Wednesday 10 2012, at 2:46 PM, Thor Lancelot Simon t...@panix.com 
wrote:

 On Tue, Oct 09, 2012 at 06:16:44PM -0700, Chuck Silvers wrote:
 
 caching the device maxphys in a file system's struct mount is also
 problematic since a file system may directly access multiple underlying
 devices, as ZFS does, and those devices may each have a different maxphys.
 
 I did think about this at some length.  But our kernel is shot through with
 assumptions that a filesystem lives on a single device, not just where I/O
 is dispatched but in the mount path and elsewhere.  The only short path
 forward would seem to be virtual devices that homogenize the view of such
 things -- and isn't this, actually, what our ZFS implementation does?  A
 zpool is a single device, not many, no?  Doesn't even Solaris do this?

Situation like this is quite common even for LVM where after resize added 
chunk of disk space may resides on other disk which might have different 
maxphys.

 
 I also do not believe it is appropriate for the upper layers (the physio
 code, the pager code, etc.) to have to reach down into the filesystem code
 to understand what size I/O they may do on a per-vnode basis -- or, worse,
 on a per-vnode, per-offset basis.  Per-vnode we could accomplish by moving
 the maxphys value from the mount entry to the vnode, perhaps defaulting it
 from the mount entry at vnode creation time if the filesystem does not want
 to do something more sophisticated, but what if half of a file is on device
 X and half on device Y?  I believe the lower layers should hide these things
 and that at some point a single, consistent value should be presented over
 a larger collection of objects.  I believe the filesystem is the most
 convenient such collection.
 
 Thor

Regards

Adam.