Re: [zfs-discuss] Re: ZFS direct IO

2007-01-24 Thread Roch - PAE
[EMAIL PROTECTED] writes:
   Note also that for most applications, the size of their IO operations
   would often not match the current page size of the buffer, causing
   additional performance and scalability issues.
  
  Thanks for mentioning this, I forgot about it.
  
  Since ZFS's default block size is configured to be larger than a page,
  the application would have to issue page-aligned block-sized I/Os.
  Anyone adjusting the block size would presumably be responsible for
  ensuring that the new size is a multiple of the page size.  (If they
  would want Direct I/O to work...)
  
  I believe UFS also has a similar requirement, but I've been wrong
  before.
  

I believe the UFS requirement is that the I/O be sector
aligned for DIO to be attempted. And Anton did mention that
one of the benefits of DIO is the ability to direct-read a
subpage block. Without UFS DIO the OS is required to read and
cache the full page, and the extra amount of I/O may lead to
data channel saturation (I don't see latency as an issue
here, right?).

This is where I said that such a feature would translate
for ZFS into the ability to read parts of a filesystem block,
which would only make sense if checksums are disabled.

And for RAID-Z that could mean avoiding I/Os to every disk but
one in a group, so that's a nice benefit.
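
(A back-of-the-envelope illustration with assumed numbers: with a 128K
recordsize on a 4+1 RAID-Z group, each data disk holds 32K of the block, so
an uncached 4K read touches all four data disks and moves the full 128K; with
checksums disabled and subblock reads allowed, it could touch one disk and
move only 4K, assuming the 4K lies within a single column.)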

So for the performance-minded customer that can't afford
mirroring, is not much of a fan of data integrity, and needs
to do subblock reads against an uncacheable workload, I can
see such a feature popping up. And this feature is independent
of whether or not the data is DMA'ed straight into the user
buffer.

The other feature is to avoid a bcopy by DMAing full
filesystem-block reads straight into the user buffer (and
verifying the checksum afterward). The I/O is high latency;
the bcopy adds a small amount on top. The kernel memory can be
freed/reused straight after the user read completes. This is
where I ask: how much CPU is lost to the bcopy in workloads
that benefit from DIO?
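
One rough way to put a number on that -- just a sketch using the DTrace
profile provider to sample kernel PCs while the workload runs:

    # Sample on-CPU kernel functions ~997 times/sec for 30 seconds and print
    # the 20 hottest; the share of samples landing in bcopy approximates the
    # CPU fraction the extra copy costs.
    dtrace -n '
        profile-997 /arg0/ { @hot[func(arg0)] = count(); }
        tick-30s { trunc(@hot, 20); exit(0); }'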

At this point, there are lots of projects that will lead to
performance improvements.  The DIO benefits seem like small
change in the context of ZFS.

The quickest return on investment I see for the directio
hint would be to tell ZFS to not grow the ARC when servicing
such requests.


-r





Re: [zfs-discuss] Re: ZFS direct IO

2007-01-24 Thread Jonathan Edwards


On Jan 24, 2007, at 06:54, Roch - PAE wrote:


[EMAIL PROTECTED] writes:
Note also that for most applications, the size of their IO operations
would often not match the current page size of the buffer, causing
additional performance and scalability issues.

Thanks for mentioning this, I forgot about it.

Since ZFS's default block size is configured to be larger than a page,
the application would have to issue page-aligned block-sized I/Os.
Anyone adjusting the block size would presumably be responsible for
ensuring that the new size is a multiple of the page size.  (If they
would want Direct I/O to work...)

I believe UFS also has a similar requirement, but I've been wrong
before.



I believe the UFS requirement is that the I/O be sector
aligned for DIO to be attempted. And Anton did mention that
one of the benefits of DIO is the ability to direct-read a
subpage block. Without UFS DIO the OS is required to read and
cache the full page, and the extra amount of I/O may lead to
data channel saturation (I don't see latency as an issue
here, right?).


In QFS there are mount options to do automatic type switching
depending on whether or not the IO is sector aligned.  You
essentially set a trigger to switch to DIO if you receive a tunable
number of well-aligned IO requests.  This helps tremendously in
certain streaming workloads (particularly write) to reduce overhead.
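
For reference, the switch is driven by mount parameters roughly like these
(option names and units from memory -- check the QFS mount_samfs(1M) man page
for the real list and defaults):

    # switch a file to direct I/O after 3 consecutive well-aligned reads or
    # writes of at least 256K each, and back to paged I/O otherwise
    mount -F samfs -o dio_rd_consec=3,dio_rd_form_min=256,dio_wr_consec=3,dio_wr_form_min=256 samfs1 /qfs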


This is where I said that such a feature would translate
for ZFS into the ability to read parts of a filesystem block
which would only make sense if checksums are disabled.


Would it be possible to do checksums a posteriori? .. I suspect that
the checksum portion of the transaction may not be atomic, though,
and this leads us back towards the older notion of a DIF.


And for RAID-Z that could mean avoiding I/Os to every disk but
one in a group, so that's a nice benefit.

So for the performance-minded customer that can't afford
mirroring, is not much of a fan of data integrity, and needs
to do subblock reads against an uncacheable workload, I can
see such a feature popping up. And this feature is independent
of whether or not the data is DMA'ed straight into the user
buffer.


Certain streaming write workloads that are time dependent can
fall into this category .. if I'm doing a DMA read directly from a
device's buffer that I'd like to stream, I probably want to avoid
some of the caching layers of indirection that will probably impose
more overhead.

The idea behind allowing an application to advise the filesystem
of how it plans on doing its IO (or the state of its own cache or
buffers or stream requirements) is to prevent the one-cache-fits-all
sort of approach that we currently seem to have in the ARC.


The other feature is to avoid a bcopy by DMAing full
filesystem-block reads straight into the user buffer (and
verifying the checksum afterward). The I/O is high latency;
the bcopy adds a small amount on top. The kernel memory can be
freed/reused straight after the user read completes. This is
where I ask: how much CPU is lost to the bcopy in workloads
that benefit from DIO?


But isn't the cost more than just the bcopy?  Isn't there additional
overhead in the TLB/PTE from the page invalidation that needs
to occur when you do actually go to write the page out or flush
the page?


At this point, there are lots of projects that will lead to
performance improvements.  The DIO benefits seem like small
change in the context of ZFS.

The quickest return on investment I see for the directio
hint would be to tell ZFS to not grow the ARC when servicing
such requests.


How about the notion of multiple ARCs that could be referenced
or fine tuned for various types of IO workload profiles to provide a
more granular approach?  Wouldn't this also keep the page tables
smaller and hopefully more contiguous for atomic operations? Not
sure what this would break ..

.je


Re: [zfs-discuss] Re: ZFS direct IO

2007-01-24 Thread johansen-osdev
 And this feature is independent of whether or not the data is
 DMA'ed straight into the user buffer.

I suppose so; however, it seems like it would make more sense to
configure a dataset property that specifically describes the caching
policy that is desired.  When directio implies different semantics for
different filesystems, customers are going to get confused.
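
Something like the following is what I have in mind -- purely illustrative,
since no such property exists today and the property name and values are
invented:

    # hypothetical dataset-level cache policy
    zfs set cachepolicy=none tank/scratch     # don't cache this dataset's data
    zfs set cachepolicy=metadata tank/db      # keep only metadata in the ARC
    zfs set cachepolicy=all tank/home         # today's default behavior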

 The other feature is to avoid a bcopy by DMAing full
 filesystem-block reads straight into the user buffer (and
 verifying the checksum afterward). The I/O is high latency;
 the bcopy adds a small amount on top. The kernel memory can be
 freed/reused straight after the user read completes. This is
 where I ask: how much CPU is lost to the bcopy in workloads
 that benefit from DIO?

Right, except that if we try to DMA into user buffers with ZFS there are a
bunch of other things we need the VM to do on our behalf to protect the
integrity of the kernel data that's living in user pages.  Assume you
have a high-latency I/O and you've locked some user pages for this I/O.
In a pathological case, when another thread tries to access the locked
pages, it also blocks, and it does so for the duration of the first
thread's I/O.  At that point, it seems like it might be easier to accept
the cost of the bcopy instead of blocking another thread.

I'm not even sure how to assess the impact of VM operations required to
change the permissions on the pages before we start the I/O.

 The quickest return on investment I see for the directio
 hint would be to tell ZFS to not grow the ARC when servicing
 such requests.

Perhaps if we had an option that specifies not to cache data from a
particular dataset, that would suffice.  I think you've filed a CR along
those lines already (6429855)?

-j


Re: [zfs-discuss] Re: ZFS direct IO

2007-01-23 Thread Jonathan Edwards

Roch

I've been chewing on this for a little while and had some thoughts.

On Jan 15, 2007, at 12:02, Roch - PAE wrote:



Jonathan Edwards writes:


On Jan 5, 2007, at 11:10, Anton B. Rang wrote:


DIRECT IO is a set of performance optimisations to circumvent
shortcomings of a given filesystem.


Direct I/O as generally understood (i.e. not UFS-specific) is an
optimization which allows data to be transferred directly between
user data buffers and disk, without a memory-to-memory copy.

This isn't related to a particular file system.



true .. directio(3) is generally used in the context of *any* given
filesystem to advise it that an application buffer to system buffer
copy may get in the way or add additional overhead (particularly if
the filesystem buffer is doing additional copies.)  You can also look
at it as a way of reducing more layers of indirection particularly if
I want the application overhead to be higher than the subsystem
overhead.  Programmatically .. less is more.


Direct IO makes good sense when the target disk sectors are
set a priori. But in the context of ZFS, would you rather
have 10 direct disk I/Os, or 10 bcopies and 2 I/Os (say that
were possible)?


Sure, but in a well-designed filesystem this is essentially the
same as efficient buffer cache utilization .. coalescing IO
operations to commit on a more efficient and larger disk
allocation unit.  However, paged IO (and in particular ZFS
paged IO) is probably a little more than simply a bcopy()
in comparison to Direct IO (at least in the QFS context).


As for read, I  can see that when  the load is cached in the
disk array and we're running  100% CPU, the extra copy might
be noticeable. Is this the   situation that longs for DIO  ?
What % of a system is spent in the copy  ? What is the added
latency that comes from the copy ? Is DIO the best way to
reduce the CPU cost of ZFS ?


To achieve maximum IO rates (in particular if you have a flexible
blocksize and know the optimal stripe width for the best raw disk
or array logical volume performance) you're going to do much
better if you don't have to pass through buffered IO strategies
with the added latencies and kernel space dependencies.

Consider the case where you're copying or replicating from one
disk device to another in a one-time shot.  There's tremendous
advantage in bypassing the buffer and reading and writing full
stripe passes.  The additional buffer copy is also going to add
latency and affect your run queue, particularly if you're working
on a shared system as the buffer cache might get affected by
memory pressure, kernel interrupts, or other applications.
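
A minimal sketch of that one-shot case (the device names and the 1 MB
full-stripe size are assumptions) -- using the raw character devices keeps
the page cache out of the data path entirely:

    # copy one LUN to another in full-stripe-sized chunks
    dd if=/dev/rdsk/c1t0d0s2 of=/dev/rdsk/c2t0d0s2 bs=1024k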

Another common case could be line speed network data capture
if the frame size is already well aligned for the storage device.
Being able to attach one device to another with minimal kernel
intervention should be seen as an advantage for a wide range
of applications that need to stream data from device A to device
B and already know more than you might about both devices.


The  current Nevada  code base  has  quite nice  performance
characteristics  (and  certainly   quirks); there are   many
further efficiency gains to be reaped from ZFS. I just don't
see DIO on top of  that list for now.   Or at least  someone
needs to  spell out what  is ZFS/DIO and  how much better it
is expected to be (back of the envelope calculation accepted).


The real benefit is measured more in terms of memory consumption
for a given application and the type of balance between application
memory space and filesystem memory space.  When the filesystem
imposes more pressure on the application due to its mapping, you're
really measuring the impact of doing an application buffer read and
copy for each write.  In other words you're imposing more of a limit
on how the application should behave with respect to its notion of
the storage device.

DIO should not be seen as a catchall for the notion of "more
efficiency will be gotten by bypassing the filesystem buffers" but
rather as "please don't buffer this since you might push back on
me and I don't know if I can handle a push back" advice.


Reading RAID-Z subblocks on filesystems that have checksum
disabled might be interesting.  That would avoid some disk
seeks.  Whether to serve the subblocks directly or not is a
separate matter; it's a small deal compared to the feature
itself.  How about disabling the DB checksum (it can't fix
the block anyway) and doing mirroring?


Basically speaking - there needs to be some sort of strategy for
bypassing the ARC or even parts of the ARC for applications that
may need to advise the filesystem of either:
1) the delicate nature of imposing additional buffering on their
data flow, or
2) the fact that they are already well-optimized and need the adaptive
cache to live in the application instead of the underlying filesystem
or volume manager.

---
.je

Re: [zfs-discuss] Re: ZFS direct IO

2007-01-23 Thread johansen-osdev
 Basically speaking - there needs to be some sort of strategy for
 bypassing the ARC or even parts of the ARC for applications that
 may need to advise the filesystem of either:
 1) the delicate nature of imposing additional buffering on their
 data flow, or
 2) the fact that they are already well-optimized and need the adaptive
 cache to live in the application instead of the underlying filesystem
 or volume manager.

This advice can't be sensibly delivered to ZFS via a Direct I/O
mechanism.  Anton's characterization of Direct I/O as "an optimization
which allows data to be transferred directly between user data buffers
and disk, without a memory-to-memory copy" is concise and accurate.
Trying to intuit advice from this is unlikely to be useful.  It would be
better to develop a separate mechanism for delivering advice about the
application to the filesystem.  (fadvise, perhaps?)
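
A sketch of what that could look like from the application side, using the
POSIX advice call (posix_fadvise(3C), where available) -- whether Solaris and
ZFS would act on any of these hints is exactly the open question:

    /* Advice, not a data path: the filesystem remains free to ignore it. */
    #include <fcntl.h>

    void advise_streaming_read(int fd)
    {
        /* "I'll read this sequentially; prefetch if you like." */
        (void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        /* "I won't reuse the data; don't keep it cached on my account." */
        (void) posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
    }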

A DIO implementation for ZFS is more complicated than UFS's and adversely
impacts well-optimized applications.

I looked into this late last year when we had a customer who was
suffering from too much bcopy overhead.  Billm found another workaround
instead of bypassing the ARC.

The challenge in implementing DIO for ZFS is dealing with access to
the pages mapped by the user application.  Since ZFS has to checksum all
of its data, the user's pages that are involved in the direct I/O cannot
be written to by another thread during the I/O.  If this policy isn't
enforced, it is possible for the data written to or read from disk not to
match its checksum.

In order to protect the user pages while a DIO is in progress, we want
support from the VM that isn't presently implemented.  To prevent a page
from being accessed by another thread, we have to unmap the TLB/PTE
entries and lock the page.  There's a cost associated with this, as it
may be necessary to cross-call other CPUs.  Any thread that accesses the
locked pages will block.  While it's possible to lock pages in the VM
today, there isn't a neat set of interfaces the filesystem can use to
maintain the integrity of the user's buffers.  Without an experimental
prototype to verify the design, it's impossible to say whether overhead
of manipulating the page permissions is more than the cost of bypassing
the cache.

What do you see as potential use cases for ZFS Direct I/O?  I'm having a
hard time imagining a situation in which this would be useful to a
customer.  The application would probably have to be single-threaded,
and if not, it would have to be pretty careful about how its threads
access buffers involved in I/O.

-j


Re: [zfs-discuss] Re: ZFS direct IO

2007-01-23 Thread Bart Smaalders

[EMAIL PROTECTED] wrote:

In order to protect the user pages while a DIO is in progress, we want
support from the VM that isn't presently implemented.  To prevent a page
from being accessed by another thread, we have to unmap the TLB/PTE
entries and lock the page.  There's a cost associated with this, as it
may be necessary to cross-call other CPUs.  Any thread that accesses the
locked pages will block.  While it's possible to lock pages in the VM
today, there isn't a neat set of interfaces the filesystem can use to
maintain the integrity of the user's buffers.  Without an experimental
prototype to verify the design, it's impossible to say whether overhead
of manipulating the page permissions is more than the cost of bypassing
the cache.


Note also that for most applications, the size of their IO operations
would often not match the current page size of the buffer, causing
additional performance and scalability issues.

- Bart


--
Bart Smaalders  Solaris Kernel Performance
[EMAIL PROTECTED]   http://blogs.sun.com/barts


Re: [zfs-discuss] Re: ZFS direct IO

2007-01-23 Thread johansen-osdev
 Note also that for most applications, the size of their IO operations
 would often not match the current page size of the buffer, causing
 additional performance and scalability issues.

Thanks for mentioning this, I forgot about it.

Since ZFS's default block size is configured to be larger than a page,
the application would have to issue page-aligned block-sized I/Os.
Anyone adjusting the block size would presumably be responsible for
ensuring that the new size is a multiple of the page size.  (If they
would want Direct I/O to work...)

I believe UFS also has a similar requirement, but I've been wrong
before.
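
As a concrete sketch of what that requirement means for the application
(the 128K recordsize is assumed rather than queried, and error handling is
trimmed):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLKSZ   (128 * 1024)            /* assumed ZFS recordsize */

    int main(void)
    {
        void *buf;
        long pg = sysconf(_SC_PAGESIZE);

        /* buffer aligned on a page boundary, request sized to the block */
        if (posix_memalign(&buf, (size_t)pg, BLKSZ) != 0)
            return 1;

        int fd = open("/tank/fs/datafile", O_RDONLY);
        if (fd == -1)
            return 1;

        /* offset and length are both multiples of the block size */
        ssize_t n = pread(fd, buf, BLKSZ, 0);
        (void) printf("read %ld bytes\n", (long)n);

        (void) close(fd);
        free(buf);
        return 0;
    }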

-j



Re: [zfs-discuss] Re: ZFS direct IO

2007-01-15 Thread Roch - PAE

Jonathan Edwards writes:
  
  On Jan 5, 2007, at 11:10, Anton B. Rang wrote:
  
   DIRECT IO is a set of performance optimisations to circumvent  
   shortcomings of a given filesystem.
  
   Direct I/O as generally understood (i.e. not UFS-specific) is an  
   optimization which allows data to be transferred directly between  
   user data buffers and disk, without a memory-to-memory copy.
  
   This isn't related to a particular file system.
  
  
  true .. directio(3) is generally used in the context of *any* given  
  filesystem to advise it that an application buffer to system buffer  
  copy may get in the way or add additional overhead (particularly if  
  the filesystem buffer is doing additional copies.)  You can also look  
  at it as a way of reducing more layers of indirection particularly if  
  I want the application overhead to be higher than the subsystem  
  overhead.  Programmatically .. less is more.

Direct IO makes good sense when the target disk sectors are
set a priori. But in the context of ZFS, would you rather
have 10 direct disk I/Os, or 10 bcopies and 2 I/Os (say that
were possible)?

As for read, I can see that when the load is cached in the
disk array and we're running at 100% CPU, the extra copy might
be noticeable. Is this the situation that calls for DIO?
What % of a system is spent in the copy? What is the added
latency that comes from the copy? Is DIO the best way to
reduce the CPU cost of ZFS?

The current Nevada code base has quite nice performance
characteristics (and certainly quirks); there are many
further efficiency gains to be reaped from ZFS. I just don't
see DIO at the top of that list for now. Or at least someone
needs to spell out what ZFS/DIO is and how much better it
is expected to be (back-of-the-envelope calculation accepted).

Reading RAID-Z subblocks on filesystems that have checksum
disabled might be interesting.  That would avoid some disk
seeks.  Whether to serve the subblocks directly or not is a
separate matter; it's a small deal compared to the feature
itself.  How about disabling the DB checksum (it can't fix
the block anyway) and doing mirroring?

-r




Re: [zfs-discuss] Re: ZFS direct IO

2007-01-15 Thread Jason J. W. Williams

Hi Roch,

You mentioned improved ZFS performance in the latest Nevada build (60
right now?)... I was curious whether one would notice much of a performance
improvement between 54 and 60? Also, does anyone think the zfs_arc_max
tunable support will be made available as a patch to S10U3, or would
that wait until U4? Thank you in advance!
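
(For anyone unfamiliar with the tunable, the usual form once it is available
is a single /etc/system line; the 2 GB cap below is only an example value:)

    * cap the ZFS ARC at 2 GB (value in bytes)
    set zfs:zfs_arc_max = 0x80000000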

Best Regards,
Jason



Re: [zfs-discuss] Re: ZFS direct IO

2007-01-05 Thread Jonathan Edwards


On Jan 5, 2007, at 11:10, Anton B. Rang wrote:

DIRECT IO is a set of performance optimisations to circumvent  
shortcomings of a given filesystem.


Direct I/O as generally understood (i.e. not UFS-specific) is an  
optimization which allows data to be transferred directly between  
user data buffers and disk, without a memory-to-memory copy.


This isn't related to a particular file system.



True .. directio(3C) is generally used in the context of *any* given
filesystem to advise it that an application-buffer-to-system-buffer
copy may get in the way or add additional overhead (particularly if
the filesystem buffer is doing additional copies).  You can also look
at it as a way of removing layers of indirection, particularly if
I want the application overhead to be higher than the subsystem
overhead.  Programmatically .. less is more.
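
By way of illustration, the advice call itself is tiny -- a minimal sketch
(the file path is made up, and note that ZFS does not implement this advice
today):

    #include <sys/types.h>
    #include <sys/fcntl.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(void)
    {
        int fd = open("/ufs/bigfile", O_RDONLY);
        if (fd == -1) {
            perror("open");
            return 1;
        }
        /* advise the filesystem to move data directly to/from user buffers */
        if (directio(fd, DIRECTIO_ON) != 0)
            perror("directio");
        /* ... reads issued here may bypass the page cache ... */
        (void) close(fd);
        return 0;
    }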
