Re: [developer] O_DIRECT semantics in ZFS

2020-03-30 Thread Matthew Ahrens via openzfs-developer
On Mon, Mar 30, 2020 at 8:43 PM Richard Laager  wrote:

> On 3/30/20 10:27 PM, Matthew Ahrens via openzfs-developer wrote:
>
> On Mon, Mar 30, 2020 at 7:08 PM Richard Laager  wrote:
>
>> My only personal interest in O_DIRECT is for KVM qemu virtualization. It
>> sounds like I will probably need to set direct=disabled. Alternatively,
>> if I could get all the writes to be 4K-aligned (e.g. by making all the
>> virtual disks 4Kn?), then ZFS's O_DIRECT would work.
>>
>
> We were thinking that qemu *would* be able to use O_DIRECT, or at least it
> wouldn't need direct=disabled.  But I think your assessment implies that
> qemu usually uses O_DIRECT i/o that is not page (4K) aligned
>
> Yes, that was my assumption. Imagine the (likely still typical) case of
> 512B virtual disks. If the guest does a 512B write, is KVM really doing RMW
> to make that 4K? I'm assuming not. This very well may be a faulty
> assumption. I'm not a qemu developer.
>

QEMU could try to do the 512B O_DIRECT write, get an error, and then fall
back to a 512B non-DIRECT write.  Or it could tell you to change your
config to not
use O_DIRECT.  I can't imagine what else would work, given other
filesystems' implementations of O_DIRECT.  Given that we are proposing ZFS
have the same semantics, and we haven't heard of anyone having to tell ext4
to ignore O_DIRECT, I don't imagine you will need to tell ZFS to ignore
O_DIRECT (with direct=disabled) either.
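
For illustration, a minimal userspace sketch of that fallback pattern (this
is not QEMU code; the file name and sizes are made up, and it assumes the
misaligned direct write fails with EINVAL, as it does on other Linux
filesystems):

/* Try the O_DIRECT write first; if the kernel rejects it because of
 * alignment, retry the same write through a normal (buffered) descriptor. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	const char *path = "disk.img";		/* hypothetical image file */
	int dfd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
	int bfd = open(path, O_WRONLY);		/* buffered fallback fd */
	if (dfd < 0 || bfd < 0) {
		perror("open");
		return (1);
	}

	/* A 512-byte guest write at a 512-byte offset: not 4K-aligned. */
	void *buf;
	if (posix_memalign(&buf, 4096, 512) != 0)
		return (1);
	memset(buf, 0xab, 512);

	ssize_t n = pwrite(dfd, buf, 512, 512);
	if (n < 0 && errno == EINVAL) {
		/* Direct write rejected (e.g. misaligned); fall back to the
		 * cached path so the guest's i/o still completes. */
		n = pwrite(bfd, buf, 512, 512);
	}
	printf("wrote %zd bytes\n", n);
	free(buf);
	close(dfd);
	close(bfd);
	return (n == 512 ? 0 : 1);
}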

--matt

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T950b02acdf392290-Mf69b8eb6c10e94b9b6b8592e
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] O_DIRECT semantics in ZFS

2020-03-30 Thread Richard Laager
On 3/30/20 10:27 PM, Matthew Ahrens via openzfs-developer wrote:
> On Mon, Mar 30, 2020 at 7:08 PM Richard Laager wrote:
>
> My only personal interest in O_DIRECT is for KVM qemu virtualization.
> It sounds like I will probably need to set direct=disabled.
> Alternatively, if I could get all the writes to be 4K-aligned (e.g. by
> making all the virtual disks 4Kn?), then ZFS's O_DIRECT would work.
>
>
> We were thinking that qemu *would* be able to use O_DIRECT, or at
> least it wouldn't need direct=disabled.  But I think your assessment
> implies that qemu usually uses O_DIRECT i/o that is not page (4K) aligned

Yes, that was my assumption. Imagine the (likely still typical) case of
512B virtual disks. If the guest does a 512B write, is KVM really doing
RMW to make that 4K? I'm assuming not. This very well may be a faulty
assumption. I'm not a qemu developer.

-- 
Richard


--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T950b02acdf392290-M09e05d22672a2f0e1688853e
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] O_DIRECT semantics in ZFS

2020-03-30 Thread Matthew Ahrens via openzfs-developer
On Mon, Mar 30, 2020 at 7:08 PM Richard Laager  wrote:

> My only personal interest in O_DIRECT is for KVM qemu virtualization. It
> sounds like I will probably need to set direct=disabled. Alternatively,
> if I could get all the writes to be 4K-aligned (e.g. by making all the
> virtual disks 4Kn?), then ZFS's O_DIRECT would work.
>

We were thinking that qemu *would* be able to use O_DIRECT, or at least it
wouldn't need direct=disabled.  But I think your assessment implies that
qemu usually uses O_DIRECT i/o that is not page (4K) aligned, in which case
it would get an error.  AFAIK, all other filesystems that implement
O_DIRECT also fail on non-page-aligned i/o.  So it's surprising that qemu
would expect something other than that.  Maybe I'm missing something here?
I'm not that familiar with KVM/qemu deployments; maybe folks usually do use
4Kn virtual disks?


>
> The rest are some questions for here or the call tomorrow, if you think
> they're worthwhile:
>

Thanks for your questions.  Responses below:


>
> On 3/30/20 5:29 PM, Matthew Ahrens via openzfs-developer wrote:
> > It is also a request to optimize write throughput, even if
> > this causes a large increase in latency of individual write requests.
>
> This was surprising to me. Can you comment on this more? Is this true
> even in scenarios like databases? (I honestly don't know. This is above
> my level of expertise.)
>

The typical O_DIRECT semantics on other filesystems are that a write call
does not return until the i/o to disk completes.  We will be doing the same
with ZFS (for block-aligned I/O).  This gives the filesystem the flexibility
to handle the write with less memory and fewer bcopy()s, since we can use
the user-provided buffer rather than copying it into our own buffer (and
keeping track of it, etc).  Compared to the typical behavior of just copying
the data to memory (assuming the application is not using O_SYNC), the
latency of O_DIRECT is often MUCH worse (milliseconds vs. microseconds).  So
O_DIRECT only makes sense if the application cares much more about
throughput than latency.  It can achieve high throughput with many
concurrent O_DIRECT writes, and/or very large O_DIRECT writes.
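
As a rough illustration of that tradeoff (a sketch only; the file name,
chunk size, and count are arbitrary), a throughput-oriented O_DIRECT writer
issues large, page-aligned writes and accepts that each call blocks until
the data is on disk:

/* Throughput-oriented O_DIRECT writer sketch: large, page-aligned writes;
 * each pwrite() blocks until the i/o completes. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK	(16UL * 1024 * 1024)	/* 16 MiB per write call */
#define CHUNKS	64			/* 1 GiB total */

int
main(void)
{
	int fd = open("bigfile.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
	if (fd < 0) {
		perror("open");
		return (1);
	}

	void *buf;
	if (posix_memalign(&buf, 4096, CHUNK) != 0)	/* page-aligned */
		return (1);
	memset(buf, 0x5a, CHUNK);

	for (unsigned i = 0; i < CHUNKS; i++) {
		/* Offset and length are multiples of 4K, so the direct path
		 * applies.  Each call has high latency, but large writes
		 * (and/or issuing them from multiple threads) keep the
		 * aggregate throughput high. */
		if (pwrite(fd, buf, CHUNK, (off_t)i * CHUNK) != (ssize_t)CHUNK) {
			perror("pwrite");
			return (1);
		}
	}
	free(buf);
	close(fd);
	return (0);
}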


> > For write() system calls, additional performance may be
> > achieved by setting checksum=off and not using compression,
> > encryption, RAIDZ, or mirroring.
>
> Is there a likely use case for this scenario? Databases always come up
> in O_DIRECT discussions, but having to have no redundancy to get the
> most performance is a serious limitation. (Note: I have no idea how
> expensive the one copy is.)
>

I'm not sure.  I could imagine someone comparing ZFS to an alternative
filesystem, where they are using O_DIRECT, and the alternative FS has no
checksumming, redundancy, etc.  And they want ZFS for other reasons (e.g.
snapshots, or combining this workload with others that DO need
checksumming, compression, etc).  This mode would let them get as close as
possible to the performance of an alternative, very lightweight
filesystem.  I know the Lustre folks have measured an impact of this
additional bcopy(), and they are glad that it is not needed for Lustre
(even with checksum=on, because we know Lustre won't modify the buffer
while the write is in progress).
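
As a toy, self-contained illustration of why that copy matters for write(2)
(this is not ZFS code; the checksum here is a trivial stand-in for something
like fletcher4, and all names are made up): the checksum must describe the
exact bytes handed to the disk, so a buffer the caller may still be storing
to has to be copied once first.

/* Copy the caller's buffer before checksumming so that a concurrent
 * modification by another user thread cannot make the checksum disagree
 * with the data that reaches the disk. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Trivial stand-in for a real checksum such as fletcher4. */
static uint64_t
toy_checksum(const uint8_t *buf, size_t len)
{
	uint64_t sum = 0;
	for (size_t i = 0; i < len; i++)
		sum = sum * 31 + buf[i];
	return (sum);
}

/* Returns a private copy of user_buf and its checksum; the copy (not
 * user_buf) is what would then go to compression/RAID/the disk driver. */
static uint8_t *
stable_copy_for_write(const uint8_t *user_buf, size_t len, uint64_t *cksum)
{
	uint8_t *stable = malloc(len);
	if (stable == NULL)
		return (NULL);
	memcpy(stable, user_buf, len);	/* the extra bcopy() in question */
	*cksum = toy_checksum(stable, len);
	return (stable);
}

int
main(void)
{
	uint8_t data[8192] = { 1, 2, 3 };
	uint64_t ck;
	uint8_t *copy = stable_copy_for_write(data, sizeof (data), &ck);
	if (copy == NULL)
		return (1);
	printf("checksum of stable copy: %llu\n", (unsigned long long)ck);
	free(copy);
	return (0);
}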


> > “Always”: acts as though O_DIRECT was always specified
>
> What is the use case for this?
>

If the application is naive (doesn't know about / use O_DIRECT), but the
system administrator knows that the application would benefit from
O_DIRECT.  For example, some versions of "dd" don't have the
iflag=/oflag=direct options.

--matt


--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T950b02acdf392290-Mf56547ab66c4122e394413a8
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] O_DIRECT semantics in ZFS

2020-03-30 Thread Richard Laager
My only personal interest in O_DIRECT is for KVM qemu virtualization. It
sounds like I will probably need to set direct=disabled. Alternatively,
if I could get all the writes to be 4K-aligned (e.g. by making all the
virtual disks 4Kn?), then ZFS's O_DIRECT would work.

The rest are some questions for here or the call tomorrow, if you think
they're worthwhile:

On 3/30/20 5:29 PM, Matthew Ahrens via openzfs-developer wrote:
> It is also a request to optimize write throughput, even if
> this causes a large increase in latency of individual write requests.

This was surprising to me. Can you comment on this more? Is this true
even in scenarios like databases? (I honestly don't know. This is above
my level of expertise.)

> For write() system calls, additional performance may be
> achieved by setting checksum=off and not using compression,
> encryption, RAIDZ, or mirroring.

Is there a likely use case for this scenario? Databases always come up
in O_DIRECT discussions, but having to have no redundancy to get the
most performance is a serious limitation. (Note: I have no idea how
expensive the one copy is.)

> “Always”: acts as though O_DIRECT was always specified

What is the use case for this?

-- 
Richard

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T950b02acdf392290-M3a4efe289df16d2eec23d18f
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


[developer] O_DIRECT semantics in ZFS

2020-03-30 Thread Matthew Ahrens via openzfs-developer
Following our discussion about O_DIRECT at the last OpenZFS Leadership
meeting (video), Mark Maybee, Brian Behlendorf, Brian Atkinson, and I worked
through the exact semantics that O_DIRECT should have for ZFS.  Our proposal
is below, and there are additional details in our design document, including
other options considered, and more reasoning behind the choices made.
Please let us know if you have any questions about this.  We will have an
opportunity to discuss it at tomorrow's meeting as well (9AM Pacific; Zoom).


*Summary of proposed OpenZFS O_DIRECT semantics:*

Broadly speaking, we interpret O_DIRECT as an indication that the user does
not expect to benefit from caching of their data, and that we should try to
improve performance by taking advantage of that expectation.  It is also a
request to optimize write throughput, even if this causes a large
increase in latency of individual write requests.

We see O_DIRECT as a tool for sophisticated applications to get greatly
improved performance for certain workloads, especially very high throughput
workloads (gigabytes per second).  For best performance, knowledge of how
O_DIRECT behaves (on ZFS specifically) may be required. However, even naive
use of O_DIRECT will not violate ZFS’s core principles of data integrity
and ease of use, and should result in improved performance in most
circumstances.

Based on the above principles, we plan to implement the following semantics:


   - Coherence with buffered I/O
      - When a file is accessed with both O_DIRECT and buffered
        (non-O_DIRECT), all readers see the same file contents.
      - I.e. O_DIRECT and buffered accesses are coherent.
   - Reads
      - If the data is already cached in the ARC, or if it’s dirty in the
        DMU, it will be copied from the ARC/DMU.
         - However, this does not count as an access for ARC retention
           purposes (i.e. the data will fall out of the cache as though
           this access did not happen).
      - If access is not page-aligned (4K-aligned), the request will fail
        with an error.
      - The access need not be block-aligned for the i/o to be performed
        directly (bypassing the cache, reading directly into the user
        buffer).  (“block”-aligned meaning dn_datablksz, which is
        controlled by the recordsize property.)  The non-requested part
        of the block will be discarded.  (The above caching behavior
        still applies - if cached we will read from the cache.)
   - Writes
      - If the data is cached in ARC, or if it’s dirty in the DMU, it
        will be discarded from the ARC/DMU, and the write performed
        directly.
      - If access is not page-aligned (4K-aligned), the request will fail
        with an error.
      - If access is not block-aligned, the write will be performed
        buffered (as though O_DIRECT was not specified).  However, if the
        block was not already cached, it will be discarded from the cache
        after the TXG completes (i.e. after it is written to disk by
        spa_sync()).  This ensures that sequential sub-block O_DIRECT
        writes do not have pathologically bad performance.  (A small
        sketch of these alignment rules appears after this list.)
      - The checksum is guaranteed to always be of the data that is
        written to disk.
         - If the access is from another kernel subsystem (e.g. Lustre,
           NFS, iSCSI), we can ensure that the buffer provided is not
           concurrently modified while ZFS is accessing it.  Therefore we
           can send the user’s buffer directly to the checksumming,
           compression, encryption, RAID parity routines and to the disk
           driver, without making a copy into a temporary buffer.
         - However, if the access is via a write() system call, then we
           assume that another user thread could be concurrently
           modifying the buffer (via memory stores).  In this case, if:
            - The checksum is not “off”
            - OR compression is not “off”
            - OR encryption is not “off”
            - OR RAIDZ/DRAID is used
            - OR mirroring is used
            - THEN, we will make a temporary copy of the buffer to ensure
              that it is not modified between when the data is read by
              checksumming/compression/RAID and when it is written to
              disk.
         - For write() system calls, additional performance may be
           achieved by setting checksum=off and not using compression,
           encryption, RAIDZ, or mirroring.

   - O_SYNC and O_DIRECT are orthogonal (i.e. O_DIRECT does not imply that
     the data is persistent on disk).
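
To make the alignment rules above concrete, here is a small illustrative
sketch of the proposed write-path decision.  This is hypothetical code, not
the implementation: the names, the constants, and the choice to treat the
page-alignment failure like an EINVAL-style rejection are all assumptions
for illustration only.

/* Hypothetical sketch of the proposed O_DIRECT write-path decision.
 * Not the real ZFS code; names and constants are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE_B	4096	/* page alignment required for O_DIRECT */

typedef enum {
	DIRECT_REJECT,		/* not page-aligned: fail with an error */
	DIRECT_FALLBACK,	/* page- but not block-aligned: buffered */
	DIRECT_OK		/* block-aligned: perform the write directly */
} direct_decision_t;

static int
aligned(uint64_t x, uint64_t a)
{
	return ((x % a) == 0);
}

/* blksz is the file's block size (dn_datablksz, set via recordsize). */
static direct_decision_t
classify_direct_write(uint64_t offset, uint64_t length, uint64_t blksz)
{
	if (!aligned(offset, PAGE_SIZE_B) || !aligned(length, PAGE_SIZE_B))
		return (DIRECT_REJECT);
	if (!aligned(offset, blksz) || !aligned(length, blksz))
		return (DIRECT_FALLBACK);
	return (DIRECT_OK);
}

int
main(void)
{
	/* 512B write at offset 512, recordsize=128K: rejected (0). */
	printf("%d\n", classify_direct_write(512, 512, 131072));
	/* 4K write at offset 4K, recordsize=128K: buffered fallback (1). */
	printf("%d\n", classify_direct_write(4096, 4096, 131072));
	/* 128K write at offset 0, recordsize=128K: direct (2). */
	printf("%d\n", classify_direct_write(0, 131072, 131072));
	return (0);
}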

[developer] Second March OpenZFS Leadership Meeting

2020-03-30 Thread Matthew Ahrens via openzfs-developer
The next OpenZFS Leadership meeting will be held tomorrow, March 31,
9am-10am Pacific time.  We have several interesting topics on the agenda
for tomorrow's meeting:


   - Add “zstream redup” utility; remove “zfs send --dedup” (Matt)
   - Add O_DIRECT support: design update (Matt)
   - New API, higher-level than libzfs
   - Changes that need reviewers:
      - Persistent L2ARC - https://github.com/openzfs/zfs/pull/9582
      - Introduction of ZSTD compr. to ZFS -
        https://github.com/openzfs/zfs/pull/9735
      - Dedup DDT load - https://github.com/zfsonlinux/zfs/pull/9464/


Everyone is welcome to attend and participate, and we will try to keep the
meeting on agenda and on time.  The meetings will be held online via Zoom,
and recorded and posted to the website and YouTube after the meeting.

The agenda for the meeting will be a discussion of the projects listed in
the agenda doc.

For more information and details on how to attend, as well as notes and
video from the previous meeting, please see the agenda document:

https://docs.google.com/document/d/1w2jv2XVYFmBVvG1EGf-9A5HBVsjAYoLIFZAnWHhV-BM/edit

--matt

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/Tf086272f011d19ed-M6175fece0a7412fff23c4e9c
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription