Filesystems might have blocksize and alignment constraints
conditioning their ability to loan out buffers (for writes).
If that is so, we could use an API to query the FS about
those values. For a copy-on-write, variable-block-size
filesystem, that natural blocksize might also depend on the
vnode being targeted. Do we know if ZFS will ever be able to
loan out buffers for writes that are not aligned to full records?

-r

Rich.Brown at Sun.COM writes:
 > I'm sponsoring this case on behalf of Mahesh Siddheshwar and Chunli Zhang.
 > This case proposes new interfaces to support copy reduction in the I/O path
 > especially for file sharing services.
 > 
 > Minor binding is requested.
 > 
 > This times out on Wednesday, 16 September, 2009.
 > 
 > 
 > Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
 > This information is Copyright 2009 Sun Microsystems
 > 1. Introduction
 >     1.1. Project/Component Working Name:
 >       Copy Reduction Interfaces
 >     1.2. Name of Document Author/Supplier:
 >       Author:  Mahesh Siddheshwar, Chunli Zhang
 >     1.3  Date of This Document:
 >      09 September, 2009
 > 4. Technical Description
 > 
 >  == Introduction/Background ==
 > 
 >  Zero-copy (copy avoidance) is essentially buffer sharing
 >  among the modules that pass data along an I/O path.
 >  This proposal avoids the data copy in the READ/WRITE path
 >  of filesystems by providing a mechanism to share data buffers
 >  between the modules. It is intended to be used by network file
 >  sharing services such as NFS, CIFS, or others.
 > 
 >  Although the buffer sharing can be achieved through a few different
 >  solutions, any such solution must work with File Event Monitors
 >  (FEM monitors)[1] installed on the files. The solution must
 >  allow the underlying filesystem to maintain any existing file 
 >  range locking in the filesystem.
 >  
 >  The proposed solution provides extensions to the existing VOP
 >  interface to request and return buffers from a filesystem. The 
 >  buffers are then used with existing VOP_READ/VOP_WRITE calls with
 >  minimal changes.
 > 
 > 
 >  == Proposed Changes ==
 > 
 >  VOP Extensions for Zero-Copy Support
 >  ========================================
 > 
 >  a. Extended struct uio, xuio_t
 > 
 >   The following proposes an extensible uio structure that can serve
 >   multiple purposes.  For example, an immediate extension, xu_zc, is to be
 >   used by the proposed VOP_REQZCBUF/VOP_RETZCBUF interfaces to pass loaned
 >   zero-copy buffers, as well as to be passed to the existing
 >   VOP_READ/VOP_WRITE calls for normal read/write operations.  Another
 >   example of an extension, xu_aio, is intended to replace uioa_t for
 >   async I/O.
 > 
 >   This new structure, xuio_t, contains the following:
 > 
 >   - the existing uio structure (embedded) as the first member
 >   - additional fields to support extensibility
 >   - a union of all the defined extensions
 > 
 >   The following uio_extflag is added to indicate that a uio structure is
 >   indeed an xuio_t:
 > 
 >   #define    UIO_XUIO        0x004   /* Structure is xuio_t */
 > 
 >   The following uio_extflag will be removed after uioa_t has been converted 
 >   to xuio_t:
 > 
 >   #define    UIO_ASYNC       0x002   /* Structure is uioa_t */
 > 
 >   The project team has commitment from the networking team to remove
 >   the current use of uioa_t and use the proposed extensions (CR 6880095).
 > 
 >   The definition of xuio_t is:
 > 
 >   typedef struct xuio {
 >     uio_t xu_uio;            /* Embedded UIO structure */
 > 
 >     /* Extended uio fields */
 >     enum xuio_type xu_type;  /* What kind of uio structure? */
 > 
 >     union {
 > 
 >      /* Async I/O Support */
 >      struct {
 >             uint32_t xu_a_state;     /* state of async i/o */
 >             ssize_t xu_a_mbytes;     /* bytes that have been uioamove()ed */
 >             uioa_page_t *xu_a_lcur;  /* pointer into uioa_locked[] */
 >             void **xu_a_lppp;        /* pointer into lcur->uioa_ppp[] */
 >             void *xu_a_hwst[4];      /* opaque hardware state */
 >             uioa_page_t xu_a_locked[UIOA_IOV_MAX]; /* per-iov locked pages */
 >      } xu_aio;
 > 
 >      /* Zero Copy Support */
 >      struct {
 >             enum uio_rw xu_zc_rw;    /* the use of the buffer */
 >             void *xu_zc_priv;                /* fs specific */
 >      } xu_zc;
 > 
 >     } xu_ext;
 >   } xuio_t;
 > 
 >   where xu_type is currently defined as:
 > 
 >   typedef enum xuio_type {
 >     UIOTYPE_ASYNCIO,
 >     UIOTYPE_ZEROCOPY
 >   } xuio_type_t;
 > 
 >   New uio extensions can be added by defining a new xuio_type_t, and adding a
 >   new member to the xu_ext union.
 > 
 >  b. Requesting zero-copy buffers
 > 
 >     #define VOP_REQZCBUF(vp, rwflag, uiozcp, cr, ct) \
 >     fop_reqzcbuf(vp, rwflag, uiozcp, cr, ct)
 > 
 >     int fop_reqzcbuf(vnode_t *, enum uio_rw, xuio_t *, cred_t *,
 >      caller_context_t *);
 >  
 >     This function requests buffers associated with file vp in preparation
 >     for a subsequent zero-copy read or write. The extended uio_t, xuio_t,
 >     is used to pass the parameters and results. Only the following fields
 >     of xuio_t are relevant to this call.
 >  
 >     uiozcp->xu_uio.uio_resid: used by the caller to specify the total
 >          length of the buffer.
 > 
 >     uiozcp->xu_uio.uio_loffset: used by the caller to indicate the file
 >          offset it would like the buffers to be associated with. A value
 >          of -1 indicates that the provider returns buffers that are not
 >          associated with a particular offset. These are defined to be
 >          anonymous buffers. Anonymous buffers may be used when requesting
 >          a write buffer to receive data over the wire, where the file
 >          offset might not be readily available.
 > 
 >     uiozcp->xu_uio.uio_iov: used by the provider to return an array of
 >          buffers (in case multiple filesystem buffers have to be reserved
 >          for the requested length).
 > 
 >     uiozcp->xu_uio.uio_iovcnt: used by the provider to indicate the number
 >          of returned buffers (the length of array uiozcp->xu_uio.uio_iov).
 > 
 >     Other arguments to the call include:
 > 
 >     vp:  vnode pointer of the associated file.
 > 
 >     rwflag: Indicates what the buffers are to be subsequently used for.
 >             Expected values are UIO_READ for VOP_READ() and UIO_WRITE for
 >             VOP_WRITE().
 > 
 >     Upon successful completion, the function returns 0. One or more
 >     buffers may be returned as referenced by the uio_iov[] and uio_iovcnt
 >     members. uiozcp->xu_uio.uio_extflag is set to UIO_XUIO, and
 >     uiozcp->xu_type is set to UIOTYPE_ZEROCOPY.
 > 
 >     The caller can use this returned xuio_t in a subsequent call to VOP_READ
 >     or VOP_WRITE. In the case of UIO_READ buffers, the caller should
 >     reference the uio_iov[] buffers only after a successful VOP_READ().
 >     In the case of UIO_WRITE buffers, the caller should not reference
 >     the uio_iov[] buffers after a successful VOP_WRITE.
 > 
 >     In the case of anonymous buffers, the caller should set the value of 
 >     uio_loffset before such a read/write call. This should be done only in 
 >     the case of anonymous buffers. 
 > 
 >     The member xu_zc_priv of the extended uio structure for zero-copy is
 >     a private handle that may be used by the provider to track its buffer
 >     headers or any other private information that is useful to map the
 >     loaned iovec entries to its internal buffers. The xu_zc_priv member
 >     is private to the provider and should not be changed or interpreted
 >     in any way by the callers.
 > 
 >     Upon failure, the function returns an EINVAL error and the content
 >     of uiozcp should be ignored by the callers. The provider must fail the
 >     request if it is unable to satisfy the complete request (i.e. it must
 >     not return buffers that cover only a part of the length that was
 >     asked for).
 > 
 >     Probable causes for failure include:
 > 
 >     - the filesystem is short on buffers to loan out at the time
 >     - the filesystem determines that it's not efficient to take the
 >       zero-copy path based on the input parameters
 >     
 >  c. Returning zero-copy buffers
 > 
 >     #define VOP_RETZCBUF(vp, uiozcp, cr, ct) \
 >     fop_retzcbuf(vp, uiozcp, cr, ct)
 > 
 >     int fop_retzcbuf(vnode_t *, xuio_t *, cred_t *, caller_context_t *);
 >  
 >     This function returns the buffers previously obtained via a call
 >     to VOP_REQZCBUF(). In case multiple buffers are associated with the
 >     uio_iov[], all the buffers associated with the uiozcp are returned.
 >     In other words, VOP_RETZCBUF() should only be called once per xuio_t.
 >     The caller should not reference any of the uio_iov[] members after
 >     a return.
 > 
 >  d. New VFS feature attributes
 > 
 >     A new VFS feature attribute is introduced for the support of
 >     zero-copy interface.
 > 
 >   #define VFSFT_ZEROCOPY_SUPPORTED     0x100000100
 > 
 >    Zero-copy is an optional feature. A filesystem supporting the
 >    zero-copy interface (i.e. the Interface Provider) must set this
 >    VFS feature attribute through the VFS Feature Registration
 >    interface[2]. Callers of the interface (i.e. the Interface Consumer)
 >    must check for the presence of support through the vfs_has_feature()
 >    interface. The intermediate fop routines (called via the VOP_* macros)
 >    will detect if the interfaces are being called for a filesystem that
 >    does not support zero-copy and will return ENOTSUP.
 > 
 >  INTERFACE TABLE
 > 
 >  +======================================================================+
 >                             |Proposed       |Specified   |
 >                             |Stability      |in what     |
 >   Interface Name            |Classification |Document?   | Comments
 >  +======================================================================+
 >    VOP_REQZCBUF()           |Consolidation  |This        | New VOP calls
 >    fop_reqzcbuf()           |Private        |Document    |
 >    VOP_RETZCBUF()           |               |            |
 >    fop_retzcbuf()           |               |            |
 >                             |               |            |
 >    VFSFT_ZEROCOPY_SUPPORTED |               |            | New VFS feature
 >                             |               |            | definition
 >                             |               |            |
 >    xuio_t                   |               |            | Extended uio_t
 >                             |               |            | definition
 >                             |               |            |
 >    uioa_t                   |               |            | Deprecated
 >    UIO_ASYNC                |               |            | Deprecated
 >  +======================================================================+
 > 
 >  * The project's deliverables will all go into the OS/NET
 >    Consolidation, so no contracts are required.
 > 
 > 
 >  == Using the New VOP Interfaces for Zero-copy ==
 > 
 >  VOP_REQZCBUF()/VOP_RETZCBUF() are expected to be used in conjunction with
 >  VOP_READ() or VOP_WRITE() to implement zero-copy read or write. 
 > 
 >  a. Read
 > 
 >     In a normal read, the consumer allocates the data buffer and passes it to
 >     VOP_READ().  The provider initiates the I/O, and copies the data from its
 >     own cache buffer to the consumer supplied buffer.
 > 
 >     To avoid the copy (initiating a zero-copy read), the consumer first calls
 >     VOP_REQZCBUF() to inform the provider to prepare to loan out its cache
 >     buffer.  It then calls VOP_READ().  After the call returns, the consumer
 >     has direct access to the cache buffer loaned out by the provider.  After
 >     processing the data, the consumer calls VOP_RETZCBUF() to return the
 >     loaned cache buffer to the provider.
 > 
 >     Here is an illustration using NFSv4 read over TCP:
 > 
 >         rfs4_op_read(nfs_argop4 *argop, ...)
 >         {
 >             int zerocopy;
 >             xuio_t *xuio;
 >             ...
 >             xuio = kmem_alloc(sizeof(xuio_t), KM_SLEEP);
 >             setup length, offset, etc;
 >             if (VOP_REQZCBUF(vp, UIO_READ, xuio, cr, &ct)) {
 >                 zerocopy = 0;
 >                 allocate the data buffer the normal way;
 >                 initialize (uio_t *)xuio;
 >             } else {
 >                 /* xuio has been setup by the provider */
 >                 zerocopy = 1;
 >             }
 >             do_io(FREAD, vp, (uio_t *)xuio, 0, cr, &ct);
 >             ...
 >             if (zerocopy) {
 >                 setup callback mechanism that makes the network layer call
 >                 VOP_RETZCBUF() and free xuio after the data is sent out;
 >             } else {
 >                 kmem_free(xuio, sizeof(xuio_t));
 >             }
 >         }
 > 
 >  b. Write
 > 
 >     In a normal write, the consumer allocates the data buffer, loads the
 >     data, and passes the buffer to VOP_WRITE().  The provider copies the
 >     data from the consumer supplied buffer to its own cache buffer, and
 >     starts the I/O.
 > 
 >     To initiate a zero-copy write, the consumer first calls VOP_REQZCBUF() to
 >     grab a cache buffer from the provider.  It loads the data directly to
 >     the loaned cache buffer, and calls VOP_WRITE().  After the call returns,
 >     the consumer calls VOP_RETZCBUF() to return the loaned cache buffer to
 >     the provider.
 > 
 >     Here is an illustration using NFSv4 write via RDMA:
 > 
 >         rfs4_op_write(nfs_argop4 *argop, ...)
 >         {
 >             int zerocopy;
 >             xuio_t *xuio;
 >             ...
 >             xuio = kmem_alloc(sizeof(xuio_t), KM_SLEEP);
 >             setup length, offset, etc;
 >             if (VOP_REQZCBUF(vp, UIO_WRITE, xuio, cr, &ct)) {
 >                 zerocopy = 0;
 >                 allocate the data buffer the normal way;
 >                 initialize (uio_t *)xuio;
 >                 xdrrdma_read_from_client(...);
 >             } else {
 >                 /* xuio has been setup by the provider */
 >                 zerocopy = 1;
 >                 xdrrdma_zcopy_read_from_client(..., xuio);
 >             }
 >             do_io(FWRITE, vp, (uio_t *)xuio, 0, cr, &ct);
 >             ...
 >             if (zerocopy) {
 >                 VOP_RETZCBUF(vp, xuio, cr, &ct);
 >             }
 >             kmem_free(xuio, sizeof(xuio_t));
 >         }
 > 
 > 
 >  References:
 >   [1] PSARC/2003/172 File Event Monitoring 
 >   [2] PSARC/2007/227 VFS Features 
 > 
 > 
 > 6. Resources and Schedule
 >     6.4. Steering Committee requested information
 >      6.4.1. Consolidation C-team Name:
 >              ON
 >     6.5. ARC review type: FastTrack
 >     6.6. ARC Exposure: open
 > 
