My issues have been resolved. Thanks Mahesh.

-r

Rich.Brown at Sun.COM writes:

 > I'm sponsoring this case on behalf of Mahesh Siddheshwar and Chunli Zhang.
 > This case proposes new interfaces to support copy reduction in the I/O path
 > especially for file sharing services.
 > 
 > Minor binding is requested.
 > 
 > This times out on Wednesday, 16 September, 2009.
 > 
 > 
 > Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
 > This information is Copyright 2009 Sun Microsystems
 > 1. Introduction
 >     1.1. Project/Component Working Name:
 >       Copy Reduction Interfaces
 >     1.2. Name of Document Author/Supplier:
 >       Author:  Mahesh Siddheshwar, Chunli Zhang
 >     1.3  Date of This Document:
 >      09 September, 2009
 > 4. Technical Description
 > 
 >  == Introduction/Background ==
 > 
 >  Zero-copy (copy avoidance) is essentially buffer sharing
 >  among multiple modules that pass data to one another.
 >  This proposal avoids the data copy in the READ/WRITE path 
 >  of filesystems, by providing a mechanism to share data buffers
 >  between the modules. It is intended to be used by network file
 >  sharing services like NFS, CIFS or others.
 > 
 >  Although the buffer sharing can be achieved through a few different
 >  solutions, any such solution must work with File Event Monitors
 >  (FEM monitors)[1] installed on the files. The solution must
 >  allow the underlying filesystem to maintain any existing file 
 >  range locking in the filesystem.
 >  
 >  The proposed solution provides extensions to the existing VOP
 >  interface to request and return buffers from a filesystem. The 
 >  buffers are then used with existing VOP_READ/VOP_WRITE calls with
 >  minimal changes.
 > 
 > 
 >  == Proposed Changes ==
 > 
 >  VOP Extensions for Zero-Copy Support
 >  ========================================
 > 
 >  a. Extended struct uio, xuio_t
 > 
 >   This case proposes an extensible uio structure, xuio_t, that can serve
 >   multiple purposes.  The immediate extension, xu_zc, is to be used by the
 >   proposed VOP_REQZCBUF/VOP_RETZCBUF interfaces to pass loaned zero-copy
 >   buffers, as well as to be passed to the existing VOP_READ/VOP_WRITE calls
 >   for normal read/write operations.  Another extension, xu_aio, is intended
 >   to replace uioa_t for async I/O.
 > 
 >   This new structure, xuio_t, contains the following:
 > 
 >   - the existing uio structure (embedded) as the first member
 >   - additional fields to support extensibility
 >   - a union of all the defined extensions
 > 
 >   The following uio_extflag is added to indicate that a uio structure is
 >   in fact an xuio_t:
 > 
 >   #define    UIO_XUIO        0x004   /* Structure is xuio_t */
 > 
 >   The following uio_extflag will be removed after uioa_t has been converted 
 >   to xuio_t:
 > 
 >   #define    UIO_ASYNC       0x002   /* Structure is uioa_t */
 > 
 >   The project team has commitment from the networking team to remove
 >   the current use of uioa_t and use the proposed extensions (CR 6880095).
 > 
 >   The definition of xuio_t is:
 > 
 >   typedef struct xuio {
 >     uio_t xu_uio;            /* Embedded UIO structure */
 > 
 >     /* Extended uio fields */
 >     enum xuio_type xu_type;  /* What kind of uio structure? */
 > 
 >     union {
 > 
 >      /* Async I/O Support */
 >      struct {
 >             uint32_t xu_a_state;     /* state of async i/o */
 >             ssize_t xu_a_mbytes;     /* bytes that have been uioamove()ed */
 >             uioa_page_t *xu_a_lcur;  /* pointer into uioa_locked[] */
 >             void **xu_a_lppp;        /* pointer into lcur->uioa_ppp[] */
 >             void *xu_a_hwst[4];      /* opaque hardware state */
 >             uioa_page_t xu_a_locked[UIOA_IOV_MAX];   /* Per iov locked pages */
 >      } xu_aio;
 > 
 >      /* Zero Copy Support */
 >      struct {
 >             enum uio_rw xu_zc_rw;    /* the use of the buffer */
 >             void *xu_zc_priv;                /* fs specific */
 >      } xu_zc;
 > 
 >     } xu_ext;
 >   } xuio_t;
 > 
 >   where xu_type is currently defined as:
 > 
 >   typedef enum xuio_type {
 >     UIOTYPE_ASYNCIO,
 >     UIOTYPE_ZEROCOPY
 >   } xuio_type_t;
 > 
 >   New uio extensions can be added by defining a new xuio_type_t value and
 >   adding a new member to the xu_ext union.
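
A minimal userland sketch of why the embedded uio has to be the first member: code that is handed only a uio pointer can test the UIO_XUIO flag and then cast back to the containing structure. The types below are cut-down stand-ins with hypothetical mock_ names, not the kernel definitions; only the flag value 0x004 is taken from the proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel types; only the fields used here. */
#define MOCK_UIO_XUIO 0x004     /* value the proposal gives UIO_XUIO */

typedef struct mock_uio {
    long mu_resid;              /* bytes remaining */
    unsigned mu_extflag;        /* extended flags, e.g. MOCK_UIO_XUIO */
} mock_uio_t;

typedef enum mock_xuio_type {
    MOCK_UIOTYPE_ASYNCIO,
    MOCK_UIOTYPE_ZEROCOPY
} mock_xuio_type_t;

typedef struct mock_xuio {
    mock_uio_t mx_uio;          /* must stay first so the cast below is valid */
    mock_xuio_type_t mx_type;   /* which member of mx_ext is in use */
    union {
        struct { int mx_zc_rw; void *mx_zc_priv; } mx_zc;
    } mx_ext;
} mock_xuio_t;

/* Compile-time check of the layout rule the cast depends on. */
static_assert(offsetof(mock_xuio_t, mx_uio) == 0,
    "embedded uio must be the first member");

/*
 * Return the containing xuio when uio is an extended zero-copy uio,
 * otherwise NULL.
 */
mock_xuio_t *
mock_uio_to_zc_xuio(mock_uio_t *uio)
{
    mock_xuio_t *xuio;

    if ((uio->mu_extflag & MOCK_UIO_XUIO) == 0)
        return (NULL);          /* plain uio: nothing beyond it to look at */
    xuio = (mock_xuio_t *)uio;
    if (xuio->mx_type != MOCK_UIOTYPE_ZEROCOPY)
        return (NULL);          /* extended, but some other extension */
    return (xuio);
}
```

The flag is checked before the cast because a plain uio has no type field behind it to inspect.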
 > 
 >  b. Requesting zero-copy buffers
 > 
 >     #define VOP_REQZCBUF(vp, rwflag, uiozcp, cr, ct) \
 >     fop_reqzcbuf(vp, rwflag, uiozcp, cr, ct)
 > 
 >     int fop_reqzcbuf(vnode_t *, enum uio_rw, xuio_t *, cred_t *,
 >      caller_context_t *);
 >  
 >     This function requests buffers associated with file vp in preparation
 >     for a subsequent zero-copy read or write. The extended uio_t, xuio_t,
 >     is used to pass the parameters and results. Only the following fields
 >     of xuio_t are relevant to this call:
 >  
 >     uiozcp->xu_uio.uio_resid: used by the caller to specify the total length
 >          of the buffer.
 > 
 >     uiozcp->xu_uio.uio_loffset: used by the caller to indicate the file
 >          offset it would like the buffers to be associated with. A value
 >          of -1 indicates that the provider returns buffers that are not
 >          associated with a particular offset.  These are defined to be
 >          anonymous buffers.  Anonymous buffers may be used for requesting
 >          a write buffer to receive data over the wire, where the file
 >          offset might not be readily available.
 > 
 >     uiozcp->xu_uio.uio_iov: used by the provider to return an array of
 >          buffers (in case multiple filesystem buffers have to be reserved
 >          for the requested length).
 > 
 >     uiozcp->xu_uio.uio_iovcnt: used by the provider to indicate the number of
 >          returned buffers (length of array uiozcp->xu_uio.uio_iov).
 > 
 >     Other arguments to the call include:
 > 
 >     vp:  vnode pointer of the associated file.
 > 
 >     rwflag: Indicates what the buffers are to be subsequently used for.
 >             Expected values are UIO_READ for VOP_READ() and UIO_WRITE for
 >             VOP_WRITE().
 > 
 >     Upon successful completion, the function returns 0. One or more
 >     buffers may be returned as referenced by the uio_iov[] and uio_iovcnt
 >     members.  uiozcp->xu_uio.uio_extflag is set to UIO_XUIO, and
 >     uiozcp->xu_type is set to UIOTYPE_ZEROCOPY.
 > 
 >     The caller can use this returned xuio_t in a subsequent call to VOP_READ
 >     or VOP_WRITE. In the case of UIO_READ buffers, the caller should
 >     reference the uio_iov[] buffers only after a successful VOP_READ().
 >     In the case of UIO_WRITE buffers, the caller should not reference
 >     the uio_iov[] buffers after a successful VOP_WRITE.
 > 
 >     In the case of anonymous buffers, the caller should set the value of 
 >     uio_loffset before such a read/write call. This should be done only in 
 >     the case of anonymous buffers. 
 > 
 >     The member xu_zc_priv of the extended uio structure for zero-copy is 
 >     a private handle that may be used by the provider to track its buffer
 >     headers or any other private information that is useful to map the 
 >     loaned iovec entries to its internal buffers. The xu_zc_priv member
 >     is private to the provider and should not be changed or interpreted
 >     in any way by the callers.
 > 
 >     Upon failure, the function returns EINVAL and the content
 >     of uiozcp should be ignored by the callers. The provider must fail the
 >     request if it is unable to satisfy the complete request (i.e. it must
 >     not return buffers that cover only a part of the length that was
 >     asked for).
 > 
 >     Probable causes for failure include:
 > 
 >     - the filesystem is short on buffers to loan out at the time
 >     - the filesystem determines that it's not efficient to take the
 >       zero-copy path based on the input parameters
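
The request contract described above (caller sets uio_resid and uio_loffset, provider fills uio_iov and uio_iovcnt, all-or-nothing loan, EINVAL on failure) can be sketched with a userland mock provider. Everything with a mock_ or MOCK_ name is hypothetical, including the 4096-byte buffer size and the four-buffer loan limit; the xu_ext union is flattened to its xu_zc member for brevity, and malloc() stands in for the filesystem's cache buffers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define MOCK_UIO_XUIO   0x004   /* value the proposal gives UIO_XUIO */
#define MOCK_BUFSZ      4096    /* hypothetical provider buffer size */
#define MOCK_MAXBUFS    4       /* hypothetical per-request loan limit */

typedef struct mock_iovec { void *iov_base; size_t iov_len; } mock_iovec_t;
typedef enum { MOCK_UIO_READ, MOCK_UIO_WRITE } mock_rw_t;
typedef enum { MOCK_UIOTYPE_ASYNCIO, MOCK_UIOTYPE_ZEROCOPY } mock_xtype_t;

typedef struct mock_xuio {
    struct {
        mock_iovec_t *uio_iov;  /* out: array of loaned buffers */
        int uio_iovcnt;         /* out: number of loaned buffers */
        long long uio_loffset;  /* in: file offset, -1 for anonymous */
        long uio_resid;         /* in: total length requested */
        unsigned uio_extflag;   /* out: MOCK_UIO_XUIO on success */
    } xu_uio;
    mock_xtype_t xu_type;
    struct { mock_rw_t xu_zc_rw; void *xu_zc_priv; } xu_zc;
} mock_xuio_t;

/* Loan buffers covering the whole of uio_resid, or fail with EINVAL. */
int
mock_reqzcbuf(mock_rw_t rw, mock_xuio_t *x)
{
    size_t len, nbufs, i;
    mock_iovec_t *iov;

    assert(x != NULL);
    if (x->xu_uio.uio_resid <= 0)
        return (EINVAL);
    len = (size_t)x->xu_uio.uio_resid;
    nbufs = (len + MOCK_BUFSZ - 1) / MOCK_BUFSZ;
    /* A partial loan is not allowed, so an oversized request fails whole. */
    if (nbufs > MOCK_MAXBUFS)
        return (EINVAL);
    if ((iov = calloc(nbufs, sizeof (*iov))) == NULL)
        return (EINVAL);
    for (i = 0; i < nbufs; i++) {
        iov[i].iov_len = (len < MOCK_BUFSZ) ? len : MOCK_BUFSZ;
        if ((iov[i].iov_base = malloc(iov[i].iov_len)) == NULL) {
            while (i-- > 0)     /* undo the buffers already reserved */
                free(iov[i].iov_base);
            free(iov);
            return (EINVAL);
        }
        len -= iov[i].iov_len;
    }
    x->xu_uio.uio_iov = iov;
    x->xu_uio.uio_iovcnt = (int)nbufs;
    x->xu_uio.uio_extflag |= MOCK_UIO_XUIO;
    x->xu_type = MOCK_UIOTYPE_ZEROCOPY;
    x->xu_zc.xu_zc_rw = rw;
    x->xu_zc.xu_zc_priv = iov;  /* provider-private handle, opaque to callers */
    return (0);
}
```

A 10000-byte request in this mock yields three iovec entries (4096, 4096 and 1808 bytes), while a request needing five buffers is refused outright rather than partially satisfied.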
 >     
 >  c. Returning zero-copy buffers
 > 
 >     #define VOP_RETZCBUF(vp, uiozcp, cr, ct) \
 >     fop_retzcbuf(vp, uiozcp, cr, ct)
 > 
 >     int fop_retzcbuf(vnode_t *, xuio_t *, cred_t *, caller_context_t *);
 >  
 >     This function returns the buffers previously obtained via a call
 >     to VOP_REQZCBUF(). In case multiple buffers are associated with the
 >     uio_iov[], all the buffers associated with the uiozcp are returned.
 >     In other words, VOP_RETZCBUF() should only be called once per xuio_t.
 >     The caller should not reference any of the uio_iov[] members after
 >     a return.
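
The once-per-xuio_t return rule can be illustrated with a small userland mock in which the private handle maps all loaned iovec entries back to a single provider allocation. The names mock_loan() and mock_retzcbuf() are hypothetical, and rejecting a second return with EINVAL is an assumption of this sketch; the proposal only says the call should be made once.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct mock_iovec { void *iov_base; size_t iov_len; } mock_iovec_t;

typedef struct mock_xuio {
    mock_iovec_t *uio_iov;      /* loaned buffers */
    int uio_iovcnt;             /* number of loaned buffers */
    void *xu_zc_priv;           /* provider-private handle; NULL when no loan */
} mock_xuio_t;

/* Hypothetical helper standing in for a successful VOP_REQZCBUF(). */
int
mock_loan(mock_xuio_t *x, int nbufs, size_t bufsz)
{
    char *slab;
    int i;

    assert(x != NULL);
    if (nbufs <= 0 || bufsz == 0)
        return (EINVAL);
    slab = malloc((size_t)nbufs * bufsz);
    x->uio_iov = calloc((size_t)nbufs, sizeof (mock_iovec_t));
    if (slab == NULL || x->uio_iov == NULL) {
        free(slab);
        free(x->uio_iov);
        x->uio_iov = NULL;
        return (EINVAL);
    }
    /* Every iovec entry points into the one slab tracked by the handle. */
    for (i = 0; i < nbufs; i++) {
        x->uio_iov[i].iov_base = slab + (size_t)i * bufsz;
        x->uio_iov[i].iov_len = bufsz;
    }
    x->uio_iovcnt = nbufs;
    x->xu_zc_priv = slab;
    return (0);
}

/* Return every buffer of the loan in one call; a repeat call is rejected. */
int
mock_retzcbuf(mock_xuio_t *x)
{
    assert(x != NULL);
    if (x->xu_zc_priv == NULL)
        return (EINVAL);        /* assumption: not specified by the proposal */
    free(x->xu_zc_priv);
    free(x->uio_iov);
    x->xu_zc_priv = NULL;
    x->uio_iov = NULL;          /* callers must not touch uio_iov[] any more */
    x->uio_iovcnt = 0;
    return (0);
}
```

Clearing uio_iov on return makes an accidental later dereference fail fast in this mock; a real provider is free to handle that differently.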
 > 
 >  d. New VFS feature attributes
 > 
 >     A new VFS feature attribute is introduced to advertise support for
 >     the zero-copy interface.
 > 
 >   #define VFSFT_ZEROCOPY_SUPPORTED     0x100000100
 > 
 >    Zero-copy is an optional feature. A filesystem supporting the
 >    zero-copy interface (i.e. the Interface Provider) must set this
 >    VFS feature attribute through the VFS Feature Registration
 >    interface[2]. Callers of the interface (i.e. the Interface Consumer)
 >    must check for support through the vfs_has_feature() interface.
 >    The intermediate fop routines (called via the VOP_* macros) will detect
 >    if the interfaces are being called for a filesystem that does not support
 >    zero-copy and will return ENOTSUP.
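
A rough userland sketch of the feature gate described above: the provider registers the attribute, the consumer can test for it, and the intermediate fop-style routine returns ENOTSUP before reaching a filesystem that never registered it. All mock_ names are hypothetical, and the single 64-bit feature word is a simplification of this sketch, not the real VFS feature set layout; only the attribute value is taken from the proposal text.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Value as given in the proposal text. */
#define MOCK_VFSFT_ZEROCOPY_SUPPORTED 0x100000100ULL

/* Simplification: one 64-bit word holds all registered features. */
typedef struct mock_vfs { uint64_t mv_features; } mock_vfs_t;

typedef struct mock_vnode {
    mock_vfs_t *v_vfsp;                         /* owning filesystem */
    int (*v_reqzcbuf)(struct mock_vnode *);     /* provider's operation */
} mock_vnode_t;

/* Provider side: register a feature on the filesystem. */
void
mock_vfs_set_feature(mock_vfs_t *vfsp, uint64_t feature)
{
    vfsp->mv_features |= feature;
}

/* Consumer side: nonzero when every bit of the feature is registered. */
int
mock_vfs_has_feature(const mock_vfs_t *vfsp, uint64_t feature)
{
    return ((vfsp->mv_features & feature) == feature);
}

/* Intermediate routine: refuse before reaching an unsupporting filesystem. */
int
mock_fop_reqzcbuf(mock_vnode_t *vp)
{
    assert(vp != NULL && vp->v_vfsp != NULL);
    if (!mock_vfs_has_feature(vp->v_vfsp, MOCK_VFSFT_ZEROCOPY_SUPPORTED))
        return (ENOTSUP);
    return (vp->v_reqzcbuf(vp));
}

/* Hypothetical provider operation that always succeeds. */
int
mock_fs_reqzcbuf(mock_vnode_t *vp)
{
    (void) vp;
    return (0);
}
```

A consumer that checks mock_vfs_has_feature() up front can skip the zero-copy setup entirely, while the check inside mock_fop_reqzcbuf() protects callers that did not.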
 > 
 >  INTERFACE TABLE
 >  
 > +=========================================================================+
 >                             |Proposed       |Specified   |
 >                             |Stability      |in what     |
 >   Interface Name            |Classification |Document?   | Comments
 > +=========================================================================+
 >    VOP_REQZCBUF()           |Consolidation  |This        | New VOP calls
 >    fop_reqzcbuf()           |Private        |Document    |
 >    VOP_RETZCBUF()           |               |            |
 >    fop_retzcbuf()           |               |            |
 >                             |               |            |
 >    VFSFT_ZEROCOPY_SUPPORTED |               |            | New VFS feature
 >                             |               |            | definition
 >                             |               |            |
 >    xuio_t                   |               |            | Extended uio_t
 >                             |               |            | definition
 >                             |               |            |
 >    uioa_t                   |               |            | Deprecated
 >    UIO_ASYNC                |               |            | Deprecated
 > +=========================================================================+
 > 
 >  * The project's deliverables will all go into the OS/NET
 >    Consolidation, so no contracts are required.
 > 
 > 
 >  == Using the New VOP Interfaces for Zero-copy ==
 > 
 >  VOP_REQZCBUF()/VOP_RETZCBUF() are expected to be used in conjunction with
 >  VOP_READ() or VOP_WRITE() to implement zero-copy read or write. 
 > 
 >  a. Read
 > 
 >     In a normal read, the consumer allocates the data buffer and passes it to
 >     VOP_READ().  The provider initiates the I/O, and copies the data from its
 >     own cache buffer to the consumer supplied buffer.
 > 
 >     To avoid the copy (initiating a zero-copy read), the consumer first calls
 >     VOP_REQZCBUF() to inform the provider to prepare to loan out its cache
 >     buffer.  It then calls VOP_READ().  After the call returns, the consumer
 >     has direct access to the cache buffer loaned out by the provider.  After
 >     processing the data, the consumer calls VOP_RETZCBUF() to return the
 >     loaned cache buffer to the provider.
 > 
 >     Here is an illustration using NFSv4 read over TCP:
 > 
 >         rfs4_op_read(nfs_argop4 *argop, ...)
 >         {
 >             int zerocopy;
 >             xuio_t *xuio;
 >             ...
 >             xuio = kmem_alloc(sizeof(xuio_t), KM_SLEEP);
 >             setup length, offset, etc;
 >             if (VOP_REQZCBUF(vp, UIO_READ, xuio, cr, &ct)) {
 >                 zerocopy = 0;
 >                 allocate the data buffer the normal way;
 >                 initialize (uio_t *)xuio;
 >             } else {
 >                 /* xuio has been setup by the provider */
 >                 zerocopy = 1;
 >             }
 >             do_io(FREAD, vp, (uio_t *)xuio, 0, cr, &ct);
 >             ...
 >             if (zerocopy) {
 >                 setup callback mechanism that makes the network layer call
 >                 VOP_RETZCBUF() and free xuio after the data is sent out;
 >             } else {
 >                 kmem_free(xuio, sizeof(xuio_t));
 >             }
 >         }
 > 
 >  b. Write
 > 
 >     In a normal write, the consumer allocates the data buffer, loads the
 >     data, and passes the buffer to VOP_WRITE().  The provider copies the data
 >     from the consumer supplied buffer to its own cache buffer, and starts the
 >     I/O.
 > 
 >     To initiate a zero-copy write, the consumer first calls VOP_REQZCBUF() to
 >     grab a cache buffer from the provider.  It loads the data directly to
 >     the loaned cache buffer, and calls VOP_WRITE().  After the call returns,
 >     the consumer calls VOP_RETZCBUF() to return the loaned cache buffer to
 >     the provider.
 > 
 >     Here is an illustration using NFSv4 write via RDMA:
 > 
 >         rfs4_op_write(nfs_argop4 *argop, ...)
 >         {
 >             int zerocopy;
 >             xuio_t *xuio;
 >             ...
 >             xuio = kmem_alloc(sizeof(xuio_t), KM_SLEEP);
 >             setup length, offset, etc;
 >             if (VOP_REQZCBUF(vp, UIO_WRITE, xuio, cr, &ct)) {
 >                 zerocopy = 0;
 >                 allocate the data buffer the normal way;
 >                 initialize (uio_t *)xuio;
 >                 xdrrdma_read_from_client(...);
 >             } else {
 >                 /* xuio has been setup by the provider */
 >                 zerocopy = 1;
 >                 xdrrdma_zcopy_read_from_client(..., xuio);
 >             }
 >             do_io(FWRITE, vp, (uio_t *)xuio, 0, cr, &ct);
 >             ...
 >             if (zerocopy) {
 >                 VOP_RETZCBUF(vp, xuio, cr, &ct);
 >             }
 >             kmem_free(xuio, sizeof(xuio_t));
 >         }
 > 
 > 
 >  References:
 >   [1] PSARC/2003/172 File Event Monitoring 
 >   [2] PSARC/2007/227 VFS Features 
 > 
 > 
 > 6. Resources and Schedule
 >     6.4. Steering Committee requested information
 >      6.4.1. Consolidation C-team Name:
 >              ON
 >     6.5. ARC review type: FastTrack
 >     6.6. ARC Exposure: open
 > 
