Are there instances where an assigned zero-copy buffer could be orphaned?
If so, should there be a recovery list associated with this addition,
perhaps off the designated vnode?

This comment shouldn't block fast-track approval. Just a question.
--
Rick

On 09/ 9/09 04:02 PM, Rich.Brown at Sun.COM wrote:
> I'm sponsoring this case on behalf of Mahesh Siddheshwar and Chunli Zhang.
> This case proposes new interfaces to support copy reduction in the I/O path
> especially for file sharing services.
>
> Minor binding is requested.
>
> This times out on Wednesday, 16 September, 2009.
>
>
> Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
> This information is Copyright 2009 Sun Microsystems
> 1. Introduction
>     1.1. Project/Component Working Name:
>        Copy Reduction Interfaces
>     1.2. Name of Document Author/Supplier:
>        Author:  Mahesh Siddheshwar, Chunli Zhang
>     1.3  Date of This Document:
>       09 September, 2009
> 4. Technical Description
>
>  == Introduction/Background ==
>
>  Zero-copy (copy avoidance) is essentially buffer sharing
>  among multiple modules that pass data between the modules. 
>  This proposal avoids the data copy in the READ/WRITE path 
>  of filesystems, by providing a mechanism to share data buffers
>  between the modules. It is intended to be used by network file
>  sharing services like NFS, CIFS or others.
>
>  Although the buffer sharing can be achieved through a few different
>  solutions, any such solution must work with File Event Monitors
>  (FEM monitors)[1] installed on the files. The solution must
>  allow the underlying filesystem to maintain any existing file 
>  range locking in the filesystem.
>  
>  The proposed solution provides extensions to the existing VOP
>  interface to request and return buffers from a filesystem. The 
>  buffers are then used with existing VOP_READ/VOP_WRITE calls with
>  minimal changes.
>
>
>  == Proposed Changes ==
>
>  VOP Extensions for Zero-Copy Support
>  ========================================
>
>  a. Extended struct uio, xuio_t
>
>   The following proposes an extensible uio structure that can be extended for
>   multiple purposes.  For example, an immediate extension, xu_zc, is to be 
>   used by the proposed VOP_REQZCBUF/VOP_RETZCBUF interfaces to pass loaned
>   zero-copy buffers, as well as to be passed to the existing
>   VOP_READ/VOP_WRITE calls for normal read/write operations.  Another
>   example of extension, xu_aio, is intended to replace uioa_t for async I/O.
>
>   This new structure, xuio_t, contains the following:
>
>   - the existing uio structure (embedded) as the first member
>   - additional fields to support extensibility
>   - a union of all the defined extensions
>
>   The following uio_extflag is added to indicate that a uio structure is
>   in fact an xuio_t:
>
>   #define     UIO_XUIO        0x004   /* Structure is xuio_t */
>
>   The following uio_extflag will be removed after uioa_t has been converted 
>   to xuio_t:
>
>   #define     UIO_ASYNC       0x002   /* Structure is uioa_t */
>
>   The project team has commitment from the networking team to remove
>   the current use of uioa_t and use the proposed extensions (CR 6880095).
>
>   The definition of xuio_t is:
>
>   typedef struct xuio {
>     uio_t xu_uio;             /* Embedded UIO structure */
>
>     /* Extended uio fields */
>     enum xuio_type xu_type;   /* What kind of uio structure? */
>
>     union {
>
>       /* Async I/O Support */
>       struct {
>             uint32_t xu_a_state;      /* state of async i/o */
>             ssize_t xu_a_mbytes;      /* bytes that have been uioamove()ed */
>             uioa_page_t *xu_a_lcur;   /* pointer into uioa_locked[] */
>             void **xu_a_lppp;         /* pointer into lcur->uioa_ppp[] */
>             void *xu_a_hwst[4];               /* opaque hardware state */
>             /* Per iov locked pages */
>             uioa_page_t xu_a_locked[UIOA_IOV_MAX];
>       } xu_aio;
>
>       /* Zero Copy Support */
>       struct {
>             enum uio_rw xu_zc_rw;     /* the use of the buffer */
>             void *xu_zc_priv;         /* fs specific */
>       } xu_zc;
>
>     } xu_ext;
>   } xuio_t;
>
>   where xu_type is currently defined as:
>
>   typedef enum xuio_type {
>     UIOTYPE_ASYNCIO,
>     UIOTYPE_ZEROCOPY
>   } xuio_type_t;
>
>   New uio extensions can be added by defining a new xuio_type_t value and
>   adding a new member to the xu_ext union.
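
Because the uio_t is embedded as the first member, code that is handed a
plain uio_t pointer can test uio_extflag and, if UIO_XUIO is set, treat the
pointer as an xuio_t. The following is only a user-space sketch of that
idea: the trimmed uio_t, the omission of xu_aio, and the helper name
uio_to_xuio() are assumptions made for illustration and are not part of the
proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Trimmed stand-in for the kernel uio_t; only the fields used here. */
typedef struct uio {
    long long uio_loffset;   /* file offset */
    long      uio_resid;     /* residual byte count */
    uint16_t  uio_extflag;   /* extended flags */
} uio_t;

#define UIO_XUIO 0x004       /* Structure is xuio_t (value from the proposal) */

enum uio_rw { UIO_READ, UIO_WRITE };

typedef enum xuio_type {
    UIOTYPE_ASYNCIO,
    UIOTYPE_ZEROCOPY
} xuio_type_t;

typedef struct xuio {
    uio_t xu_uio;            /* embedded uio; must stay the first member */
    enum xuio_type xu_type;  /* which extension is in use */
    union {
        struct {
            enum uio_rw xu_zc_rw;  /* the use of the buffer */
            void *xu_zc_priv;      /* fs specific */
        } xu_zc;
    } xu_ext;                /* xu_aio omitted in this sketch */
} xuio_t;

/*
 * Hypothetical helper: return the enclosing xuio_t when the flag says the
 * uio_t is embedded in one, otherwise NULL.
 */
static xuio_t *
uio_to_xuio(uio_t *uiop)
{
    assert(uiop != NULL);
    /* The cast below is only valid because xu_uio sits at offset 0. */
    assert(offsetof(xuio_t, xu_uio) == 0);

    if ((uiop->uio_extflag & UIO_XUIO) == 0)
        return (NULL);
    return ((xuio_t *)uiop);
}
```

A caller would check the return value for NULL and then look at xu_type
before touching any member of xu_ext.
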
>
>  b. Requesting zero-copy buffers
>
>     #define VOP_REQZCBUF(vp, rwflag, uiozcp, cr, ct) \
>     fop_reqzcbuf(vp, rwflag, uiozcp, cr, ct)
>
>     int fop_reqzcbuf(vnode_t *, enum uio_rw, xuio_t *, cred_t *,
>       caller_context_t *);
>  
>     This function requests buffers associated with file vp in preparation
>     for a subsequent zero-copy read or write. The extended uio_t, xuio_t,
>     is used to pass the parameters and results. Only the following fields
>     of xuio_t are relevant to this call:
>  
>     uiozcp->xu_uio.uio_resid: used by the caller to specify the total length
>          of the buffer.
>
>     uiozcp->xu_uio.uio_loffset: used by the caller to indicate the file offset
>          it would like the buffers to be associated with. A value of -1
>          indicates that the provider returns buffers that are not associated
>          with a particular offset.  These are defined to be anonymous buffers.
>          Anonymous buffers may be used for requesting a write buffer to
>          receive data over the wire, where the file offset might not be
>          readily available.
>
>     uiozcp->xu_uio.uio_iov: used by the provider to return an array of buffers
>          (in case multiple filesystem buffers have to be reserved for the
>          requested length).
>
>     uiozcp->xu_uio.uio_iovcnt: used by the provider to indicate the number of
>          returned buffers (length of array uiozcp->xu_uio.uio_iov).
>
>     Other arguments to the call include:
>
>     vp:  vnode pointer of the associated file.
>
>     rwflag: Indicates what the buffers are to be subsequently used for.
>             Expected values are UIO_READ for VOP_READ() and UIO_WRITE for
>             VOP_WRITE().
>
>     Upon successful completion, the function returns 0. One or more
>     buffers may be returned as referenced by the uio_iov[] and uio_iovcnt
>     members. uiozcp->xu_uio.uio_extflag is set to UIO_XUIO, and
>     uiozcp->xu_type is set to UIOTYPE_ZEROCOPY.
>
>     The caller can use this returned xuio_t in a subsequent call to VOP_READ
>     or VOP_WRITE. In the case of UIO_READ buffers, the caller should
>     reference the uio_iov[] buffers only after a successful VOP_READ().
>     In the case of UIO_WRITE buffers, the caller should not reference
>     the uio_iov[] buffers after a successful VOP_WRITE.
>
>     In the case of anonymous buffers, the caller should set the value of 
>     uio_loffset before such a read/write call. This should be done only in 
>     the case of anonymous buffers. 
>
>     The member xu_zc_priv of the extended uio structure for zero-copy is 
>     a private handle that may be used by the provider to track its buffer
>     headers or any other private information that is useful to map the 
>     loaned iovec entries to its internal buffers. The xu_zc_priv member
>     is private to the provider and should not be changed or interpreted
>     in any way by the callers.
>
>     Upon failure, the function returns EINVAL and the content
>     of uiozcp should be ignored by the callers. The provider must fail the
>     request if it is unable to satisfy the complete request (i.e., it must
>     not return buffers that cover only a part of the length that was
>     asked for).
>
>     Probable causes for failure include:
>
>     - the filesystem is short on buffers to loan out at the time
>     - the filesystem determines that it's not efficient to take the
>       zero-copy path based on the input parameters
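
The all-or-nothing rule above can be sketched in user space as follows.
This is a mock, not the proposed kernel code: the fixed pool, the buffer
size, the mock_* names, and the choice to reject a zero-length request are
all assumptions made for illustration. Returning buffers to the pool is
left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define ZC_BUFSZ 4096   /* hypothetical provider buffer size */
#define ZC_NBUFS 4      /* hypothetical number of loanable buffers */

/* Mock stand-ins for iovec_t and the relevant uio_t fields. */
typedef struct mock_iovec {
    void  *iov_base;
    size_t iov_len;
} mock_iovec_t;

typedef struct mock_uio {
    mock_iovec_t *uio_iov;     /* set by the provider on success */
    int           uio_iovcnt;  /* set by the provider on success */
    size_t        uio_resid;   /* total length requested by the caller */
} mock_uio_t;

static char         zc_pool[ZC_NBUFS][ZC_BUFSZ];
static mock_iovec_t zc_iov[ZC_NBUFS];
static int          zc_next;   /* index of the first buffer not yet loaned */

/*
 * Loan enough buffers to cover uio_resid, or fail with EINVAL and leave
 * the caller's structure untouched. A partial loan is never returned.
 */
static int
mock_reqzcbuf(mock_uio_t *uiop)
{
    size_t len = uiop->uio_resid;
    size_t avail = (size_t)(ZC_NBUFS - zc_next) * ZC_BUFSZ;
    int need, i;

    if (len == 0 || len > avail)
        return (EINVAL);

    need = (int)((len + ZC_BUFSZ - 1) / ZC_BUFSZ);
    for (i = 0; i < need; i++) {
        size_t chunk = len < ZC_BUFSZ ? len : ZC_BUFSZ;

        zc_iov[zc_next + i].iov_base = zc_pool[zc_next + i];
        zc_iov[zc_next + i].iov_len = chunk;
        len -= chunk;
    }
    uiop->uio_iov = &zc_iov[zc_next];
    uiop->uio_iovcnt = need;
    zc_next += need;
    return (0);
}
```

With this pool, a 6000-byte request comes back as two iovec entries, and a
following 12288-byte request fails outright because only two buffers
remain.
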
>     
>  c. Returning zero-copy buffers
>
>     #define VOP_RETZCBUF(vp, uiozcp, cr, ct) \
>     fop_retzcbuf(vp, uiozcp, cr, ct)
>
>     int fop_retzcbuf(vnode_t *, xuio_t *, cred_t *, caller_context_t *);
>  
>     This function returns the buffers previously obtained via a call
>     to VOP_REQZCBUF(). In case multiple buffers are associated with the
>     uio_iov[], all the buffers associated with the uiozcp are returned.
>     In other words, VOP_RETZCBUF() should only be called once per xuio_t.
>     The caller should not reference any of the uio_iov[] members after
>     a return.
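
One way a provider might enforce "once per xuio_t" is to keep a small loan
record behind xu_zc_priv and clear it on return. The sketch below is a
user-space mock under that assumption: zc_loan_t, mock_loan(),
mock_retzcbuf(), and the EINVAL result for a second return are hypothetical
and are not specified by the proposal.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical provider-side record kept behind xu_zc_priv. */
typedef struct zc_loan {
    int zl_outstanding;   /* 1 while the buffers are on loan */
    int zl_nbufs;         /* number of buffers covered by this loan */
} zc_loan_t;

/* Trimmed stand-in for the zero-copy part of xuio_t. */
typedef struct mock_xuio {
    void *xu_zc_priv;     /* provider-private handle */
} mock_xuio_t;

static int zc_on_loan;    /* total buffers currently loaned out */

/* Record a loan of nbufs buffers against this xuio. */
static void
mock_loan(mock_xuio_t *xuio, zc_loan_t *loan, int nbufs)
{
    loan->zl_outstanding = 1;
    loan->zl_nbufs = nbufs;
    zc_on_loan += nbufs;
    xuio->xu_zc_priv = loan;
}

/*
 * Return every buffer of the loan in one call. A second call on the same
 * xuio finds no outstanding loan and is rejected (assumed behaviour).
 */
static int
mock_retzcbuf(mock_xuio_t *xuio)
{
    zc_loan_t *loan = xuio->xu_zc_priv;

    if (loan == NULL || !loan->zl_outstanding)
        return (EINVAL);

    zc_on_loan -= loan->zl_nbufs;   /* all buffers go back together */
    loan->zl_outstanding = 0;
    xuio->xu_zc_priv = NULL;
    return (0);
}
```

A counter such as zc_on_loan is also one place where a provider could
notice loans that were never returned, which is the orphan case raised at
the top of this message.
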
>
>  d. New VFS feature attributes
>
>     A new VFS feature attribute is introduced to indicate support for
>     the zero-copy interface.
>
>   #define VFSFT_ZEROCOPY_SUPPORTED     0x100000100
>
>    Zero-copy is an optional feature. A filesystem supporting the
>    zero-copy interface (i.e., the Interface Provider) must set this
>    VFS feature attribute through the VFS Feature Registration
>    interface[2]. Callers of the interface (i.e., the Interface Consumer)
>    must check for support through the vfs_has_feature() interface.
>    The intermediate fop routines (called via the VOP_* macros) will detect
>    if the interfaces are being called for a filesystem that does not support
>    zero-copy and will return ENOTSUP.
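
The gating described above can be sketched as a dispatch wrapper that
checks the feature before calling into the filesystem. This is a user-space
mock: the single 64-bit feature word is a simplification of the real VFS
feature registration interface, and every mock_* name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define VFSFT_ZEROCOPY_SUPPORTED 0x100000100ULL  /* value from the proposal */

/* Mock vfs: one feature word instead of the real feature set. */
typedef struct mock_vfs {
    uint64_t vfs_features;
} mock_vfs_t;

/* Mock vnode with a single zero-copy request operation. */
typedef struct mock_vnode {
    mock_vfs_t *v_vfsp;
    int (*v_reqzcbuf)(struct mock_vnode *);
} mock_vnode_t;

/* Simplified stand-in for vfs_has_feature(). */
static int
mock_vfs_has_feature(const mock_vfs_t *vfsp, uint64_t feature)
{
    return ((vfsp->vfs_features & feature) == feature);
}

/* Filesystem-level operation; trivially succeeds in this sketch. */
static int
mock_fs_reqzcbuf(mock_vnode_t *vp)
{
    (void) vp;
    return (0);
}

/*
 * Intermediate routine: refuse with ENOTSUP when the filesystem has not
 * registered the feature, otherwise call the filesystem's operation.
 */
static int
mock_fop_reqzcbuf(mock_vnode_t *vp)
{
    if (!mock_vfs_has_feature(vp->v_vfsp, VFSFT_ZEROCOPY_SUPPORTED))
        return (ENOTSUP);
    return (vp->v_reqzcbuf(vp));
}
```

A consumer would still be expected to check for the feature itself; the
wrapper only guards against calls that skip that check.
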
>
>  INTERFACE TABLE
>
>  +=======================================================================+
>                             |Proposed       |Specified   |
>                             |Stability      |in what     |
>   Interface Name            |Classification |Document?   | Comments
>  +=======================================================================+
>    VOP_REQZCBUF()           |Consolidation  |This        | New VOP calls
>    fop_reqzcbuf()           |Private        |Document    |
>    VOP_RETZCBUF()           |               |            |
>    fop_retzcbuf()           |               |            |
>                             |               |            |
>    VFSFT_ZEROCOPY_SUPPORTED |               |            | New VFS feature
>                             |               |            | definition
>                             |               |            |
>    xuio_t                   |               |            | Extended uio_t
>                             |               |            | definition
>                             |               |            |
>    uioa_t                   |               |            | Deprecated
>    UIO_ASYNC                |               |            | Deprecated
>  +=======================================================================+
>
>  * The project's deliverables will all go into the OS/NET
>    Consolidation, so no contracts are required.
>
>
>  == Using the New VOP Interfaces for Zero-copy ==
>
>  VOP_REQZCBUF()/VOP_RETZCBUF() are expected to be used in conjunction with
>  VOP_READ() or VOP_WRITE() to implement zero-copy read or write. 
>
>  a. Read
>
>     In a normal read, the consumer allocates the data buffer and passes it to
>     VOP_READ().  The provider initiates the I/O, and copies the data from its
>     own cache buffer to the consumer supplied buffer.
>
>     To avoid the copy (initiating a zero-copy read), the consumer first calls
>     VOP_REQZCBUF() to inform the provider to prepare to loan out its cache
>     buffer.  It then calls VOP_READ().  After the call returns, the consumer
>     has direct access to the cache buffer loaned out by the provider.  After
>     processing the data, the consumer calls VOP_RETZCBUF() to return the
>     loaned cache buffer to the provider.
>
>     Here is an illustration using NFSv4 read over TCP:
>
>         rfs4_op_read(nfs_argop4 *argop, ...)
>         {
>             int zerocopy;
>             xuio_t *xuio;
>             ...
>             xuio = kmem_alloc(sizeof(xuio_t), KM_SLEEP);
>             setup length, offset, etc;
>             if (VOP_REQZCBUF(vp, UIO_READ, xuio, cr, &ct)) {
>                 zerocopy = 0;
>                 allocate the data buffer the normal way;
>                 initialize (uio_t *)xuio;
>             } else {
>                 /* xuio has been setup by the provider */
>                 zerocopy = 1;
>             }
>             do_io(FREAD, vp, (uio_t *)xuio, 0, cr, &ct);
>             ...
>             if (zerocopy) {
>                 setup callback mechanism that makes the network layer call
>                 VOP_RETZCBUF() and free xuio after the data is sent out;
>             } else {
>                 kmem_free(xuio, sizeof(xuio_t));
>             }
>         }
>
>  b. Write
>
>     In a normal write, the consumer allocates the data buffer, loads the data,
>     and passes the buffer to VOP_WRITE().  The provider copies the data from
>     the consumer supplied buffer to its own cache buffer, and starts the I/O.
>
>     To initiate a zero-copy write, the consumer first calls VOP_REQZCBUF() to
>     grab a cache buffer from the provider.  It loads the data directly to
>     the loaned cache buffer, and calls VOP_WRITE().  After the call returns,
>     the consumer calls VOP_RETZCBUF() to return the loaned cache buffer to
>     the provider.
>
>     Here is an illustration using NFSv4 write via RDMA:
>
>         rfs4_op_write(nfs_argop4 *argop, ...)
>         {
>             int zerocopy;
>             xuio_t *xuio;
>             ...
>             xuio = kmem_alloc(sizeof(xuio_t), KM_SLEEP);
>             setup length, offset, etc;
>             if (VOP_REQZCBUF(vp, UIO_WRITE, xuio, cr, &ct)) {
>                 zerocopy = 0;
>                 allocate the data buffer the normal way;
>                 initialize (uio_t *)xuio;
>                 xdrrdma_read_from_client(...);
>             } else {
>                 /* xuio has been setup by the provider */
>                 zerocopy = 1;
>                 xdrrdma_zcopy_read_from_client(..., xuio);
>             }
>             do_io(FWRITE, vp, (uio_t *)xuio, 0, cr, &ct);
>             ...
>             if (zerocopy) {
>                 VOP_RETZCBUF(vp, xuio, cr, &ct);
>             }
>             kmem_free(xuio, sizeof(xuio_t));
>         }
>
>
>  References:
>   [1] PSARC/2003/172 File Event Monitoring 
>   [2] PSARC/2007/227 VFS Features 
>
>
> 6. Resources and Schedule
>     6.4. Steering Committee requested information
>       6.4.1. Consolidation C-team Name:
>               ON
>     6.5. ARC review type: FastTrack
>     6.6. ARC Exposure: open
>
>   


-- 
---------------------------------------------------------------------
Rick Matthews                           email: Rick.Matthews at sun.com
Sun Microsystems, Inc.                  phone:+1(651) 554-1518
1270 Eagan Industrial Road              phone(internal): 54418
Suite 160                               fax:  +1(651) 554-1540
Eagan, MN 55121-1231 USA                main: +1(651) 554-1500          
---------------------------------------------------------------------
