My issues have been resolved. Thanks Mahesh. -r
Rich.Brown at Sun.COM writes:
> I'm sponsoring this case on behalf of Mahesh Siddheshwar and Chunli Zhang.
> This case proposes new interfaces to support copy reduction in the I/O path
> especially for file sharing services.
>
> Minor binding is requested.
>
> This times out on Wednesday, 16 September, 2009.
>
>
> Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
> This information is Copyright 2009 Sun Microsystems
> 1. Introduction
>     1.1. Project/Component Working Name:
>          Copy Reduction Interfaces
>     1.2. Name of Document Author/Supplier:
>          Author: Mahesh Siddheshwar, Chunli Zhang
>     1.3  Date of This Document:
>          09 September, 2009
> 4. Technical Description
>
> == Introduction/Background ==
>
> Zero-copy (copy avoidance) is essentially buffer sharing
> among multiple modules that pass data between the modules.
> This proposal avoids the data copy in the READ/WRITE path
> of filesystems, by providing a mechanism to share data buffers
> between the modules. It is intended to be used by network file
> sharing services like NFS, CIFS or others.
>
> Although the buffer sharing can be achieved through a few different
> solutions, any such solution must work with File Event Monitors
> (FEM monitors)[1] installed on the files. The solution must
> allow the underlying filesystem to maintain any existing file
> range locking in the filesystem.
>
> The proposed solution provides extensions to the existing VOP
> interface to request and return buffers from a filesystem. The
> buffers are then used with existing VOP_READ/VOP_WRITE calls with
> minimal changes.
>
>
> == Proposed Changes ==
>
> VOP Extensions for Zero-Copy Support
> ========================================
>
> a. Extended struct uio, xuio_t
>
> The following proposes an extensible uio structure that can be extended for
> multiple purposes.
> For example, an immediate extension, xu_zc, is to be
> used by the proposed VOP_REQZCBUF/VOP_RETZCBUF interfaces to pass loaned
> zero-copy buffers, as well as to be passed to the existing
> VOP_READ/VOP_WRITE calls for normal read/write operations. Another
> example of extension, xu_aio, is intended to replace uioa_t for async I/O.
>
> This new structure, xuio_t, contains the following:
>
>   - the existing uio structure (embedded) as the first member
>   - additional fields to support extensibility
>   - a union of all the defined extensions
>
> The following uio_extflag is added to indicate that a uio structure is
> indeed an xuio_t:
>
>     #define UIO_XUIO    0x004    /* Structure is xuio_t */
>
> The following uio_extflag will be removed after uioa_t has been converted
> to xuio_t:
>
>     #define UIO_ASYNC   0x002    /* Structure is uioa_t */
>
> The project team has commitment from the networking team to remove
> the current use of uioa_t and use the proposed extensions (CR 6880095).
>
> The definition of xuio_t is:
>
>     typedef struct xuio {
>         uio_t xu_uio;               /* Embedded UIO structure */
>
>         /* Extended uio fields */
>         enum xuio_type xu_type;     /* What kind of uio structure? */
>
>         union {
>
>             /* Async I/O Support */
>             struct {
>                 uint32_t xu_a_state;     /* state of async i/o */
>                 ssize_t xu_a_mbytes;     /* bytes that have been uioamove()ed */
>                 uioa_page_t *xu_a_lcur;  /* pointer into uioa_locked[] */
>                 void **xu_a_lppp;        /* pointer into lcur->uioa_ppp[] */
>                 void *xu_a_hwst[4];      /* opaque hardware state */
>                 uioa_page_t xu_a_locked[UIOA_IOV_MAX]; /* Per iov locked pages */
>             } xu_aio;
>
>             /* Zero Copy Support */
>             struct {
>                 enum uio_rw xu_zc_rw;    /* the use of the buffer */
>                 void *xu_zc_priv;        /* fs specific */
>             } xu_zc;
>
>         } xu_ext;
>     } xuio_t;
>
> where xu_type is currently defined as:
>
>     typedef enum xuio_type {
>         UIOTYPE_ASYNCIO,
>         UIOTYPE_ZEROCOPY
>     } xuio_type_t;
>
> New uio extensions can be added by defining a new xuio_type_t, and adding a
> new member to the xu_ext union.
>
> b. Requesting zero-copy buffers
>
>     #define VOP_REQZCBUF(vp, rwflag, uiozcp, cr, ct) \
>         fop_reqzcbuf(vp, rwflag, uiozcp, cr, ct)
>
>     int fop_reqzcbuf(vnode_t *, enum uio_rw, xuio_t *, cred_t *,
>         caller_context_t *);
>
> This function requests buffers associated with file vp in preparation
> for a subsequent zero copy read or write. The extended uio_t, xuio_t, is
> used to pass the parameters and results. Only the following fields of
> xuio_t are relevant to this call.
>
> uiozcp->xu_uio.uio_resid: used by the caller to specify the total length
>     of the buffer.
>
> uiozcp->xu_uio.uio_loffset: used by the caller to indicate the file
>     offset it would like the buffers to be associated with. A value of -1
>     indicates that the provider returns buffers that are not associated
>     with a particular offset. These are defined to be anonymous buffers.
>     Anonymous buffers may be used for requesting a write buffer to
>     receive data over the wire, where file offset might not be handily
>     available.
>
> uiozcp->xu_uio.uio_iov: used by the provider to return an array of
>     buffers (in case multiple filesystem buffers have to be reserved for
>     the requested length).
>
> uiozcp->xu_uio.uio_iovcnt: used by the provider to indicate the number of
>     returned buffers (length of array uiozcp->xu_uio.uio_iov).
>
> Other arguments to the call include:
>
> vp: vnode pointer of the associated file.
>
> rwflag: Indicates what the buffers are to be subsequently used for.
>     Expected values are UIO_READ for VOP_READ() and UIO_WRITE for
>     VOP_WRITE().
>
> Upon successful completion, the function returns 0. One or more
> buffers may be returned as referenced by uio_iov[] and uio_iovcnt
> members. uiozcp->xu_uio.uio_extflag is set to UIO_XUIO, and
> uiozcp->xu_type is set to UIOTYPE_ZEROCOPY.
>
> The caller can use this returned xuio_t in a subsequent call to VOP_READ
> or VOP_WRITE. In the case of UIO_READ buffers, the caller should
> reference the uio_iov[] buffers only after a successful VOP_READ().
> In the case of UIO_WRITE buffers, the caller should not reference
> the uio_iov[] buffers after a successful VOP_WRITE.
>
> In the case of anonymous buffers, the caller should set the value of
> uio_loffset before such a read/write call. This should be done only in
> the case of anonymous buffers.
>
> The member xu_zc_priv of the extended uio structure for zero-copy is
> a private handle that may be used by the provider to track its buffer
> headers or any other private information that is useful to map the
> loaned iovec entries to its internal buffers. The xu_zc_priv member
> is private to the provider and should not be changed or interpreted
> in any way by the callers.
>
> Upon failure, the function returns EINVAL and the content
> of uiozcp should be ignored by the callers. The provider must fail the
> request if it is unable to satisfy the complete request (i.e. it must
> not return buffers that cover only a part of the length that was
> asked for).
>
> Probable causes for failure include:
>
>   - the filesystem is short on buffers to loan out at the time
>   - the filesystem determines that it's not efficient to take the
>     zero-copy path based on the input parameters
>
> c. Returning zero-copy buffers
>
>     #define VOP_RETZCBUF(vp, uiozcp, cr, ct) \
>         fop_retzcbuf(vp, uiozcp, cr, ct)
>
>     int fop_retzcbuf(vnode_t *, xuio_t *, cred_t *, caller_context_t *);
>
> This function returns the buffers previously obtained via a call
> to VOP_REQZCBUF(). In case multiple buffers are associated with the
> uio_iov[], all the buffers associated with the uiozcp are returned.
> In other words, VOP_RETZCBUF() should only be called once per xuio_t.
> The caller should not reference any of the uio_iov[] members after
> a return.
>
> d. New VFS feature attributes
>
> A new VFS feature attribute is introduced for the support of the
> zero-copy interface.
>
>     #define VFSFT_ZEROCOPY_SUPPORTED 0x100000100
>
> Zero-copy is an optional feature. A filesystem supporting the
> zero-copy interface (i.e. the Interface Provider) must set this
> VFS feature attribute through the VFS Feature Registration
> interface[2]. Callers of the interface (i.e. the Interface Consumer)
> must check for support through the vfs_has_feature() interface.
> The intermediate fop routines (called via the VOP_* macros) will detect
> if the interfaces are being called for a filesystem that does not support
> zero-copy and will return ENOTSUP.
>
> INTERFACE TABLE
>
> +=========================================================================+
>                          |Proposed       |Specified |
>                          |Stability      |in what   |
> Interface Name           |Classification |Document? | Comments
> +=========================================================================+
> VOP_REQZCBUF()           |Consolidation  |This      | New VOP calls
> fop_reqzcbuf()           |Private        |Document  |
> VOP_RETZCBUF()           |               |          |
> fop_retzcbuf()           |               |          |
>                          |               |          |
> VFSFT_ZEROCOPY_SUPPORTED |               |          | New VFS feature definition
>                          |               |          |
> xuio_t                   |               |          | Extended uio_t definition
>                          |               |          |
> uioa_t                   |               |          | Deprecated
> UIO_ASYNC                |               |          | Deprecated
> +=========================================================================+
>
> * The project's deliverables will all go into the OS/NET
>   Consolidation, so no contracts are required.
>
>
> == Using the New VOP Interfaces for Zero-copy ==
>
> VOP_REQZCBUF()/VOP_RETZCBUF() are expected to be used in conjunction with
> VOP_READ() or VOP_WRITE() to implement zero-copy read or write.
>
> a. Read
>
> In a normal read, the consumer allocates the data buffer and passes it to
> VOP_READ(). The provider initiates the I/O, and copies the data from its
> own cache buffer to the consumer supplied buffer.
>
> To avoid the copy (initiating a zero-copy read), the consumer first calls
> VOP_REQZCBUF() to inform the provider to prepare to loan out its cache
> buffer. It then calls VOP_READ(). After the call returns, the consumer
> has direct access to the cache buffer loaned out by the provider. After
> processing the data, the consumer calls VOP_RETZCBUF() to return the
> loaned cache buffer to the provider.
>
> Here is an illustration using NFSv4 read over TCP:
>
> rfs4_op_read(nfs_argop4 *argop, ...)
> {
>     int zerocopy;
>     xuio_t *xuio;
>     ...
>     xuio = kmem_alloc(sizeof(xuio_t), KM_SLEEP);
>     setup length, offset, etc;
>     if (VOP_REQZCBUF(vp, UIO_READ, xuio, cr, &ct)) {
>         zerocopy = 0;
>         allocate the data buffer the normal way;
>         initialize (uio_t *)xuio;
>     } else {
>         /* xuio has been set up by the provider */
>         zerocopy = 1;
>     }
>     do_io(FREAD, vp, (uio_t *)xuio, 0, cr, &ct);
>     ...
>     if (zerocopy) {
>         setup callback mechanism that makes the network layer call
>         VOP_RETZCBUF() and free xuio after the data is sent out;
>     } else {
>         kmem_free(xuio, sizeof(xuio_t));
>     }
> }
>
> b. Write
>
> In a normal write, the consumer allocates the data buffer, loads the
> data, and passes the buffer to VOP_WRITE(). The provider copies the data
> from the consumer supplied buffer to its own cache buffer, and starts
> the I/O.
>
> To initiate a zero-copy write, the consumer first calls VOP_REQZCBUF() to
> grab a cache buffer from the provider. It loads the data directly to
> the loaned cache buffer, and calls VOP_WRITE(). After the call returns,
> the consumer calls VOP_RETZCBUF() to return the loaned cache buffer to
> the provider.
>
> Here is an illustration using NFSv4 write via RDMA:
>
> rfs4_op_write(nfs_argop4 *argop, ...)
> {
>     int zerocopy;
>     xuio_t *xuio;
>     ...
>     xuio = kmem_alloc(sizeof(xuio_t), KM_SLEEP);
>     setup length, offset, etc;
>     if (VOP_REQZCBUF(vp, UIO_WRITE, xuio, cr, &ct)) {
>         zerocopy = 0;
>         allocate the data buffer the normal way;
>         initialize (uio_t *)xuio;
>         xdrrdma_read_from_client(...);
>     } else {
>         /* xuio has been set up by the provider */
>         zerocopy = 1;
>         xdrrdma_zcopy_read_from_client(..., xuio);
>     }
>     do_io(FWRITE, vp, (uio_t *)xuio, 0, cr, &ct);
>     ...
>     if (zerocopy) {
>         VOP_RETZCBUF(vp, xuio, cr, &ct);
>     }
>     kmem_free(xuio, sizeof(xuio_t));
> }
>
>
> References:
> [1] PSARC/2003/172 File Event Monitoring
> [2] PSARC/2007/227 VFS Features
>
>
> 6. Resources and Schedule
>     6.4. Steering Committee requested information
>         6.4.1. Consolidation C-team Name:
>                ON
>     6.5. ARC review type: FastTrack
>     6.6. ARC Exposure: open
>
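
For anyone following the request/return contract quoted above, here is a rough user-space sketch of it. It is not the kernel code: every mock_* type and function, MOCK_BUFSZ and MOCK_MAXBUFS are hypothetical stand-ins invented for illustration, and only the names UIO_XUIO, UIOTYPE_ZEROCOPY and the xu_* field names come from the proposal. It tries to show three points from the text: the provider either covers the whole requested length or fails with EINVAL, the embedded uio sits first so a plain uio pointer can be cast back once UIO_XUIO is seen, and all loaned buffers go back in a single return call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Everything named mock_* or MOCK_* is a hypothetical user-space stand-in. */
#define UIO_XUIO     0x004   /* flag value as quoted in the proposal */
#define MOCK_BUFSZ   4096    /* assumed size of one provider cache buffer */
#define MOCK_MAXBUFS 4       /* assumed limit on buffers loaned per request */

enum uio_rw { UIO_READ, UIO_WRITE };
typedef enum xuio_type { UIOTYPE_ASYNCIO, UIOTYPE_ZEROCOPY } xuio_type_t;

struct mock_iovec { void *iov_base; size_t iov_len; };

typedef struct mock_uio {
    struct mock_iovec *uio_iov;
    int uio_iovcnt;
    long long uio_loffset;
    size_t uio_resid;
    unsigned int uio_extflag;
} mock_uio_t;

typedef struct mock_xuio {
    mock_uio_t xu_uio;       /* first member, so a mock_uio_t * can be cast back */
    xuio_type_t xu_type;
    union {
        struct { enum uio_rw xu_zc_rw; void *xu_zc_priv; } xu_zc;
    } xu_ext;
} mock_xuio_t;

/* Return every loaned buffer in one call; also used to unwind a failed request. */
static void
mock_retzcbuf(mock_xuio_t *x)
{
    for (int i = 0; i < x->xu_uio.uio_iovcnt; i++)
        free(x->xu_uio.uio_iov[i].iov_base);
    free(x->xu_uio.uio_iov);
    x->xu_uio.uio_iov = NULL;
    x->xu_uio.uio_iovcnt = 0;
}

/* Loan buffers covering all of uio_resid, or fail with EINVAL (never partial). */
static int
mock_reqzcbuf(enum uio_rw rw, mock_xuio_t *x)
{
    size_t left = x->xu_uio.uio_resid;
    size_t n;

    if (left == 0 || left > (size_t)MOCK_MAXBUFS * MOCK_BUFSZ)
        return (EINVAL);
    n = (left + MOCK_BUFSZ - 1) / MOCK_BUFSZ;
    x->xu_uio.uio_iov = calloc(n, sizeof (struct mock_iovec));
    if (x->xu_uio.uio_iov == NULL)
        return (EINVAL);
    x->xu_uio.uio_iovcnt = 0;
    for (size_t i = 0; i < n; i++) {
        void *buf = calloc(1, MOCK_BUFSZ);
        if (buf == NULL) {
            mock_retzcbuf(x);    /* give back what was reserved so far */
            return (EINVAL);
        }
        x->xu_uio.uio_iov[i].iov_base = buf;
        x->xu_uio.uio_iov[i].iov_len = left < MOCK_BUFSZ ? left : MOCK_BUFSZ;
        left -= x->xu_uio.uio_iov[i].iov_len;
        x->xu_uio.uio_iovcnt++;
    }
    x->xu_uio.uio_extflag |= UIO_XUIO;
    x->xu_type = UIOTYPE_ZEROCOPY;
    x->xu_ext.xu_zc.xu_zc_rw = rw;
    x->xu_ext.xu_zc.xu_zc_priv = NULL;   /* a real provider keeps its handle here */
    return (0);
}

/* What a read/write path could do when handed only the embedded uio pointer. */
static int
mock_is_zerocopy(const mock_uio_t *uio)
{
    if ((uio->uio_extflag & UIO_XUIO) == 0)
        return (0);
    return (((const mock_xuio_t *)uio)->xu_type == UIOTYPE_ZEROCOPY);
}
```

A caller would zero a mock_xuio_t, set xu_uio.uio_resid (and uio_loffset), call mock_reqzcbuf(), fall back to its own buffer on a nonzero return, and call mock_retzcbuf() exactly once when done. The fixed buffer size and the loan limit are arbitrary choices here; a real provider would decide both.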