Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]
Roch wrote:

Filesystems might have some blocksize and alignment constraints conditioning their ability to loan up buffers (for writes). If that is so, we could use an API to query the FS about those values. For a copy-on-write, variable-block-size filesystem, that natural blocksize might also depend on the vnode being targeted.

Yes. The provider can fail the VOP_REQZCBUF() call if it determines that it is inefficient to take the zero-copy path. Depending on the provider implementation, the request may need to be blocksize aligned. In such cases, the consumer could use the VFSNAME_STATVFS() call to determine the 'f_bsize' value. But as you note, certain implementations may have different values for individual files. In such cases, if the VOP_REQZCBUF() call fails, the consumer uses the traditional non-zero-copy path.

An additional API to find such constraints/requirements may be useful in the future, but is out of scope for this project. However, the project team will open an RFE for this issue and put you on the interest list.

Do we know if ZFS will ever be able to loan up buffers for writes that are not aligned full records?

No, not planned currently. It has to be block size aligned.

Also note that currently, from an implementation perspective, zero-copy WRITEs are efficient only in the case of network-based filesystems like NFS over RDMA transports.

Mahesh

-r

Rich.Brown at Sun.COM writes:

I'm sponsoring this case on behalf of Mahesh Siddheshwar and Chunli Zhang. This case proposes new interfaces to support copy reduction in the I/O path, especially for file sharing services. Minor binding is requested. This times out on Wednesday, 16 September, 2009.

Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
This information is Copyright 2009 Sun Microsystems

1. Introduction
    1.1. Project/Component Working Name:
         Copy Reduction Interfaces
    1.2. Name of Document Author/Supplier:
         Author: Mahesh Siddheshwar, Chunli Zhang
    1.3. Date of This Document:
         09 September, 2009

4. Technical Description

== Introduction/Background ==

Zero-copy (copy avoidance) is essentially buffer sharing among multiple modules that pass data between the modules. This proposal avoids the data copy in the READ/WRITE path of filesystems, by providing a mechanism to share data buffers between the modules. It is intended to be used by network file sharing services like NFS, CIFS or others.

Although the buffer sharing can be achieved through a few different solutions, any such solution must work with File Event Monitors (FEM monitors)[1] installed on the files. The solution must allow the underlying filesystem to maintain any existing file range locking in the filesystem.

The proposed solution provides extensions to the existing VOP interface to request and return buffers from a filesystem. The buffers are then used with existing VOP_READ/VOP_WRITE calls with minimal changes.

== Proposed Changes ==

VOP Extensions for Zero-Copy Support

a. Extended struct uio, xuio_t

The following proposes an extensible uio structure that can be extended for multiple purposes. For example, an immediate extension, xu_zc, is to be used by the proposed VOP_REQZCBUF/VOP_RETZCBUF interfaces to pass loaned zero-copy buffers, as well as to be passed to the existing VOP_READ/VOP_WRITE calls for normal read/write operations. Another example of extension, xu_aio, is intended to replace uioa_t for async I/O.

This new structure, xuio_t, contains the following:

- the existing uio structure (embedded) as the first member
- additional fields to support extensibility
- a union of all the defined extensions

The following uio_extflg is added to indicate that a uio structure is indeed an xuio_t:

    #define UIO_XUIO    0x004   /* Structure is xuio_t */

The following uio_extflg will be removed after uioa_t has been converted to xuio_t:

    #define UIO_ASYNC   0x002   /* Structure is xuio_t */

The project team has commitment from the networking team to remove the current use of uioa_t and use the proposed extensions (CR 6880095).

The definition of xuio_t is:

    typedef struct xuio {
        uio_t xu_uio;               /* Embedded UIO structure */

        /* Extended uio fields */
        enum xuio_type xu_type;     /* What kind of uio structure? */

        union {
            /* Async I/O Support */
            struct {
                uint32_t xu_a_state;    /* state of async i/o */
                ssize_t xu_a_mbytes;    /* bytes that have been uioamove()ed */
                uioa_page_t *xu_a_lcur; /* pointer into uioa_locked
Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]
Rick Matthews wrote:

Are there instances where an assigned zero-copy buffer could be orphaned?

No. The consumer must release the buffers through VOP_RETZCBUF().

Mahesh

If so, should there be a recovery list associated with this addition? Perhaps off the designated vnode. This comment shouldn't block fast-track approval. Just a question.

-- Rick

On 09/ 9/09 04:02 PM, Rich.Brown at Sun.COM wrote:

I'm sponsoring this case on behalf of Mahesh Siddheshwar and Chunli Zhang. This case proposes new interfaces to support copy reduction in the I/O path, especially for file sharing services. Minor binding is requested. This times out on Wednesday, 16 September, 2009.

Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
This information is Copyright 2009 Sun Microsystems

1. Introduction
    1.1. Project/Component Working Name:
         Copy Reduction Interfaces
    1.2. Name of Document Author/Supplier:
         Author: Mahesh Siddheshwar, Chunli Zhang
    1.3. Date of This Document:
         09 September, 2009

4. Technical Description

== Introduction/Background ==

Zero-copy (copy avoidance) is essentially buffer sharing among multiple modules that pass data between the modules. This proposal avoids the data copy in the READ/WRITE path of filesystems, by providing a mechanism to share data buffers between the modules. It is intended to be used by network file sharing services like NFS, CIFS or others.

Although the buffer sharing can be achieved through a few different solutions, any such solution must work with File Event Monitors (FEM monitors)[1] installed on the files. The solution must allow the underlying filesystem to maintain any existing file range locking in the filesystem.

The proposed solution provides extensions to the existing VOP interface to request and return buffers from a filesystem. The buffers are then used with existing VOP_READ/VOP_WRITE calls with minimal changes.

== Proposed Changes ==

VOP Extensions for Zero-Copy Support

a. Extended struct uio, xuio_t

The following proposes an extensible uio structure that can be extended for multiple purposes. For example, an immediate extension, xu_zc, is to be used by the proposed VOP_REQZCBUF/VOP_RETZCBUF interfaces to pass loaned zero-copy buffers, as well as to be passed to the existing VOP_READ/VOP_WRITE calls for normal read/write operations. Another example of extension, xu_aio, is intended to replace uioa_t for async I/O.

This new structure, xuio_t, contains the following:

- the existing uio structure (embedded) as the first member
- additional fields to support extensibility
- a union of all the defined extensions

The following uio_extflg is added to indicate that a uio structure is indeed an xuio_t:

    #define UIO_XUIO    0x004   /* Structure is xuio_t */

The following uio_extflg will be removed after uioa_t has been converted to xuio_t:

    #define UIO_ASYNC   0x002   /* Structure is xuio_t */

The project team has commitment from the networking team to remove the current use of uioa_t and use the proposed extensions (CR 6880095).

The definition of xuio_t is:

    typedef struct xuio {
        uio_t xu_uio;               /* Embedded UIO structure */

        /* Extended uio fields */
        enum xuio_type xu_type;     /* What kind of uio structure? */

        union {
            /* Async I/O Support */
            struct {
                uint32_t xu_a_state;    /* state of async i/o */
                ssize_t xu_a_mbytes;    /* bytes that have been uioamove()ed */
                uioa_page_t *xu_a_lcur; /* pointer into uioa_locked[] */
                void **xu_a_lppp;       /* pointer into lcur->uioa_ppp[] */
                void *xu_a_hwst[4];     /* opaque hardware state */
                uioa_page_t xu_a_locked[UIOA_IOV_MAX]; /* Per iov locked pages */
            } xu_aio;

            /* Zero Copy Support */
            struct {
                enum uio_rw xu_zc_rw;   /* the use of the buffer */
                void *xu_zc_priv;       /* fs specific */
            } xu_zc;
        } xu_ext;
    } xuio_t;

where xu_type is currently defined as:

    typedef enum xuio_type {
        UIOTYPE_ASYNCIO,
        UIOTYPE_ZEROCOPY
    } xuio_type_t;

New uio extensions can be added by defining a new xuio_type_t, and adding a new member to the xu_ext union.

b. Requesting zero-copy buffers

    #define VOP_REQZCBUF(vp, rwflag, uiozcp, cr, ct) \
        fop_reqzcbuf(vp, rwflag, uiozcp, cr, ct)

    int fop_reqzcbuf(vnode_t *, enum uio_rw, xuio_t *, cred_t *,
        caller_context_t *);

This function requests buffers associated with file vp in preparation for a subsequent zero-copy read or write. The extended uio_t, xuio_t, is used to pass the parameters and results. Only the following fields of xuio_t
Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]
Roland Mainz wrote:

Mahesh Siddheshwar wrote:

Roland Mainz wrote:

Does it make sense to have a |xu_flags| field here for future enhancements?

If future enhancements are needed to extend xuio_t, a new xuio_type can be defined and extended that way. For extensions not specific to xuio, there also exists uio_extflg in the uio_t. Without a particular purpose, an additional flag seems unnecessary for zero-copy right now.

Right now... yes. But Unix has a little (IMO) ugly tradition of not adding such flag fields and instead swamping the headers with many, many variations of one interface over time, which could be avoided by having a flags field as an argument (that's a generic issue).

[snip]

b. Requesting zero-copy buffers

    #define VOP_REQZCBUF(vp, rwflag, uiozcp, cr, ct) \
        fop_reqzcbuf(vp, rwflag, uiozcp, cr, ct)

    int fop_reqzcbuf(vnode_t *, enum uio_rw, xuio_t *, cred_t *,
        caller_context_t *);

AFAIK the prototype should have a flags field to allow future changes/extensions without adding another VOP_*-hook ...

Roland, if the extensions/changes are for the purpose of copy reduction/buffer sharing, we don't need to add additional VOP_* routines. The current xuio_t extension is defined just for that.

Erm... the idea of having a flags field in |fop_reqzcbuf()| was to allow slight modifications in behaviour. For example, in the future there could be flags which describe where (in a NUMA system) the buffer memory resides (e.g. near the calling thread, near a point which is optimal for all consumers, or near the hardware which fills the buffer, etc.), whether it should be in the L2 cache or not, etc.

Roland, I agree with Nico on the drawbacks of adding undefined flags for future use. You make two suggestions:

1) addition of a flag to xuio_t for future use
2) addition of a flag to VOP_REQZCBUF() for future use

It's the project team's opinion that these flags are not needed for this spec, for the following reasons:

For 1), the existence of uio_extflg in uio_t and the possibility of extending xuio_t through additional xuio_type's make an additional 'xu_flags' flag an overhead that can be avoided.

For 2), we don't have a specific purpose for the flag right now. If there is a need for additional flags or arguments in the future, the VOP routine can be easily extended, as has been done in the recent past with several extensions for existing VOP routines (PSARC 2007/244, 2007/218, 2007/227, 2009/387). The existence of strong type checking for vnode/vfs operations through PSARC 2007/124 makes it easier to catch an interface mismatch.

Thanks,
Mahesh
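The extension route described above (a new xuio_type plus a new member of the xu_ext union, rather than a generic flags field) can be sketched as follows. This is a minimal userland illustration, not kernel code: the embedded uio_t and the xu_aio member are omitted for brevity, and UIOTYPE_EXAMPLE, xu_example, xu_ex_hint, xuio_sketch_t and xuio_describe() are hypothetical names that do not appear in the spec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum uio_rw { UIO_READ, UIO_WRITE };

typedef enum xuio_type {
	UIOTYPE_ASYNCIO,
	UIOTYPE_ZEROCOPY,
	UIOTYPE_EXAMPLE		/* hypothetical new type, not in the spec */
} xuio_type_t;

/* Trimmed stand-in for xuio_t: embedded uio_t and xu_aio left out. */
typedef struct xuio_sketch {
	xuio_type_t xu_type;	/* selects the active union member */
	union {
		struct {
			enum uio_rw xu_zc_rw;	/* the use of the buffer */
			void *xu_zc_priv;	/* fs specific */
		} xu_zc;
		struct {			/* hypothetical extension */
			uint32_t xu_ex_hint;
		} xu_example;
	} xu_ext;
} xuio_sketch_t;

/* Dispatch on xu_type; existing cases are untouched by the new one. */
static const char *
xuio_describe(const xuio_sketch_t *xp)
{
	switch (xp->xu_type) {
	case UIOTYPE_ASYNCIO:
		return ("async I/O");
	case UIOTYPE_ZEROCOPY:
		return (xp->xu_ext.xu_zc.xu_zc_rw == UIO_READ ?
		    "zero-copy read" : "zero-copy write");
	case UIOTYPE_EXAMPLE:
		return ("example extension");
	}
	return ("unknown");
}
```

Code that only understands UIOTYPE_ZEROCOPY keeps working when the new type is added, which is the property the project team relies on when declining a spare flags field.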
Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]
johansen at sun.com wrote:

On Wed, Sep 09, 2009 at 04:02:15PM -0500, Rich.Brown at sun.com wrote:

== Introduction/Background ==

Zero-copy (copy avoidance) is essentially buffer sharing among multiple modules that pass data between the modules. This proposal avoids the data copy in the READ/WRITE path of filesystems, by providing a mechanism to share data buffers between the modules. It is intended to be used by network file sharing services like NFS, CIFS or others.

Although the buffer sharing can be achieved through a few different solutions, any such solution must work with File Event Monitors (FEM monitors)[1] installed on the files. The solution must allow the underlying filesystem to maintain any existing file range locking in the filesystem.

The proposed solution provides extensions to the existing VOP interface to request and return buffers from a filesystem. The buffers are then used with existing VOP_READ/VOP_WRITE calls with minimal changes.

== Proposed Changes ==

...

== Using the New VOP Interfaces for Zero-copy ==

VOP_REQZCBUF()/VOP_RETZCBUF() are expected to be used in conjunction with VOP_READ() or VOP_WRITE() to implement zero-copy read or write.

a. Read

In a normal read, the consumer allocates the data buffer and passes it to VOP_READ(). The provider initiates the I/O, and copies the data from its own cache buffer to the consumer supplied buffer.

To avoid the copy (initiating a zero-copy read), the consumer first calls VOP_REQZCBUF() to inform the provider to prepare to loan out its cache buffer. It then calls VOP_READ(). After the call returns, the consumer has direct access to the cache buffer loaned out by the provider. After processing the data, the consumer calls VOP_RETZCBUF() to return the loaned cache buffer to the provider.

...

b. Write

In a normal write, the consumer allocates the data buffer, loads the data, and passes the buffer to VOP_WRITE(). The provider copies the data from the consumer supplied buffer to its own cache buffer, and starts the I/O.

To initiate a zero-copy write, the consumer first calls VOP_REQZCBUF() to grab a cache buffer from the provider. It loads the data directly to the loaned cache buffer, and calls VOP_WRITE(). After the call returns, the consumer calls VOP_RETZCBUF() to return the loaned cache buffer to the provider.

Just for clarification: this interface only affects pages mapped in the kernel, correct? I'm trying to understand if this is just for reducing the number of in-kernel copies, or if this is a userland - kernel zero-copy interface.

That is correct. This interface is to prevent in-kernel copies and allow buffer sharing between kernel modules (that can be used by in-kernel services like NFS or CIFS). The spec does not define any userland - kernel zero-copy interface.

Thanks,
Mahesh

Thanks,
-j
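The read sequence quoted above (request, read, process, return) can be modelled in a few lines of userland C. This is only a sketch under stated assumptions: every mock_* name and consumer_read() are hypothetical stand-ins for the provider, the xuio_t and the VOP calls, not the kernel interfaces themselves, and the file is assumed to fit in one fixed-size cache buffer. The fallback to a normal copying read when the request is refused follows the behaviour described elsewhere in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define	MOCK_BUFSZ	64

typedef struct mock_file {		/* stands in for the provider */
	char	mf_cache[MOCK_BUFSZ];	/* provider's cache buffer */
	size_t	mf_len;			/* valid bytes (<= MOCK_BUFSZ) */
	int	mf_zc_ok;		/* provider willing to loan? */
	int	mf_loans;		/* buffers currently on loan */
} mock_file_t;

typedef struct mock_xuio {		/* stands in for xuio_t */
	int		mx_zc;		/* set once a loan was granted */
	char		*mx_buf;	/* consumer buffer (normal path) */
	const char	*mx_data;	/* where the read data ended up */
	size_t		mx_len;		/* number of bytes read */
	int		mx_copies;	/* copies made by the provider */
} mock_xuio_t;

/* Models VOP_REQZCBUF(): the provider may refuse the zero-copy path. */
static int
mock_reqzcbuf(mock_file_t *fp, mock_xuio_t *xp)
{
	if (!fp->mf_zc_ok)
		return (-1);
	xp->mx_zc = 1;
	return (0);
}

/* Models VOP_READ(): loans the cache buffer, or copies out of it. */
static void
mock_read(mock_file_t *fp, mock_xuio_t *xp)
{
	xp->mx_len = fp->mf_len;
	if (xp->mx_zc) {
		xp->mx_data = fp->mf_cache;
		fp->mf_loans++;
	} else {
		memcpy(xp->mx_buf, fp->mf_cache, fp->mf_len);
		xp->mx_data = xp->mx_buf;
		xp->mx_copies++;
	}
}

/* Models VOP_RETZCBUF(): hands the loaned buffer back. */
static void
mock_retzcbuf(mock_file_t *fp, mock_xuio_t *xp)
{
	fp->mf_loans--;
	xp->mx_data = NULL;
	xp->mx_zc = 0;
}

/* Consumer: sums the file's bytes; returns the provider copy count. */
static int
consumer_read(mock_file_t *fp, unsigned *sum)
{
	char local[MOCK_BUFSZ];
	mock_xuio_t x = { 0, local, NULL, 0, 0 };
	size_t i;

	(void) mock_reqzcbuf(fp, &x);	/* refusal leaves x.mx_zc == 0 */
	mock_read(fp, &x);
	for (*sum = 0, i = 0; i < x.mx_len; i++)
		*sum += (unsigned char)x.mx_data[i];
	if (x.mx_zc)
		mock_retzcbuf(fp, &x);	/* every loan must be returned */
	return (x.mx_copies);
}
```

With mf_zc_ok set, consumer_read() reports zero provider copies and leaves no outstanding loan; with it cleared, the same consumer code takes the copying path and reports one copy.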
Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]
Hi Roland,

Roland Mainz wrote:

[snip]

How do you handle sparse files, e.g. files with one or more holes?

Sparse files are not handled any differently in VOP_READ/VOP_WRITE calls when using the zero-copy interface. Modules that want to seek/skip holes can use the _FIO_SEEK_DATA/_FIO_SEEK_HOLE commands of VOP_IOCTL to do so.

== Proposed Changes ==

VOP Extensions for Zero-Copy Support

a. Extended struct uio, xuio_t

[snip]

The project team has commitment from the networking team to remove the current use of uioa_t and use the proposed extensions (CR 6880095).

The definition of xuio_t is:

    typedef struct xuio {
        uio_t xu_uio;               /* Embedded UIO structure */

        /* Extended uio fields */
        enum xuio_type xu_type;     /* What kind of uio structure? */

        union {
            /* Async I/O Support */
            struct {
                uint32_t xu_a_state;    /* state of async i/o */
                uint32_t xu_a_state;    /* state of async i/o */
                ssize_t xu_a_mbytes;    /* bytes that have been uioamove()ed */
                uioa_page_t *xu_a_lcur; /* pointer into uioa_locked[] */
                void **xu_a_lppp;       /* pointer into lcur->uioa_ppp[] */
                void *xu_a_hwst[4];     /* opaque hardware state */
                uioa_page_t xu_a_locked[UIOA_IOV_MAX]; /* Per iov locked pages */
            } xu_aio;

            /* Zero Copy Support */
            struct {
                enum uio_rw xu_zc_rw;   /* the use of the buffer */
                void *xu_zc_priv;       /* fs specific */

Does it make sense to have a |xu_flags| field here for future enhancements?

If future enhancements are needed to extend xuio_t, a new xuio_type can be defined and extended that way. For extensions not specific to xuio, there also exists uio_extflg in the uio_t. Without a particular purpose, an additional flag seems unnecessary for zero-copy right now.

Please note that there is a typo in the spec, in the definition for struct xu_aio. The line below is printed twice:

    uint32_t xu_a_state;    /* state of async i/o */

A corrected final spec will be posted in the case directory.

            } xu_zc;
        } xu_ext;
    } xuio_t;

where xu_type is currently defined as:

    typedef enum xuio_type {
        UIOTYPE_ASYNCIO,
        UIOTYPE_ZEROCOPY
    } xuio_type_t;

New uio extensions can be added by defining a new xuio_type_t, and adding a new member to the xu_ext union.

b. Requesting zero-copy buffers

    #define VOP_REQZCBUF(vp, rwflag, uiozcp, cr, ct) \
        fop_reqzcbuf(vp, rwflag, uiozcp, cr, ct)

    int fop_reqzcbuf(vnode_t *, enum uio_rw, xuio_t *, cred_t *,
        caller_context_t *);

AFAIK the prototype should have a flags field to allow future changes/extensions without adding another VOP_*-hook ...

Roland, if the extensions/changes are for the purpose of copy reduction/buffer sharing, we don't need to add additional VOP_* routines. The current xuio_t extension is defined just for that.

Thanks,
Mahesh

Bye,
Roland
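Because the spec embeds the uio_t as the first member of xuio_t and marks such structures with UIO_XUIO in uio_extflg, code that receives a plain uio_t pointer can tell whether it is really looking at a zero-copy xuio_t. The sketch below illustrates that check in userland C. The types are trimmed, hypothetical stand-ins (the real uio_t has more fields and xu_aio is omitted), and uio_to_zc_xuio() is an invented helper name, not part of the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	UIO_XUIO	0x004	/* value as given in the spec text */

enum uio_rw { UIO_READ, UIO_WRITE };

/* Trimmed stand-in for uio_t: only the fields this sketch needs. */
typedef struct uio {
	uint16_t uio_extflg;	/* extended flags */
	size_t uio_resid;	/* residual count */
} uio_t;

typedef enum xuio_type { UIOTYPE_ASYNCIO, UIOTYPE_ZEROCOPY } xuio_type_t;

typedef struct xuio {
	uio_t xu_uio;		/* embedded uio; must stay first */
	enum xuio_type xu_type;	/* what kind of uio structure? */
	union {
		struct {
			enum uio_rw xu_zc_rw;	/* the use of the buffer */
			void *xu_zc_priv;	/* fs specific */
		} xu_zc;
	} xu_ext;
} xuio_t;

/*
 * Hypothetical helper: return the zero-copy view of a uio, or NULL
 * when it is a plain uio_t or carries a different extension.
 */
static xuio_t *
uio_to_zc_xuio(uio_t *uio)
{
	xuio_t *xuio;

	if ((uio->uio_extflg & UIO_XUIO) == 0)
		return (NULL);
	/* Safe only because xu_uio is the first member of xuio_t. */
	xuio = (xuio_t *)uio;
	if (xuio->xu_type != UIOTYPE_ZEROCOPY)
		return (NULL);
	return (xuio);
}
```

The flag test comes before the cast on purpose: a plain uio_t has no xu_type behind it, so reading that field without first checking UIO_XUIO would run past the end of the structure.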
Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]
Garrett D'Amore wrote:

I've not had time to go over all this yet, but do we really believe this kind of change is fast-track appropriate? I have a feeling that this is a significant enough core change, with implications for a variety of project teams, that maybe this one ought to be a full case.

I'd be a bit uncomfortable allowing this one to just time out with a single +1, which is the normal rule for fast tracks. Am I alone in this particular concern?

Are there any implications for unbundled 3rd party filesystems?

Not unless the 3rd party filesystem wants to support this optional feature. This is covered in section (d) of the spec. The intermediate fop routines handle it correctly.

Regards,
Mahesh

- Garrett

Rich.Brown at Sun.COM wrote:

I'm sponsoring this case on behalf of Mahesh Siddheshwar and Chunli Zhang. This case proposes new interfaces to support copy reduction in the I/O path, especially for file sharing services. Minor binding is requested. This times out on Wednesday, 16 September, 2009.

Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
This information is Copyright 2009 Sun Microsystems

1. Introduction
    1.1. Project/Component Working Name:
         Copy Reduction Interfaces
    1.2. Name of Document Author/Supplier:
         Author: Mahesh Siddheshwar, Chunli Zhang
    1.3. Date of This Document:
         09 September, 2009

4. Technical Description

== Introduction/Background ==

Zero-copy (copy avoidance) is essentially buffer sharing among multiple modules that pass data between the modules. This proposal avoids the data copy in the READ/WRITE path of filesystems, by providing a mechanism to share data buffers between the modules. It is intended to be used by network file sharing services like NFS, CIFS or others.

Although the buffer sharing can be achieved through a few different solutions, any such solution must work with File Event Monitors (FEM monitors)[1] installed on the files.