I'm sponsoring the following fast track for Afshin Salek and the CIFS i-team. It times out on Friday, July 17th.
A copy of the specification below appears in the case directory under the name "specification". I've pre-reviewed it and will give it a +1 up front.

-- Glenn

----------------

Template Version: @(#)onepager.txt 1.35 07/11/07 SMI
Copyright 2007 Sun Microsystems

1. Introduction
    1.1. Project/Component Working Name:
         Support for Reparse Points

    1.2. Name of Document Author/Supplier:
         Author: Afshin Salek

    1.3. Date of This Document:
         07/08/09

    1.4. Name of Major Document Customer(s)/Consumer(s):
         PSARC CIFS team

    1.5. Email Aliases:
        1.5.1. Responsible Manager: Barry.Greenberg at Sun.COM
        1.5.2. Responsible Engineer: Afshin.Ardakani at Sun.COM
        1.5.3. Marketing Manager:
        1.5.4. Interest List: cifs-team at sun.com

A patch binding is requested for this change.

4. Technical Description:
    4.1. Details:

INTRODUCTION

There are situations where a mechanism is needed to reflect the concept that data is not present at a particular path, but can be found in some alternate location(s). Examples include "referrals" used to build unified name spaces in NFSv4.x and SMB, and data relocation in HSM systems.

A "reparse point" is defined as the marker for a namespace redirection and a container for the metadata that specifies the target of the redirection.

Reparse points are intended to be a general mechanism for location redirection, and as such the file system that contains them is not cognizant of the reparse point format or content. Services that use reparse points know how to interpret and use the stored data.

REPARSE POINT OBJECT

After a lot of discussion the consensus is that the best way to represent reparse points in the file system, in order to minimize the effect on existing applications and utilities, is to use symbolic links. One of the main goals in this context has been the ability to use existing utilities for backup/restore and also ZFS send/receive without having to modify them to know how to deal with reparse points.
Some of what is envisioned here could be done with extensions to the Solaris automounter capability. Part of the motivation, though, is to create centrally administered namespaces served by a group of fileservers to near-zero-admin clients. It is expected to be easier to keep the namespaces uniform if only a small number of servers need to participate. HSM solutions would also normally be tied closely to a storage server by this mechanism.

Also, for both NFS and SMB referrals, it is the client that chooses the target and not the server. The server only provides the targets' information and it is up to the client to pick the desirable target to access the data.

To distinguish a regular symlink from a reparse point, an extensible system attribute will be set on the symlink. This system attribute is only one bit, which indicates whether or not a symlink contains reparse data. The reparse data will be stored as the link target.

The reparse data is not in file system path format, which is the typical format of a link target. In order to avoid coming up with a totally new format for reparse data as the link target, we decided to adopt the format used by magic links in BSD (http://www.daemon-systems.org/man/symlink.7.html):

    @{repa...@{service-type1:data} [...@{service-type2:data}]...}

Where some examples of service-type are:

    #define REPARSE_SVC_SMB "SMB"
    #define REPARSE_SVC_NFS "NFS"
    #define REPARSE_SVC_HSM "HSM"

The data for each service will be in string format, which is expected to be typically a UUID string.

The pattern above starts with "REPARSE" to distinguish it from other magic links, such as those supported by BSD. Note that this case is not a proposal to support BSD magic links; the intent is to avoid precluding the future addition of full BSD magic link support.

Multiple service entries can co-exist within the symlink data. It is expected that normally all entries would resolve to the same logical location, e.g. NFS and CIFS clients would find the same files.
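
As an illustration of how a consumer might pick its own entry out of reparse data that carries several service entries, here is a minimal sketch in C. It is not part of the proposal. The function name reparse_find() is hypothetical, and the sketch assumes only that each entry has the form "@{service-type:data}" and that the data itself contains no '}' character. The exact outer syntax of the token is taken from the pattern above and is not relied upon.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the proposal): find the data stored for
 * one service type inside a reparse string.
 *
 * Scans for "@{<svc>:" and copies everything up to the next '}' into out.
 * Assumes the data contains no '}' character.
 *
 * Returns 0 on success, -1 if the service is absent, the entry is not
 * terminated, or the output buffer is too small.
 */
int
reparse_find(const char *link, const char *svc, char *out, size_t outlen)
{
	size_t svclen;
	const char *p;

	assert(link != NULL && svc != NULL && out != NULL);

	svclen = strlen(svc);
	p = link;

	while ((p = strstr(p, "@{")) != NULL) {
		p += 2;		/* step past "@{" */
		if (strncmp(p, svc, svclen) == 0 && p[svclen] == ':') {
			const char *data = p + svclen + 1;
			const char *end = strchr(data, '}');
			size_t n;

			if (end == NULL)
				return (-1);	/* unterminated entry */
			n = (size_t)(end - data);
			if (n >= outlen)
				return (-1);	/* caller's buffer too small */
			memcpy(out, data, n);
			out[n] = '\0';
			return (0);
		}
	}
	return (-1);	/* no entry for this service type */
}
```

A caller would pass one of the service-type strings listed above (for example "NFS") and receive that service's data string, typically a UUID, leaving entries for other services untouched.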
BASIC INTERFACES

There is a need for both userspace and kernel APIs to work with reparse points.

Userspace API

In userspace the symlink(2) system call will be used to set a reparse point. The readlink(2) system call will be used in turn to read the reparse data.

Kernel API

In the kernel, VOP_SYMLINK and VOP_READLINK will be used to set/get reparse data.

These interfaces will support all replication, archive and copy operations to preserve reparse points without further changes.

fop_symlink() needs to be modified to recognize the @{REPARSE} tag and pass the appropriate attribute (i.e. the reparse system attribute) to VOP_SYMLINK to be set on the symlink.

IMPLEMENTATION OBSERVATIONS

VFS feature registration can be used to determine whether or not a file system supports reparse points.

Two things are needed to obtain the reparse point data in the kernel. First, the consumer needs to know that a reparse point has been encountered and, second, it needs the vnode pointer to the symlink.

The proposal is to enhance VOP_LOOKUP to return the attributes of the looked up vnode. This way, when the vnode is available, the caller can check the attributes to determine if the returned vnode is a reparse point or a regular symlink.

Here are the old and revised signatures of VOP_LOOKUP:

    int VOP_LOOKUP(vnode_t *dvp, char *nm, vnode_t **vpp,
        pathname_t *pnp, int flags, vnode_t *rdir, cred_t *cr,
        caller_context_t *ct, int *deflags, pathname_t *ppnp)

    int VOP_LOOKUP(vnode_t *dvp, char *nm, vnode_t **vpp,
        pathname_t *pnp, int flags, vnode_t *rdir, cred_t *cr,
        caller_context_t *ct, int *deflags, pathname_t *ppnp,
        vattr_t *vap)

A vattr_t pointer argument is added at the end to return the attributes if it is non-NULL. This is an optimization so that consumers don't have to invoke an extra VOP_GETATTR after lookup to obtain the attributes.

The symlink target size should be increased to 16K to accommodate the maximum size supported for MS-DFS referrals by Windows.
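
To make the userspace API described above concrete, here is a minimal sketch in C that stores a string as a link target with symlink(2) and reads it back with readlink(2). It is an illustration only. The names reparse_set(), reparse_get(), reparse_target_max() and REPARSE_BUFSZ are hypothetical, the 16K constant simply mirrors the target size proposed here, and setting the reparse attribute itself is left to fop_symlink() as described above. The limit is queried with pathconf(2) and falls back to the assumed 16K when the file system reports no limit.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed buffer size: the 16K link target size proposed in this case. */
#define	REPARSE_BUFSZ	(16 * 1024)

/*
 * Hypothetical helper: largest link target the file system holding dir
 * reports, or REPARSE_BUFSZ when pathconf(2) reports no limit or fails.
 */
long
reparse_target_max(const char *dir)
{
	long n = pathconf(dir, _PC_SYMLINK_MAX);

	return (n > 0 ? n : REPARSE_BUFSZ);
}

/*
 * Hypothetical helper: create a symlink at path whose target is the
 * reparse string. Returns 0, or -1 with errno set, like symlink(2).
 */
int
reparse_set(const char *path, const char *data)
{
	assert(path != NULL && data != NULL);

	if (strlen(data) >= REPARSE_BUFSZ) {
		errno = ENAMETOOLONG;
		return (-1);
	}
	return (symlink(data, path));
}

/*
 * Hypothetical helper: read the link target into buf and NUL-terminate
 * it (readlink(2) does not). Returns the length read, or -1 on error.
 */
ssize_t
reparse_get(const char *path, char *buf, size_t buflen)
{
	ssize_t n;

	assert(path != NULL && buf != NULL && buflen > 0);

	n = readlink(path, buf, buflen - 1);
	if (n >= 0)
		buf[n] = '\0';
	return (n);
}
```

Because only symlink(2) and readlink(2) are involved, any existing tool that already copies symbolic links verbatim would carry the stored string along unchanged, which is the compatibility property this design is after.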
Applications are expected to query the PATH_MAX and SYMLINK_MAX values on the local system using pathconf(2)/fpathconf(2). The value of SYMLINK_MAX would be changed to 16K on ZFS. The value of PATH_MAX will not be affected.

To provide compatibility with other UNIXes (see section 6 below), sharemgr(1M) would be enhanced to support a "refer" option for NFS exports. This option would only result in the creation of a reparse point at the specified path and would not actually share the path over NFS.

This case is only about the underlying infrastructure; a future case will be presented to deal with the details and specifics of handling referrals for the NFSv4 server.

SECURITY CONSIDERATIONS

Referrals are similar to regular symbolic links in that they are only pointers to data that could be discovered in some other way. The presence of such a pointer does not compromise the security of the target object or data; the target service or file system must still enforce security.

OPERATION FLOW

Once a kernel service encounters a reparse point, it reads the data using VOP_READLINK and passes the data up to a user space daemon (e.g. reparsed) along with its desired record type. Depending on the requested record type, the daemon could simply extract the information from the passed data and return it to the kernel, or do any other processing necessary to obtain the actual referral information, e.g. contacting the NSDB in the case of FedFS.

Going through a common user space daemon to get the referral data makes this process generic and easily expandable for possible future use cases. Referral extraction and creation by a userspace daemon can be handled via a library plugin architecture for different service types.
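
One way to picture the per-service plugin idea mentioned above is a table that maps a service type to a handler. The sketch below is purely illustrative and is not the design of the reparsed daemon: the type and function names are hypothetical, the table is static where a real daemon would presumably load plugins dynamically, and the single handler shown just copies the stored data back, standing in for whatever lookup a real service would perform.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical plugin callback: turn the data stored for a service into
 * referral information written to out. Returns 0 on success, -1 on error.
 */
typedef int (*reparse_handler_t)(const char *data, char *out, size_t outlen);

typedef struct reparse_plugin {
	const char		*svc_type;	/* e.g. "SMB", "NFS" */
	reparse_handler_t	handler;
} reparse_plugin_t;

/* Placeholder handler: returns the stored data unchanged. */
static int
copy_handler(const char *data, char *out, size_t outlen)
{
	size_t n = strlen(data);

	if (n >= outlen)
		return (-1);
	memcpy(out, data, n + 1);	/* includes the terminating NUL */
	return (0);
}

/* Static stand-in for a set of dynamically loaded plugins. */
static const reparse_plugin_t plugins[] = {
	{ "SMB", copy_handler },
	{ "NFS", copy_handler },
};

/*
 * Hypothetical dispatcher: pick the plugin registered for svc_type and
 * let it process the data. Returns -1 if no plugin handles the type.
 */
int
reparse_dispatch(const char *svc_type, const char *data, char *out,
    size_t outlen)
{
	size_t i;

	assert(svc_type != NULL && data != NULL && out != NULL);

	for (i = 0; i < sizeof (plugins) / sizeof (plugins[0]); i++) {
		if (strcmp(plugins[i].svc_type, svc_type) == 0)
			return (plugins[i].handler(data, out, outlen));
	}
	return (-1);
}
```

Keeping the dispatch keyed on the service type means a new referral mechanism only needs a new table entry (or plugin), with no change to the kernel consumers or to the common path through the daemon.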
Operation Flow Example

Here is a simplified example of operation for a CIFS client that tries to access a file where the path contains a DFS link:

a) Client tries to access \\srv\root\...\link\...\file.txt where:
   'root' is a share (namespace root)
   'link' is a reparse point seen as a folder by the client

b) CIFS server does a VOP_LOOKUP for 'link' and recognizes it as a reparse point by examining the attributes returned by VOP_LOOKUP. At this point STATUS_PATH_NOT_COVERED is returned to the client.

c) Client sends a "link referral" request to the server. CIFS server uses VOP_READLINK to get the 'link' data, sends the data to the 'reparsed' daemon via a door call, and gets back the DFS link targets in a format understandable by the CIFS client. The targets are sent back to the client in response to its "link referral" request.

d) Client picks one of the targets and contacts the target server to access 'file.txt'.

NFS REFERRAL IN OTHER UNIXES

FS referrals have been implemented in other major UNIX distributions such as Linux, AIX and HP-UX, but there is no unified approach or implementation.

Linux, AIX and HP-UX specify referrals as an NFS export option. The option format is basically the same in all three operating systems (refer=path@host) but the presentation is somewhat different in each case:

- In Linux a referral is presented as a mount point.
- In HP-UX a referral is a file system partition or logical volume.
- In AIX a special object is used to represent a referral.

These are all mechanisms to trigger a change in namespace while resolving a path.

This proposal is somewhat aligned with the AIX approach but does not require a new object type to be defined, which has the advantage of not impacting existing applications. As mentioned previously, an NFS "refer" option will be supported to provide option format compatibility.
Additionally, the Solaris requirements include support for both NFS and SMB referrals, whereas these other operating systems only support NFS referrals and do not provide native SMB support. For the Solaris operating system, this proposal provides a generic solution to support multiple, disparate referral mechanisms without placing restrictions on the format required by each mechanism.

The following links provide more detail about each OS discussed above:

http://www.citi.umich.edu/projects/nfsv4/linux/using-referrals.html
http://nfsv4.bullopensource.org/doc/migration-and-replication-0.2.pdf
http://docs.hp.com/en/5900-0306/ch01s11.html?jumpid=reg_R1002_USEN
http://docs.hp.com/en/13578/nfsv4_whitepaper.pdf
http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.commadmn/doc/commadmndita/nfs_referrals.htm

INTERFACE TABLE

                        |Proposed       |Specified |
                        |Stability      |in what   |
Interface Name          |Classification |Document? | Comments
===========================================================================
XAT_REPARSE             |Consolidation  |This      |Reparse extensible
                        |Private        |Document  |attribute
                        |               |          |
VOP_LOOKUP, fop_lookup  |Contracted     |This      |Added new argument:
                        |Consolidation  |Document  |vattr_t *vap
                        |Private*       |          |
                        |               |          |
Reparse token syntax    |Committed      |This      |
                        |Private        |Document  |
                        |               |          |
SYMLINK_MAX             |Committed      |This      |Increased to 16K
                        |               |Document  |

* The project's deliverables will all go into the OS/NET Consolidation, so no contracts are required.

6. Resources and Schedule:
    6.4. Product Approval Committee requested information:
        6.4.1. Consolidation or Component Name:
               ON
    6.5. ARC review type: FastTrack