I am sponsoring this fasttrack on behalf of Robert Harris. The timeout is set to 06/19/2009. Requested binding is patch.
Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
This information is Copyright 2009 Sun Microsystems

1. Introduction
    1.1. Project/Component Working Name:
         Abandon the use of snapshots in mntfs.
    1.2. Name of Document Author/Supplier:
         Author: Robert Harris
    1.3  Date of This Document:
         12 June, 2009

4. Technical Description

1. Proposal:

Abandon the use of snapshots in mntfs.

2. The Problem:

The contents of /etc/mnttab are created by mntfs on demand. mntfs parses the in-kernel mnttab structures to create a snapshot that can be used to satisfy subsequent calls to read() or ioctl(). The snapshot is stored by the kernel within the address space of the process that made the first call to read() or ioctl(). The enclosing mapping is removed from the calling process's address space by mntfs on last close().

The snapshot-in-userland design has a flaw: the kernel cannot determine whether or not a close() is a specific process's last if the vnode count is greater than 1. This is because there is no way to determine whether a count that is greater than one has originated from dup(), from fork() or from both. This means that mntfs is unable to ensure that every insertion of a mapping into a process's address space is paired with a corresponding deletion.

Two specific manifestations are 6394241, in which a newly-execed process has an arbitrary range of its address space unmapped by mntfs, and 6813502, in which a process address space is entirely consumed by orphaned mappings left behind by mntfs.

3. Solutions:

The most obvious solution seemed, at first, to involve storing the snapshot data within the corresponding vnode, thereby allowing the existing file system infrastructure to free the resources when no longer required. This, however, was rejected on account of complications inherent in the unprivileged user's resulting ability to allocate and retain kernel memory.

The only choice left has been to abandon the use of snapshots in their current form.
This necessitates some minor changes to the behaviour of /etc/mnttab and its API, described in mnttab(4) and getmntent(3C).

The current snapshot implementation means that, until a call to close() or resetmnttab(), clients reading /etc/mnttab will see those resources that were mounted at the time the snapshot was created, i.e. at the first read() or ioctl(). Thus resources that have been unmounted in the intervening time will still appear to be present.

With the proposed changes, a process will not see any resources that have been unmounted since the first call to read() or ioctl(), with one exception: if a call to read() terminates in the middle of a line, then the next read() will be obliged to consume the remainder of that line, even if the corresponding resource has been unmounted in the intervening time. This prevents the possibility of seemingly-garbled text. Note that where the remainder of a line is stored for possible later consumption, it is kept on the corresponding vnode's private structure.

4. Impact:

4.1 Overview:

The current API includes an ioctl for obtaining the number of mounted resources within the snapshot (MNTIOC_NMNTS) and another ioctl for obtaining the major and minor numbers for these resources (MNTIOC_GETDEVLIST). The first ioctl is used to obtain the size of an array to pass to the second ioctl.

Following the proposed changes, MNTIOC_NMNTS will return the number of resources currently mounted by the kernel. However, many of the mounted resources are usually hidden; they never appear during a read() of /etc/mnttab, and are visible to ioctl() only when specifically requested. The value returned by MNTIOC_NMNTS will therefore be viewed by the majority of consumers as an over-estimate of the number of mounted resources. In reality, the value obtained by MNTIOC_NMNTS will be defined as the upper limit on the number of mounted resources, and should be used only to determine the length of the array passed to MNTIOC_GETDEVLIST.
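
The two-step calling sequence this proposal defines (MNTIOC_NMNTS for an upper limit, then MNTIOC_GETDEVLIST, whose return value counts the meaningful entries) might be consumed roughly as in the sketch below. This is illustrative only and not part of the proposed change. The helper mnt_devlist(), the callback type mnt_ioctl_fn and the toy fake_mntfs_ioctl() are hypothetical names introduced here; the command values are assumed to come from <sys/mntio.h> on Solaris, with placeholder values substituted elsewhere so the logic can be compiled; and the device numbers in the toy model are arbitrary.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#if defined(__sun)
#include <sys/mntio.h>		/* assumed home of the real MNTIOC_* values */
#else
/* Placeholder command values so the sketch compiles off Solaris. */
#define	MNTIOC_NMNTS		1
#define	MNTIOC_GETDEVLIST	2
#endif

/* Hypothetical stand-in for ioctl(fd, cmd, arg) on an open /etc/mnttab. */
typedef int (*mnt_ioctl_fn)(int fd, int cmd, void *arg);

/*
 * Fetch (major, minor) pairs. Returns the number of meaningful pairs
 * (the MNTIOC_GETDEVLIST return value under the proposal) or -1 on
 * failure. On success the caller frees *pairs.
 */
int
mnt_devlist(mnt_ioctl_fn do_ioctl, int fd, uint32_t **pairs)
{
	uint32_t limit = 0;
	uint32_t *buf;
	int n;

	assert(pairs != NULL);
	*pairs = NULL;

	/* Step 1: an upper limit only; it may count hidden resources. */
	if (do_ioctl(fd, MNTIOC_NMNTS, &limit) < 0)
		return (-1);

	/* Two uint32_t per resource; never ask calloc for zero elements. */
	buf = calloc(limit > 0 ? (size_t)limit * 2 : 1, sizeof (uint32_t));
	if (buf == NULL)
		return (-1);

	/* Step 2: the return value says how many pairs are meaningful. */
	n = do_ioctl(fd, MNTIOC_GETDEVLIST, buf);
	if (n < 0 || (uint32_t)n > limit) {
		free(buf);
		return (-1);
	}
	*pairs = buf;
	return (n);
}

/* Toy model of the proposed semantics: 5 mounted, only 3 visible. */
int
fake_mntfs_ioctl(int fd, int cmd, void *arg)
{
	static const uint32_t visible[3][2] = { {32, 0}, {32, 7}, {181, 2} };
	uint32_t *out = arg;
	int i;

	(void) fd;
	if (cmd == MNTIOC_NMNTS) {
		*out = 5;	/* upper limit, hidden entries included */
		return (0);
	}
	if (cmd == MNTIOC_GETDEVLIST) {
		for (i = 0; i < 3; i++) {
			out[2 * i] = visible[i][0];
			out[2 * i + 1] = visible[i][1];
		}
		return (3);	/* entries beyond this stay undefined */
	}
	return (-1);
}
```

Against a real /etc/mnttab the callback would be a thin wrapper around ioctl(2), since ioctl() itself is variadic and cannot be passed directly. Only the first n pairs should be read; the rest of the buffer carries no meaning.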
MNTIOC_GETDEVLIST will, following the proposed changes, populate the supplied array with the major and minor numbers of only those mounted resources that are visible to the user. Typically, therefore, this will leave many entries in the supplied array undefined. With the proposed changes, the MNTIOC_GETDEVLIST ioctl() itself will return the number of visible mounted resources, and hence the number of meaningful entries in the supplied array. In the current mntfs implementation, an ioctl() for MNTIOC_GETDEVLIST does not employ its return value for anything other than to indicate an error.

In theory, then, this change introduces a backwards incompatibility: existing code that uses MNTIOC_NMNTS and then MNTIOC_GETDEVLIST to obtain the major and minor numbers of mounted resources will find that the last entries are meaningless. However, MNTIOC_GETDEVLIST has not worked since S10 FCS: it now returns nonsense, as described in 6814666.

Implementing the proposed changes calls for additions to the zone_t and vfs_t structs. The zone_t will acquire a pointer to an avl_tree_t, and the vfs_t will acquire a pointer to a newly-defined structure. The purpose is to allow each vfs_t to be stored in an AVL tree, sorted by a unique high-resolution time. This is to allow rapid location of the next available vfs_t in the mnttab table. If its predecessor were unmounted then there would be no vfs_next pointer to follow, and a linear search would otherwise be required from the start of the circularly-linked list.

4.2 Interface changes:

1. The MNTIOC_GETDEVLIST command is modified so that the calling ioctl() returns the number of mounted resources represented in the supplied array, which is the same as the number of visible resources mounted on the system. This interface will be Uncommitted.

2. The vfs struct acquires a new member, vfs_mntmeta, which is a pointer to a new, private structure with type 'struct vfs_mntmeta'. The new member and the private structure will constitute a Private interface.

3. The zone struct acquires a new member, zone_vfstree, which is a pointer to an avl_tree_t. The new member will constitute a Private interface.

5. Release binding:

Patch.

6. Documentation impact:

Changes to the mnttab(4) and getmntent(3C) man pages:

*** mnttab.old Thu Jun 11 14:40:19 2009
--- mnttab.new Thu Jun 11 14:38:09 2009
***************
*** 47,66 ****
  IOCTLS
       The following ioctl(2) calls are supported:

!      MNTIOC_NMNTS          Returns the count of mounted resources
!                            in the current snapshot in the uint32_t
!                            pointed to by arg.

!      MNTIOC_GETDEVLIST     Returns an array of uint32_t's that is
!                            twice as long as the length returned by
!                            MNTIOC_NMNTS. Each pair of numbers is
!                            the major and minor device number for
!                            the file system at the corresponding
!                            line in the current /etc/mnttab
!                            snapshot. arg points to the memory
!                            buffer to receive the device number
!                            information.

       MNTIOC_SETTAG         Sets a tag word into the options list for
                             a mounted file system. A tag is a notation
                             that will appear in the
--- 47,87 ----
  IOCTLS
       The following ioctl(2) calls are supported:

!      MNTIOC_NMNTS          Obtains the upper limit on the number
!                            of mounted resources. arg points to a
!                            uint32_t; this will be set to the upper
!                            limit on the number of mounted
!                            resources that will be identified by a
!                            subsequent MNTIOC_GETDEVLIST.

!      MNTIOC_GETDEVLIST     Obtains the actual number of mounted
!                            resources, together with their major
!                            and minor numbers. arg points to an
!                            array of uint_ts that must be at least
!                            twice as long as the length obtained by
!                            MNTIOC_NMNTS. The array will contain a
!                            pair of numbers for each mounted
!                            resource, comprising its major and
!                            minor numbers.

+                            A resource will not be represented in
+                            the array if it was mounted after the
+                            preceding MNTIOC_NMNTS command. It is
+                            an error to use MNTIOC_GETDEVLIST
+                            without having first used MNTIOC_NMNTS.
+
+                            The number of mounted resources actu-
+                            ally represented in the array will be
+                            returned by the call to ioctl() itself.
+                            The values of any remaining elements of
+                            the array are undefined.
+
+                            A process that has used either
+                            MNTIOC_NMNTS or MNTIOC_GETDEVLIST must
+                            call resetmnttab(3C) before
+                            getmntent(3C), getextmntent(3C) or
+                            getmntany(3C).
+
       MNTIOC_SETTAG         Sets a tag word into the options list for
                             a mounted file system. A tag is a notation
                             that will appear in the
***************
*** 101,109 ****
                       location.

       EINVAL          The tag specified in a MNTIOC_SETTAG call
!                      already exists as a file system option, or
!                      the tag specified in a MNTIOC_CLRTAG call
!                      does not exist.

       ENAMETOOLONG    The tag specified in a MNTIOC_SETTAG call
                       is too long or the tag would make the total
--- 122,132 ----
                       location.

       EINVAL          The tag specified in a MNTIOC_SETTAG call
!                      already exists as a file system option, the
!                      tag specified in a MNTIOC_CLRTAG call does
!                      not exist or a request for MNTIOC_GETDEVLIST
!                      was made without a prior request for
!                      MNTIOC_NMNTS.

       ENAMETOOLONG    The tag specified in a MNTIOC_SETTAG call
                       is too long or the tag would make the total
***************
*** 144,156 ****
       ments.

  NOTES
!      The snapshot of the mnttab information is taken any time a
!      read(2) is performed at offset 0 (the beginning) of the
!      mnttab file. The file modification time returned by stat(2)
!      for the mnttab file is the time of the last change to
!      mounted file system information. A poll(2) system call
!      requesting a POLLRDBAND event can be used to block and wait
!      for the system's mounted file system information to be dif-
!      ferent from the most recent snapshot since the mnttab file
!      was opened.
--- 167,204 ----
       ments.

  NOTES
!      During a call to read(2) of /etc/mnttab, the corresponding
!      in-kernel information cannot change. However, it will do so
!      between successive calls to read(2) if, for example,
!      resources are unmounted. The underlying file system, mntfs,
!      implements two features to ensure that /etc/mnttab will con-
!      tain sensible data even if there are changes to the in-
!      kernel table of mounted resources.
!
!      Firstly, if a call to read(2) terminates only part of the
!      way through a line, then the next call to read(2) will start
!      by reading the remainder of the interrupted line, even if
!      the corresponding resource has been unmounted in the inter-
!      vening time.
!
!      Secondly, successive calls to read(2) will return 0 after
!      reading the newest resource that was mounted at the time of
!      the first call to read(2), even if, in the intervening time,
!      additional resources have been mounted and are still
!      present.
!
!      Following a rewind(3C) of /etc/mnttab, or a call to
!      resetmnttab(3C), the next call to read(2) will be considered
!      the first: any saved remainder will be discarded and all
!      resources mounted at that time are eligible to be read by
!      subsequent calls to read(2). /etc/mnttab does not support
!      the use of a file offset for any purpose other than rewind-
!      ing the file.
!
!      The file modification time returned by stat(2) for the
!      mnttab file is the time of the last change to mounted file
!      system information. A poll(2) system call requesting a
!      POLLRDBAND event can be used to block and wait for the
!      system's mounted file system information to be different
!      from that at the time of the first read(2) of mnttab.

*** getmntent.old Thu Jun 11 14:37:35 2009
--- getmntent.new Thu Jun 11 14:41:24 2009
***************
*** 40,51 ****

       Each getmntent() call causes a new line to be read from the
       mnttab file. Successive calls can be used to search the
!      entire list. The getmntany() function searches the file
!      referenced by fp until a match is found between a line in
!      the file and mpref. A match occurs if all non-null entries
!      in mpref match the corresponding fields in the file. These
!      functions do not open, close, or rewind the file.

    getextmntent()
       The getextmntent() function is an extended version of the
       getmntent() function that returns, in addition to the infor-
--- 40,58 ----

       Each getmntent() call causes a new line to be read from the
       mnttab file. Successive calls can be used to search the
!      entire list, although mnttab entries added by the kernel
!      after the first call to getmntent() will be ignored. Follow-
!      ing a call to resetmnttab(), the next call to getmntent()
!      will be considered the first: all resources mounted at that
!      time will be eligible to be read by subsequent calls to
!      getmntent().

+      The getmntany() function searches the file referenced by fp
+      until a match is found between a line in the file and mpref.
+      A match occurs if all non-null entries in mpref match the
+      corresponding fields in the file. These functions do not
+      open, close, or rewind the file.
+
    getextmntent()
       The getextmntent() function is an extended version of the
       getmntent() function that returns, in addition to the infor-
***************
*** 53,63 ****
       of the mounted resource to which the line in mnttab
       corresponds. The getextmntent() function also fills in the
       extmntent structure defined in the <sys/mnttab.h> header.
!      For getextmntent() to function properly, it must be notified
!      when the mnttab file has been reopened or rewound since a
!      previous getextmntent() call. This notification is accom-
!      plished by calling resetmnttab(). Otherwise, it behaves
!      exactly as getmntent() described above.

       The data pointed to by the mnttab structure members are
       stored in a static area and must be copied to be saved
--- 60,67 ----
       of the mounted resource to which the line in mnttab
       corresponds. The getextmntent() function also fills in the
       extmntent structure defined in the <sys/mnttab.h> header.
!      Otherwise, it behaves exactly as getmntent() described
!      above.

       The data pointed to by the mnttab structure members are
       stored in a static area and must be copied to be saved
***************
*** 77,89 ****
       sition purposes.

    resetmnttab()
!      The resetmnttab() function notifies getextmntent() to reload
!      from the kernel the device information that corresponds to
!      the new snapshot of the mnttab information (see mnttab(4)).
!      Subsequent getextmntent() calls then return correct
!      extmnttab information. This function should be called when-
!      ever the mnttab file is either rewound or closed and reo-
!      pened before any calls are made to getextmntent().

  RETURN VALUES
    getmntent() and getmntany()
--- 81,91 ----
       sition purposes.

    resetmnttab()
!      The resetmnttab() function causes the next call to
!      getmntent(), getextmntent() or getmntany() to behave as
!      though /etc/mnttab had just been opened. In addition, this
!      function will have a similar effect on read(2); see
!      mnttab(4) for more details.

  RETURN VALUES
    getmntent() and getmntany()

7. References:

1. CR 6394241 mntfs is not exec safe
2. CR 6813502 mntfs will leak mappings when called from a forking MT program.
3. CR 6814666 MNTIOC_GETDEVLIST produces nonsense

6. Resources and Schedule
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
               ON
    6.5. ARC review type: FastTrack
    6.6. ARC Exposure: open