+1. -- Garrett
Brian Utterback wrote:
> I am sponsoring this fasttrack on behalf of Robert Harris. The
> timeout is set to 06/19/2009. Requested binding is patch.
>
> Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
> This information is Copyright 2009 Sun Microsystems
> 1. Introduction
>     1.1. Project/Component Working Name:
>          Abandon the use of snapshots in mntfs.
>     1.2. Name of Document Author/Supplier:
>          Author: Robert Harris
>     1.3  Date of This Document:
>          12 June, 2009
> 4. Technical Description
> 1. Proposal:
>
>     Abandon the use of snapshots in mntfs.
>
>
> 2. The Problem:
>
>     The contents of /etc/mnttab are created by mntfs on demand.
>     mntfs parses the in-kernel mnttab structures to create a
>     snapshot that can be used to satisfy subsequent calls to
>     read() or ioctl(). The snapshot is stored by the kernel
>     within the address space of the process that made the first
>     call to read() or ioctl(). The enclosing mapping is removed
>     from the calling process's address space by mntfs on last
>     close().
>
>     The snapshot-in-userland design has a flaw: the kernel cannot
>     determine whether or not a close() is a specific process's
>     last if the vnode count is greater than 1. This is because
>     there is no way to determine whether a count that is greater
>     than one has originated from dup(), from fork() or from
>     both.
>
>     This means that mntfs is unable to ensure that every
>     insertion of a mapping into a process's address space is
>     paired with a corresponding deletion. Two specific
>     manifestations are 6394241, in which a newly-execed process
>     has an arbitrary range of its address space unmapped by
>     mntfs, and 6813502, in which a process address space is
>     entirely consumed by orphaned mappings left behind by mntfs.
>
>
> 3. Solutions:
>
>     The most obvious solution seemed, at first, to involve
>     storing the snapshot data within the corresponding vnode,
>     thereby allowing the existing file system infrastructure to
>     free the resources when no longer required. This, however,
>     was rejected on account of complications inherent in the
>     unprivileged user's resulting ability to allocate and retain
>     kernel memory.
>
>     The only choice left has been to abandon the use of snapshots
>     in their current form. This necessitates some minor changes
>     to the behaviour of /etc/mnttab and its API, described in
>     mnttab(4) and getmntent(3C).
>
>     The current snapshot implementation means that, until a call
>     to close() or resetmnttab(), clients reading /etc/mnttab will
>     see those resources that were mounted at the time the
>     snapshot was created, i.e. at the first read() or ioctl().
>     Thus resources that have been unmounted in the intervening
>     time will still appear to be present.
>
>     With the proposed changes, a process will not see any
>     resources that have been unmounted since the first call to
>     read() or ioctl(), with one exception: if a call to read()
>     terminates in the middle of a line, then the next read() will
>     be obliged to consume the remainder of that line, even if the
>     corresponding resource has been unmounted in the intervening
>     time. This prevents the possibility of seemingly-garbled
>     text.
>
>     Note that where the remainder of a line is stored for
>     possible later consumption, it is kept on the corresponding
>     vnode's private structure.
>
>
> 4. Impact:
>
>     4.1 Overview:
>
>     The current API includes an ioctl for obtaining the number of
>     mounted resources within the snapshot (MNTIOC_NMNTS) and
>     another ioctl for obtaining the major and minor numbers for
>     these resources (MNTIOC_GETDEVLIST). The first ioctl is used
>     to obtain the size of an array to pass to the second ioctl.
>
>     Following the proposed changes, MNTIOC_NMNTS will return the
>     number of resources currently mounted by the kernel.
>     However, many of the mounted resources are usually hidden;
>     they never appear during a read() of /etc/mnttab, and are
>     visible to ioctl() only when specifically requested. The
>     value returned by MNTIOC_NMNTS will therefore be viewed by
>     the majority of consumers as an over-estimate of the number
>     of mounted resources. In reality, the value obtained by
>     MNTIOC_NMNTS will be defined as the upper-limit on the number
>     of mounted resources, and should be used only to determine
>     the length of the array passed to MNTIOC_GETDEVLIST.
>
>     MNTIOC_GETDEVLIST will, following the proposed changes,
>     populate the supplied array with the major and minor
>     numbers of only those mounted resources that are
>     visible to the user. Typically, hence, this will leave
>     many entries in the supplied array undefined. With
>     the proposed changes, the MNTIOC_GETDEVLIST ioctl()
>     itself will return the number of mounted resources,
>     and hence the number of meaningful entries in the
>     supplied array. In the current mntfs implementation,
>     an ioctl() for MNTIOC_GETDEVLIST does not employ
>     its return value for anything other than to indicate
>     an error.
>
>     In theory, then, this change introduces a backwards
>     incompatibility: existing code that uses MNTIOC_NMNTS and
>     then MNTIOC_GETDEVLIST to obtain the major and minor numbers
>     of mounted resources will find that the last entries are
>     meaningless. However, MNTIOC_GETDEVLIST has not worked since
>     S10 FCS: it now returns nonsense, as described in 6814666.
>
>     Implementing the proposed changes calls for additions to the
>     zone_t and vfs_t structs. The zone_t will acquire a pointer
>     to an avl_tree_t, and the vfs_t will acquire a pointer to a
>     newly-defined structure. The purpose is to allow each vfs_t
>     to be stored in an AVL tree, sorted by a unique
>     high-resolution time. This is to allow rapid location of the
>     next available vfs_t in the mnttab table. If its predecessor
>     were unmounted then there would be no vfs_next pointer to
>     follow, and a linear search would otherwise be required from
>     the start of the circularly-linked list.
>
>     4.2 Interface changes:
>
>     1. The MNTIOC_GETDEVLIST command is modified so that the
>        calling ioctl() returns the number of mounted resources
>        represented in the supplied array, which is the same
>        as the number of visible resources mounted on the system.
>        This interface will be Uncommitted.
>
>     2. The vfs struct acquires a new member, vfs_mntmeta, which
>        is a pointer to a new, private structure with type
>        'struct vfs_mntmeta'. The new member and the private
>        structure will constitute a Private interface.
>
>     3. The zone struct acquires a new member, zone_vfstree,
>        which is a pointer to an avl_tree_t. The new member
>        will constitute a Private interface.
>
>
> 5. Release binding:
>
>     Patch.
>
>
> 6. Documentation impact:
>
>     Changes to the mnttab(4) and getmntent(3C) man pages:
>
> *** mnttab.old  Thu Jun 11 14:40:19 2009
> --- mnttab.new  Thu Jun 11 14:38:09 2009
> ***************
> *** 47,66 ****
>   IOCTLS
>        The following ioctl(2) calls are supported:
>
> !      MNTIOC_NMNTS            Returns the count of mounted resources
> !                              in the current snapshot in the uint32_t
> !                              pointed to by arg.
>
> !      MNTIOC_GETDEVLIST       Returns an array of uint32_t's that is
> !                              twice as long as the length returned by
> !                              MNTIOC_NMNTS. Each pair of numbers is
> !                              the major and minor device number for
> !                              the file system at the corresponding
> !                              line in the current /etc/mnttab
> !                              snapshot. arg points to the memory
> !                              buffer to receive the device number
> !                              information.
>
>        MNTIOC_SETTAG           Sets a tag word into the options list
>                                for a mounted file system. A tag is a
>                                notation that will appear in the
> --- 47,87 ----
>   IOCTLS
>        The following ioctl(2) calls are supported:
>
> !      MNTIOC_NMNTS            Obtains the upper limit on the number
> !                              of mounted resources. arg points to a
> !                              uint32_t; this will be set to the upper
> !                              limit on the number of mounted
> !                              resources that will be identified by a
> !                              subsequent MNTIOC_GETDEVLIST.
>
> !      MNTIOC_GETDEVLIST       Obtains the actual number of mounted
> !                              resources, together with their major
> !                              and minor numbers. arg points to an
> !                              array of uint_ts that must be at least
> !                              twice as long as the length obtained by
> !                              MNTIOC_NMNTS. The array will contain a
> !                              pair of numbers for each mounted
> !                              resource, comprising its major and
> !                              minor numbers.
>
> +                              A resource will not be represented in
> +                              the array if it was mounted after the
> +                              preceding MNTIOC_NMNTS command. It is
> +                              an error to use MNTIOC_GETDEVLIST
> +                              without having first used MNTIOC_NMNTS.
> +
> +                              The number of mounted resources actu-
> +                              ally represented in the array will be
> +                              returned by the call to ioctl() itself.
> +                              The values of any remaining elements of
> +                              the array are undefined.
> +
> +                              A process that has used either
> +                              MNTIOC_NMNTS or MNTIOC_GETDEVLIST must
> +                              call resetmnttab(3C) before
> +                              getmntent(3C), getextmntent(3C) or
> +                              getmntany(3C).
> +
>        MNTIOC_SETTAG           Sets a tag word into the options list
>                                for a mounted file system. A tag is a
>                                notation that will appear in the
> ***************
> *** 101,109 ****
>                        location.
>
>        EINVAL          The tag specified in a MNTIOC_SETTAG call
> !                      already exists as a file system option, or
> !                      the tag specified in a MNTIOC_CLRTAG call
> !                      does not exist.
>
>        ENAMETOOLONG    The tag specified in a MNTIOC_SETTAG call is
>                        too long or the tag would make the total
> --- 122,132 ----
>                        location.
>
>        EINVAL          The tag specified in a MNTIOC_SETTAG call
> !                      already exists as a file system option, the
> !                      tag specified in a MNTIOC_CLRTAG call does
> !                      not exist or a request for MNTIOC_GETDEVLIST
> !                      was made without a prior request for
> !                      MNTIOC_NMNTS.
>
>        ENAMETOOLONG    The tag specified in a MNTIOC_SETTAG call is
>                        too long or the tag would make the total
> ***************
> *** 144,156 ****
>        ments.
>
>   NOTES
> !      The snapshot of the mnttab information is taken any time a
> !      read(2) is performed at offset 0 (the beginning) of the
> !      mnttab file. The file modification time returned by stat(2)
> !      for the mnttab file is the time of the last change to
> !      mounted file system information. A poll(2) system call
> !      requesting a POLLRDBAND event can be used to block and wait
> !      for the system's mounted file system information to be dif-
> !      ferent from the most recent snapshot since the mnttab file
> !      was opened.
>
> --- 167,204 ----
>        ments.
>
>   NOTES
> !      During a call to read(2) of /etc/mnttab, the corresponding
> !      in-kernel information cannot change. However, it will do so
> !      between successive calls to read(2) if, for example,
> !      resources are unmounted. The underlying file system, mntfs,
> !      implements two features to ensure that /etc/mnttab will con-
> !      tain sensible data even if there are changes to the in-
> !      kernel table of mounted resources.
> !
> !      Firstly, if a call to read(2) terminates only part of the
> !      way through a line, then the next call to read(2) will start
> !      by reading the remainder of the interrupted line, even if
> !      the corresponding resource has been unmounted in the inter-
> !      vening time.
> !
> !      Secondly, successive calls to read(2) will return 0 after
> !      reading the newest resource that was mounted at the time of
> !      the first call to read(2), even if, in the intervening time,
> !      additional resources have been mounted and are still
> !      present.
> !
> !      Following a rewind(3C) of /etc/mnttab, or a call to
> !      resetmnttab(3C), the next call to read(2) will be considered
> !      the first: any saved remainder will be discarded and all
> !      resources mounted at that time are eligible to be read by
> !      subsequent calls to read(2). /etc/mnttab does not support
> !      the use of a file offset for any purpose other than rewind-
> !      ing the file.
> !
> !      The file modification time returned by stat(2) for the
> !      mnttab file is the time of the last change to mounted file
> !      system information. A poll(2) system call requesting a
> !      POLLRDBAND event can be used to block and wait for the
> !      system's mounted file system information to be different
> !      from that at the time of the first read(2) of mnttab.
>
>
> *** getmntent.old       Thu Jun 11 14:37:35 2009
> --- getmntent.new       Thu Jun 11 14:41:24 2009
> ***************
> *** 40,51 ****
>
>        Each getmntent() call causes a new line to be read from the
>        mnttab file. Successive calls can be used to search the
> !      entire list. The getmntany() function searches the file
> !      referenced by fp until a match is found between a line in
> !      the file and mpref. A match occurs if all non-null entries
> !      in mpref match the corresponding fields in the file. These
> !      functions do not open, close, or rewind the file.
>
>      getextmntent()
>        The getextmntent() function is an extended version of the
>        getmntent() function that returns, in addition to the infor-
> --- 40,58 ----
>
>        Each getmntent() call causes a new line to be read from the
>        mnttab file. Successive calls can be used to search the
> !      entire list, although mnttab entries added by the kernel
> !      after the first call to getmntent() will be ignored. Follow-
> !      ing a call to resetmnttab(), the next call to getmntent()
> !      will be considered the first: all resources mounted at that
> !      time will be eligible to be read by subsequent calls to
> !      getmntent().
>
> +      The getmntany() function searches the file referenced by fp
> +      until a match is found between a line in the file and mpref.
> +      A match occurs if all non-null entries in mpref match the
> +      corresponding fields in the file. These functions do not
> +      open, close, or rewind the file.
> +
>      getextmntent()
>        The getextmntent() function is an extended version of the
>        getmntent() function that returns, in addition to the infor-
> ***************
> *** 53,63 ****
>        of the mounted resource to which the line in mnttab
>        corresponds. The getextmntent() function also fills in the
>        extmntent structure defined in the <sys/mnttab.h> header.
> !      For getextmntent() to function properly, it must be notified
> !      when the mnttab file has been reopened or rewound since a
> !      previous getextmntent() call. This notification is accom-
> !      plished by calling resetmnttab(). Otherwise, it behaves
> !      exactly as getmntent() described above.
>
>        The data pointed to by the mnttab structure members are
>        stored in a static area and must be copied to be saved
> --- 60,67 ----
>        of the mounted resource to which the line in mnttab
>        corresponds. The getextmntent() function also fills in the
>        extmntent structure defined in the <sys/mnttab.h> header.
> !      Otherwise, it behaves exactly as getmntent() described
> !      above.
>
>        The data pointed to by the mnttab structure members are
>        stored in a static area and must be copied to be saved
> ***************
> *** 77,89 ****
>        sition purposes.
>
>      resetmnttab()
> !      The resetmnttab() function notifies getextmntent() to reload
> !      from the kernel the device information that corresponds to
> !      the new snapshot of the mnttab information (see mnttab(4)).
> !      Subsequent getextmntent() calls then return correct
> !      extmnttab information. This function should be called when-
> !      ever the mnttab file is either rewound or closed and reo-
> !      pened before any calls are made to getextmntent().
>
>   RETURN VALUES
>      getmntent() and getmntany()
> --- 81,91 ----
>        sition purposes.
>
>      resetmnttab()
> !      The resetmnttab() function causes the next call to
> !      getmntent(), getextmntent() or getmntany() to behave as
> !      though /etc/mnttab had just been opened. In addition, this
> !      function will have a similar effect on read(2); see
> !      mnttab(4) for more details.
>
>   RETURN VALUES
>      getmntent() and getmntany()
>
>
> 7. References:
>
>     1. CR 6394241 mntfs is not exec safe
>
>     2. CR 6813502 mntfs will leak mappings when called from a forking MT program.
>
>     3. CR 6814666 MNTIOC_GETDEVLIST produces nonsense
>
> 6. Resources and Schedule
>     6.4. Steering Committee requested information
>         6.4.1. Consolidation C-team Name:
>                 ON
>     6.5. ARC review type: FastTrack
>     6.6. ARC Exposure: open
>
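
For anyone updating a consumer, the proposed two-step contract (MNTIOC_NMNTS sizes the array, MNTIOC_GETDEVLIST returns how many pairs are meaningful) might be sketched roughly as below. This is only an illustration under assumptions, not code from the case: the request codes are placeholders standing in for the real values in <sys/mntio.h>, and the ioctl is passed in as a function pointer (the hypothetical ioctl_fn type) with a stub "kernel" so the sketch can be compiled without mntfs. The names get_visible_devs and stub_ioctl are invented for the example.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Placeholders standing in for MNTIOC_NMNTS and MNTIOC_GETDEVLIST.
 * The real request codes come from <sys/mntio.h>; these values are
 * assumptions made only so the sketch is self-contained.
 */
enum { REQ_NMNTS = 1, REQ_GETDEVLIST = 2 };

/* Same shape as ioctl(fd, request, arg); hypothetical, for illustration. */
typedef int (*ioctl_fn)(int fd, int request, void *arg);

/*
 * Fetch the (major, minor) pairs of the visible mounted resources.
 * On success, *out points to a malloc'd array whose first
 * 2 * (return value) entries are meaningful, and the number of pairs
 * is returned. Returns -1 on failure, leaving *out untouched.
 */
int
get_visible_devs(ioctl_fn do_ioctl, int fd, uint32_t **out)
{
	uint32_t limit = 0;
	uint32_t *devs;
	size_t nelem;
	int nvisible;

	/* Step 1: the upper limit, used only to size the array. */
	if (do_ioctl(fd, REQ_NMNTS, &limit) == -1)
		return (-1);

	/* Two entries per resource; allocate at least one element. */
	nelem = (limit != 0) ? 2 * (size_t)limit : 1;
	devs = calloc(nelem, sizeof (uint32_t));
	if (devs == NULL)
		return (-1);

	/* Step 2: the return value is the count of meaningful pairs. */
	nvisible = do_ioctl(fd, REQ_GETDEVLIST, devs);
	if (nvisible == -1) {
		free(devs);
		return (-1);
	}

	*out = devs;
	return (nvisible);
}

/* Stub "kernel": 5 resources mounted, only 2 visible to the caller. */
int
stub_ioctl(int fd, int request, void *arg)
{
	(void) fd;
	if (request == REQ_NMNTS) {
		*(uint32_t *)arg = 5;
		return (0);
	}
	if (request == REQ_GETDEVLIST) {
		uint32_t *devs = arg;
		devs[0] = 102; devs[1] = 0;	/* first visible resource */
		devs[2] = 102; devs[3] = 7;	/* second visible resource */
		return (2);
	}
	return (-1);
}
```

Calling get_visible_devs(stub_ioctl, -1, &devs) against the stub yields two pairs even though the array was sized for five; entries past the returned count should be treated as undefined, as the proposed mnttab(4) text says. A real consumer would pass a wrapper around ioctl(2) on an open /etc/mnttab descriptor instead of the stub.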