A little house cleaning. After this case was approved, the project team decided to take a different approach, which was submitted and approved in PSARC 2009/566. Since the approach in 2009/352 is no longer valid, I am marking it as withdrawn to avoid future confusion.
> Brian Utterback wrote: >> I am sponsoring this fasttrack on behalf of Robert Harris. The >> timeout is set to 06/19/2009. Requested binding is patch. >> Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI >> This information is Copyright 2009 Sun Microsystems >> 1. Introduction >> 1.1. Project/Component Working Name: >> Abandon the use of snapshots in mntfs. >> 1.2. Name of Document Author/Supplier: >> Author: Robert Harris >> 1.3 Date of This Document: >> 12 June, 2009 >> 4. Technical Description >> 1. Proposal: >> >> Abandon the use of snapshots in mntfs. >> >> >> 2. The Problem: >> >> The contents of /etc/mnttab are created by mntfs on demand. >> mntfs parses the in-kernel mnttab structures to create a >> snapshot that can be used to satisfy subsequent calls to >> read() or ioctl(). The snapshot is stored by the kernel >> within the address space of the process that made the first >> call to read() or ioctl(). The enclosing mapping is removed >> from the calling process's address space by mntfs on last >> close(). >> >> The snapshot-in-userland design has a flaw: the kernel cannot >> determine whether or not a close() is a specific process's >> last if the vnode count is greater than 1. This is because >> there is no way to determine whether a count that is greater >> than one has originated from dup(), from fork() or from >> both. >> >> This means that mntfs is unable to ensure that every >> insertion of a mapping into a process's address space is >> paired with a corresponding deletion. Two specific >> manifestations are 6394241, in which a newly-execed process >> has an arbitrary range of its address space unmapped by >> mntfs, and 6813502, in which a process address space is >> entirely consumed by orphaned mappings left behind by mntfs. >> >> >> 3. 
Solutions: >> >> The most obvious solution seemed, at first, to involve >> storing the snapshot data within the corresponding vnode, >> thereby allowing the existing file system infrastructure to >> free the resources when no longer required. This, however, >> was rejected on account of complications inherent in the >> unprivileged user's resulting ability to allocate and retain >> kernel memory. >> >> The only choice left has been to abandon the use of snapshots >> in their current form. This necessitates some minor changes >> to the behaviour of /etc/mnttab and its API, described in >> mnttab(4) and getmntent(3C). >> >> The current snapshot implementation means that, until a call >> to close() or resetmnttab(), clients reading /etc/mnttab will >> see those resources that were mounted at the time the >> snapshot was created, i.e. at the first read() or ioctl(). >> Thus resources that have been unmounted in the intervening >> time will still appear to be present. >> >> With the proposed changes, a process will not see any >> resources that have been unmounted since the first call to >> read() or ioctl(), with one exception: if a call to read() >> terminates in the middle of a line, then the next read() will >> be obliged to consume the remainder of that line, even if the >> corresponding resource has been unmounted in the intervening >> time. This prevents the possibility of seemingly-garbled >> text. >> >> Note that where the remainder of a line is stored for >> possible later consumption, it is kept on the corresponding >> vnode's private structure. >> >> >> 4. Impact: >> >> 4.1 Overview: >> >> The current API includes an ioctl for obtaining the number of >> mounted resources within the snapshot (MNTIOC_NMNTS) and >> another ioctl for obtaining the major and minor numbers for >> these resources (MNTIOC_GETDEVLIST). The first ioctl is used >> to obtain the size of an array to pass to the second ioctl. 
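For readers who have not used these two ioctls, here is a minimal sketch of the two-step calling pattern described just above: ask for the count, size the array from it, then ask for the device list. It runs against a made-up in-memory table rather than the real mntfs ioctls. The names model_mount, model_table, model_nmnts(), model_getdevlist() and fetch_devlist() are hypothetical stand-ins invented for this illustration, and the major/minor values are arbitrary. On a real system the two model calls would be ioctl() requests for MNTIOC_NMNTS and MNTIOC_GETDEVLIST on an open /etc/mnttab descriptor.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one mounted resource; not a kernel structure. */
struct model_mount {
	uint32_t major;
	uint32_t minor;
};

/* Made-up table playing the role of the snapshot; values are arbitrary. */
static const struct model_mount model_table[] = {
	{ 85, 0 }, { 85, 1 }, { 302, 7 }
};
#define	MODEL_NMOUNTS	(sizeof (model_table) / sizeof (model_table[0]))

/* Models MNTIOC_NMNTS: stores the count in the uint32_t pointed to by arg. */
static int
model_nmnts(uint32_t *arg)
{
	*arg = (uint32_t)MODEL_NMOUNTS;
	return (0);
}

/*
 * Models MNTIOC_GETDEVLIST: writes one (major, minor) pair per resource.
 * Like the real request, it takes no length, so the caller must have sized
 * the buffer from the count obtained above.
 */
static int
model_getdevlist(uint32_t *arg)
{
	size_t i;

	for (i = 0; i < MODEL_NMOUNTS; i++) {
		arg[2 * i] = model_table[i].major;
		arg[2 * i + 1] = model_table[i].minor;
	}
	return (0);
}

/*
 * Caller-side pattern: the first call sizes the array for the second.
 * Returns a malloc'd array of 2 * (*countp) values, or NULL on failure.
 */
static uint32_t *
fetch_devlist(uint32_t *countp)
{
	uint32_t n;
	uint32_t *devs;
	size_t len;

	assert(countp != NULL);
	if (model_nmnts(&n) != 0)
		return (NULL);

	len = (size_t)n * 2;		/* two uint32_t values per resource */
	devs = calloc(len != 0 ? len : 1, sizeof (*devs));
	if (devs == NULL)
		return (NULL);

	if (model_getdevlist(devs) != 0) {
		free(devs);
		return (NULL);
	}
	*countp = n;
	return (devs);
}
```

A caller would invoke fetch_devlist(&n), walk the n pairs, and free() the result. The sketch only shows why the count has to be fetched first; it does not attempt to model how the count and the list can disagree when mounts change between the two calls.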
>> >> Following the proposed changes, MNTIOC_NMNTS will return the >> number of resources currently mounted by the kernel. >> However, many of the mounted resources are usually hidden; >> they never appear during a read() of /etc/mnttab, and are >> visible to ioctl() only when specifically requested. The >> value returned by MNTIOC_NMNTS will therefore be viewed by >> the majority of consumers as an over-estimate of the number >> of mounted resources. In reality, the value obtained by >> MNTIOC_NMNTS will be defined as the upper-limit on the number >> of mounted resources, and should be used only to determine >> the length of the array passed to MNTIOC_GETDEVLIST. >> >> MNTIOC_GETDEVLIST will, following the proposed changes, >> populate the supplied array with the major and minor >> numbers of only those mounted resources that are >> visible to the user. Typically, hence, this will leave >> many entries in the supplied array undefined. With >> the proposed changes, the MNTIOC_GETDEVLIST ioctl() >> itself will return the number of mounted resources, >> and hence the number of meaningful entries in the >> supplied array. In the current mntfs implementation, >> an ioctl() for MNTIOC_GETDEVLIST does not employ >> its return value for anything other than to indicate >> an error. >> >> In theory, then, this change introduces a backwards >> incompatibility: existing code that uses MNTIOC_NMNTS and >> then MNTIOC_GETDEVLIST to obtain the major and minor numbers >> of mounted resources will find that the last entries are >> meaningless. However, MNTIOC_GETDEVLIST has not worked since >> S10 FCS: it now returns nonsense, as described in 6814666. >> >> Implementing the proposed changes calls for additions to the >> zone_t and vfs_t structs. The zone_t will acquire a pointer >> to an avl_tree_t, and the vfs_t will acquire a pointer to a >> newly-defined structure. The purpose is to allow each vfs_t >> to be stored in an AVL tree, sorted by a unique >> high-resolution time. 
This is to allow rapid location of the >> next available vfs_t in the mnttab table. If its predecessor >> were unmounted then there would be no vfs_next pointer to >> follow, and a linear search would otherwise be required from >> the start of the circularly-linked list. >> >> 4.2 Interface changes: >> >> 1. The MNTIOC_GETDEVLIST command is modified so that the >> calling ioctl() returns the number of mounted resources >> represented in the supplied array, which is the same >> as the number of visible resources mounted on the system. >> This interface will be Uncommitted. >> 2. The vfs struct acquires a new member, vfs_mntmeta, which >> is a pointer to a new, private structure with type >> 'struct vfs_mntmeta'. The new member and the private >> structure will constitute a Private interface. >> 3. The zone struct acquires a new member, zone_vfstree, >> which is a pointer to an avl_tree_t. The new member >> will constitute a Private interface. >> >> >> 5. Release binding: >> >> Patch. >> >> >> 6. Documentation impact: >> >> Changes to the mnttab(4) and getmntent(3C) man pages: >> >> *** mnttab.old Thu Jun 11 14:40:19 2009 >> --- mnttab.new Thu Jun 11 14:38:09 2009 >> *************** >> *** 47,66 **** >> IOCTLS >> The following ioctl(2) calls are supported: >> ! MNTIOC_NMNTS Returns the count of mounted resources >> ! in the current snapshot in the uint32_t >> ! pointed to by arg. >> ! MNTIOC_GETDEVLIST Returns an array of uint32_t's that is >> ! twice as long as the length returned by >> ! MNTIOC_NMNTS. Each pair of numbers is >> ! the major and minor device number for >> ! the file system at the corresponding >> ! line in the current /etc/mnttab >> ! snapshot. arg points to the memory >> ! buffer to receive the device number >> ! information. >> MNTIOC_SETTAG Sets a tag word into the options list >> for a mounted file system. A tag is a >> notation that will appear in the >> --- 47,87 ---- >> IOCTLS >> The following ioctl(2) calls are supported: >> ! 
MNTIOC_NMNTS Obtains the upper limit on the number >> ! of mounted resources. arg points to a >> ! uint32_t; this will be set to the upper >> ! limit on the number of mounted >> ! resources that will be identified by a >> ! subsequent MNTIOC_GETDEVLIST. >> ! MNTIOC_GETDEVLIST Obtains the actual number of mounted >> ! resources, together with their major >> ! and minor numbers. arg points to an >> ! array of uint_ts that must be at least >> ! twice as long as the length obtained by >> ! MNTIOC_NMNTS. The array will contain a >> ! pair of numbers for each mounted >> ! resource, comprising its major and >> ! minor numbers. >> + A resource will not be represented in >> + the array if it was mounted after the >> + preceding MNTIOC_NMNTS command. It is >> + an error to use MNTIOC_GETDEVLIST >> + without having first used MNTIOC_NMNTS. >> + + The number of mounted resources actu- >> + ally represented in the array will be >> + returned by the call to ioctl() itself. >> + The values of any remaining elements of >> + the array are undefined. >> + + A process that has used either >> + MNTIOC_NMNTS or MNTIOC_GETDEVLIST must >> + call resetmnttab(3C) before >> + getmntent(3C), getextmntent(3C) or >> + getmntany(3C). >> + MNTIOC_SETTAG Sets a tag word into the options list >> for a mounted file system. A tag is a >> notation that will appear in the >> *************** >> *** 101,109 **** >> location. >> EINVAL The tag specified in a MNTIOC_SETTAG call >> ! already exists as a file system option, or >> ! the tag specified in a MNTIOC_CLRTAG call >> ! does not exist. >> ENAMETOOLONG The tag specified in a MNTIOC_SETTAG call is >> too long or the tag would make the total >> --- 122,132 ---- >> location. >> EINVAL The tag specified in a MNTIOC_SETTAG call >> ! already exists as a file system option, the >> ! tag specified in a MNTIOC_CLRTAG call does >> ! not exist or a request for MNTIOC_GETDEVLIST >> ! was made without a prior request for >> ! MNTIOC_NMNTS. 
>> ENAMETOOLONG The tag specified in a MNTIOC_SETTAG call is >> too long or the tag would make the total >> *************** >> *** 144,156 **** >> ments. >> NOTES >> ! The snapshot of the mnttab information is taken any time a >> ! read(2) is performed at offset 0 (the beginning) of the >> ! mnttab file. The file modification time returned by stat(2) >> ! for the mnttab file is the time of the last change to >> ! mounted file system information. A poll(2) system call >> ! requesting a POLLRDBAND event can be used to block and wait >> ! for the system's mounted file system information to be dif- >> ! ferent from the most recent snapshot since the mnttab file >> ! was opened. >> --- 167,204 ---- >> ments. >> NOTES >> ! During a call to read(2) of /etc/mnttab, the corresponding >> ! in-kernel information cannot change. However, it will do so >> ! between successive calls to read(2) if, for example, >> ! resources are unmounted. The underlying file system, mntfs, >> ! implements two features to ensure that /etc/mnttab will con- >> ! tain sensible data even if there are changes to the in- >> ! kernel table of mounted resources. >> ! ! Firstly, if a call to read(2) terminates only part of the >> ! way through a line, then the next call to read(2) will start >> ! by reading the remainder of the interrupted line, even if >> ! the corresponding resource has been unmounted in the inter- >> ! vening time. >> ! ! Secondly, successive calls to read(2) will return 0 after >> ! reading the newest resource that was mounted at the time of >> ! the first call to read(2), even if, in the intervening time, >> ! additional resources have been mounted and are still >> ! present. >> ! ! Following a rewind(3C) of /etc/mnttab, or a call to >> ! resetmnttab(3C), the next call to read(2) will be considered >> ! the first: any saved remainder will be discarded and all >> ! resources mounted at that time are eligible to be read by >> ! subsequent calls to read(2). 
/etc/mnttab does not support >> ! the use of a file offset for any purpose other than rewind- >> ! ing the file. >> ! ! The file modification time returned by stat(2) for the >> ! mnttab file is the time of the last change to mounted file >> ! system information. A poll(2) system call requesting a >> ! POLLRDBAND event can be used to block and wait for the >> ! system's mounted file system information to be different >> ! from that at the time of the first read(2) of mnttab. >> >> *** getmntent.old Thu Jun 11 14:37:35 2009 >> --- getmntent.new Thu Jun 11 14:41:24 2009 >> *************** >> *** 40,51 **** >> Each getmntent() call causes a new line to be read from the >> mnttab file. Successive calls can be used to search the >> ! entire list. The getmntany() function searches the file >> ! referenced by fp until a match is found between a line in >> ! the file and mpref. A match occurs if all non-null entries >> ! in mpref match the corresponding fields in the file. These >> ! functions do not open, close, or rewind the file. >> getextmntent() >> The getextmntent() function is an extended version of the >> getmntent() function that returns, in addition to the infor- >> --- 40,58 ---- >> Each getmntent() call causes a new line to be read from the >> mnttab file. Successive calls can be used to search the >> ! entire list, although mnttab entries added by the kernel >> ! after the first call to getmntent() will be ignored. Follow- >> ! ing a call to resetmnttab(), the next call to getmntent() >> ! will be considered the first: all resources mounted at that >> ! time will be eligible to be read by subsequent calls to >> ! getmntent(). >> + The getmntany() function searches the file referenced by fp >> + until a match is found between a line in the file and mpref. >> + A match occurs if all non-null entries in mpref match the >> + corresponding fields in the file. These functions do not >> + open, close, or rewind the file. 
>> + getextmntent() >> The getextmntent() function is an extended version of the >> getmntent() function that returns, in addition to the infor- >> *************** >> *** 53,63 **** >> of the mounted resource to which the line in mnttab >> corresponds. The getextmntent() function also fills in the >> extmntent structure defined in the <sys/mnttab.h> header. >> ! For getextmntent() to function properly, it must be notified >> ! when the mnttab file has been reopened or rewound since a >> ! previous getextmntent() call. This notification is accom- >> ! plished by calling resetmnttab(). Otherwise, it behaves >> ! exactly as getmntent() described above. >> The data pointed to by the mnttab structure members are >> stored in a static area and must be copied to be saved >> --- 60,67 ---- >> of the mounted resource to which the line in mnttab >> corresponds. The getextmntent() function also fills in the >> extmntent structure defined in the <sys/mnttab.h> header. >> ! Otherwise, it behaves exactly as getmntent() described >> ! above. >> The data pointed to by the mnttab structure members are >> stored in a static area and must be copied to be saved >> *************** >> *** 77,89 **** >> sition purposes. >> resetmnttab() >> ! The resetmnttab() function notifies getextmntent() to reload >> ! from the kernel the device information that corresponds to >> ! the new snapshot of the mnttab information (see mnttab(4)). >> ! Subsequent getextmntent() calls then return correct >> ! extmnttab information. This function should be called when- >> ! ever the mnttab file is either rewound or closed and reo- >> ! pened before any calls are made to getextmntent(). >> RETURN VALUES >> getmntent() and getmntany() >> --- 81,91 ---- >> sition purposes. >> resetmnttab() >> ! The resetmnttab() function causes the next call to >> ! getmntent(), getextmntent() or getmntany() to behave as >> ! though /etc/mnttab had just been opened. In addition, this >> ! 
function will have a similar effect on read(2); see >> ! mnttab(4) for more details. >> RETURN VALUES >> getmntent() and getmntany() >> >> >> 7. References: >> >> 1. CR 6394241 mntfs is not exec safe >> >> 2. CR 6813502 mntfs will leak mappings when called from a forking MT >> program. >> >> 3. CR 6814666 MNTIOC_GETDEVLIST produces nonsense >> >> 6. Resources and Schedule >> 6.4. Steering Committee requested information >> 6.4.1. Consolidation C-team Name: >> ON >> 6.5. ARC review type: FastTrack >> 6.6. ARC Exposure: open >> >> > -- blu It's bad civic hygiene to build technologies that could someday be used to facilitate a police state. - Bruce Schneier ---------------------------------------------------------------------- Brian Utterback - Solaris RPE, Sun Microsystems, Inc. Ph:877-259-7345, Em:brian.utterback-at-ess-you-enn-dot-kom