A little house cleaning. After this case was approved, the project
team decided to take a different approach, which was submitted and
approved as PSARC 2009/566. Since the approach in 2009/352 is no
longer valid, I am marking it as withdrawn to avoid future confusion.


> Brian Utterback wrote:
>> I am sponsoring this fasttrack on behalf of Robert Harris. The
>> timeout is set to 06/19/2009. Requested binding is patch.
>> Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
>> This information is Copyright 2009 Sun Microsystems
>> 1. Introduction
>>     1.1. Project/Component Working Name:
>>      Abandon the use of snapshots in mntfs.
>>     1.2. Name of Document Author/Supplier:
>>      Author:  Robert Harris
>>     1.3  Date of This Document:
>>     12 June, 2009
>> 4. Technical Description
>> 1. Proposal:
>>
>>     Abandon the use of snapshots in mntfs.
>>
>>
>> 2. The Problem:
>>
>>     The contents of /etc/mnttab are created by mntfs on demand.
>>     mntfs parses the in-kernel mnttab structures to create a
>>     snapshot that can be used to satisfy subsequent calls to
>>     read() or ioctl(). The snapshot is stored by the kernel
>>     within the address space of the process that made the first
>>     call to read() or ioctl(). The enclosing mapping is removed
>>     from the calling process's address space by mntfs on last
>>     close().
>>
>>     The snapshot-in-userland design has a flaw: the kernel cannot
>>     determine whether or not a close() is a specific process's
>>     last if the vnode count is greater than 1. This is because
>>     there is no way to determine whether a count that is greater
>>     than one has originated from dup(), from fork() or from
>>     both.
>>
>>     This means that mntfs is unable to ensure that every
>>     insertion of a mapping into a process's address space is
>>     paired with a corresponding deletion. Two specific
>>     manifestations are 6394241, in which a newly-execed process
>>     has an arbitrary range of its address space unmapped by
>>     mntfs, and 6813502, in which a process address space is
>>     entirely consumed by orphaned mappings left behind by mntfs.
>>
>>
>> 3. Solutions:
>>
>>     The most obvious solution seemed, at first, to involve
>>     storing the snapshot data within the corresponding vnode,
>>     thereby allowing the existing file system infrastructure to
>>     free the resources when no longer required. This, however,
>>     was rejected on account of complications inherent in the
>>     unprivileged user's resulting ability to allocate and retain
>>     kernel memory.
>>
>>     The only choice left has been to abandon the use of snapshots
>>     in their current form. This necessitates some minor changes
>>     to the behaviour of /etc/mnttab and its API, described in
>>     mnttab(4) and getmntent(3C).
>>
>>     The current snapshot implementation means that, until a call
>>     to close() or resetmnttab(), clients reading /etc/mnttab will
>>     see those resources that were mounted at the time the
>>     snapshot was created, i.e. at the first read() or ioctl().
>>     Thus resources that have been unmounted in the intervening
>>     time will still appear to be present.
>>     
>>     With the proposed changes, a process will not see any
>>     resources that have been unmounted since the first call to
>>     read() or ioctl(), with one exception: if a call to read()
>>     terminates in the middle of a line, then the next read() will
>>     be obliged to consume the remainder of that line, even if the
>>     corresponding resource has been unmounted in the intervening
>>     time. This prevents the possibility of seemingly-garbled
>>     text.
>>
>>     Note that where the remainder of a line is stored for
>>     possible later consumption, it is kept on the corresponding
>>     vnode's private structure.
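
The carry-over rule described above can be modelled in user space.
What follows is a minimal sketch under stated assumptions, not the
mntfs implementation: struct entry, struct reader and model_read()
are hypothetical names, the mount table is a plain array, and the
saved remainder lives in the reader state where the proposal keeps
it on the vnode's private structure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAXLINE 128

/* Hypothetical stand-in for one in-kernel mnttab entry. */
struct entry {
    const char *line;   /* text of the line, including '\n' */
    int mounted;        /* 0 once the resource has been unmounted */
};

/* Hypothetical per-open state, standing in for the vnode's private data. */
struct reader {
    size_t next;        /* index of the next entry to examine */
    char rem[MAXLINE];  /* unread tail of a partially read line */
    size_t remlen;      /* number of valid bytes in rem */
};

/* Copy up to len bytes into buf and return the number of bytes copied. */
size_t
model_read(struct reader *r, const struct entry *tab, size_t n,
    char *buf, size_t len)
{
    size_t done = 0;

    /* First consume any saved remainder, even if its resource is gone. */
    if (r->remlen > 0) {
        size_t take = r->remlen < len ? r->remlen : len;

        memcpy(buf, r->rem, take);
        memmove(r->rem, r->rem + take, r->remlen - take);
        r->remlen -= take;
        done = take;
    }

    while (done < len && r->next < n) {
        const struct entry *e = &tab[r->next++];
        size_t ll, take;

        if (!e->mounted)
            continue;   /* unmounted in the meantime: never shown */
        ll = strlen(e->line);
        assert(ll <= MAXLINE);
        take = ll < len - done ? ll : len - done;
        memcpy(buf + done, e->line, take);
        done += take;
        /* Save whatever did not fit so the next read finishes the line. */
        r->remlen = ll - take;
        memcpy(r->rem, e->line + take, r->remlen);
    }
    return (done);
}
```

If a read stops part of the way through an entry and that entry is
then unmounted, the next model_read() still returns the rest of
that line before moving on, so the caller never sees a truncated
line.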
>>
>>
>> 4. Impact:
>>
>> 4.1 Overview:
>>
>>     The current API includes an ioctl for obtaining the number of
>>     mounted resources within the snapshot (MNTIOC_NMNTS) and
>>     another ioctl for obtaining the major and minor numbers for
>>     these resources (MNTIOC_GETDEVLIST). The first ioctl is used
>>     to obtain the size of an array to pass to the second ioctl.
>>     
>>     Following the proposed changes, MNTIOC_NMNTS will return the
>>     number of resources currently mounted by the kernel.
>>     However, many of the mounted resources are usually hidden;
>>     they never appear during a read() of /etc/mnttab, and are
>>     visible to ioctl() only when specifically requested.  The
>>     value returned by MNTIOC_NMNTS will therefore be viewed by
>>     the majority of consumers as an over-estimate of the number
>>     of mounted resources. In reality, the value obtained by
>>     MNTIOC_NMNTS will be defined as the upper-limit on the number
>>     of mounted resources, and should be used only to determine
>>     the length of the array passed to MNTIOC_GETDEVLIST.
>>     
>>     MNTIOC_GETDEVLIST will, following the proposed changes,
>>     populate the supplied array with the major and minor
>>     numbers of only those mounted resources that are
>>     visible to the user. This will typically leave many
>>     entries in the supplied array undefined. With the
>>     proposed changes, the MNTIOC_GETDEVLIST ioctl() itself
>>     will return the number of visible mounted resources,
>>     and hence the number of meaningful entries in the
>>     supplied array. In the current mntfs implementation,
>>     an ioctl() for MNTIOC_GETDEVLIST uses its return value
>>     only to indicate an error.
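
The calling pattern implied by the two ioctls can be sketched as
follows. This is a user-space model under stated assumptions, not
code against the real interface: struct res, model_nmnts(),
model_getdevlist() and visible_devs() are hypothetical names that
only imitate the proposed semantics of MNTIOC_NMNTS and
MNTIOC_GETDEVLIST.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a mounted resource. */
struct res {
    uint32_t major;
    uint32_t minor;
    int hidden;     /* nonzero if not normally visible to the user */
};

/* Models MNTIOC_NMNTS: an upper limit that also counts hidden resources. */
uint32_t
model_nmnts(const struct res *tab, size_t n)
{
    (void) tab;
    return ((uint32_t)n);
}

/*
 * Models MNTIOC_GETDEVLIST: writes one (major, minor) pair per visible
 * resource and returns the number of pairs written.
 */
int
model_getdevlist(const struct res *tab, size_t n, uint32_t *devs)
{
    int count = 0;

    for (size_t i = 0; i < n; i++) {
        if (tab[i].hidden)
            continue;
        devs[2 * count] = tab[i].major;
        devs[2 * count + 1] = tab[i].minor;
        count++;
    }
    return (count);
}

/*
 * Caller pattern: size the array from the upper limit, then trust only
 * the count returned by the second call. Returns -1 if allocation fails.
 */
int
visible_devs(const struct res *tab, size_t n, uint32_t **out)
{
    uint32_t limit = model_nmnts(tab, n);
    uint32_t *devs;
    int count;

    *out = NULL;
    if (limit == 0)
        return (0);
    devs = calloc(2 * (size_t)limit, sizeof (*devs));
    if (devs == NULL)
        return (-1);
    count = model_getdevlist(tab, n, devs);
    assert(count >= 0 && (uint32_t)count <= limit);
    *out = devs;
    return (count);
}
```

The caller owns the returned array and should read only the first
2 * count elements; anything beyond that corresponds to the
undefined tail described above.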
>>     
>>     In theory, then, this change introduces a backwards
>>     incompatibility: existing code that uses MNTIOC_NMNTS and
>>     then MNTIOC_GETDEVLIST to obtain the major and minor numbers
>>     of mounted resources will find that the last entries are
>>     meaningless. However, MNTIOC_GETDEVLIST has not worked since
>>     S10 FCS: it now returns nonsense, as described in 6814666.
>>     
>>     Implementing the proposed changes calls for additions to the
>>     zone_t and vfs_t structs. The zone_t will acquire a pointer
>>     to an avl_tree_t, and the vfs_t will acquire a pointer to a
>>     newly-defined structure. The purpose is to allow each vfs_t
>>     to be stored in an AVL tree, sorted by a unique
>>     high-resolution time. This is to allow rapid location of the
>>     next available vfs_t in the mnttab table. If its predecessor
>>     were unmounted then there would be no vfs_next pointer to
>>     follow, and a linear search would otherwise be required from
>>     the start of the circularly-linked list.
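
The lookup that the time-ordered tree makes cheap can be sketched
as follows. This is an illustration only: a sorted array with a
binary search stands in for the AVL tree, and struct mnt and
next_after() are hypothetical names rather than anything in the
proposed change.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a vfs_t keyed by its unique mount time. */
struct mnt {
    int64_t mtime;      /* high-resolution mount time, unique per entry */
    const char *mntpt;  /* mount point, for illustration only */
};

/*
 * Return the first entry whose mtime is strictly greater than 'after',
 * or NULL if there is none. 'tab' must be sorted by ascending mtime.
 * The entry that carried 'after' need not still be in the table.
 */
const struct mnt *
next_after(const struct mnt *tab, size_t n, int64_t after)
{
    size_t lo = 0, hi = n;

    assert(tab != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tab[mid].mtime <= after)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo < n ? &tab[lo] : NULL);
}
```

A reader only has to remember the mount time of the last entry it
returned. Even if that entry has since been unmounted, the search
finds its successor in logarithmic time without walking the list
from the start.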
>>     
>> 4.2 Interface changes:
>>
>>     1. The MNTIOC_GETDEVLIST command is modified so that the
>>        calling ioctl() returns the number of mounted resources
>>        represented in the supplied array, which is the same
>>        as the number of visible resources mounted on the system.
>>        This interface will be Uncommitted.
>>     2. The vfs struct acquires a new member, vfs_mntmeta, which
>>        is a pointer to a new, private structure with type
>>        'struct vfs_mntmeta'. The new member and the private
>>        structure will constitute a Private interface.
>>     3. The zone struct acquires a new member, zone_vfstree,
>>        which is a pointer to an avl_tree_t. The new member
>>        will constitute a Private interface.
>>
>>
>> 5. Release binding:
>>
>>     Patch.
>>
>>
>> 6. Documentation impact:
>>
>>     Changes to the mnttab(4) and getmntent(3C) man pages:
>>     
>> *** mnttab.old    Thu Jun 11 14:40:19 2009
>> --- mnttab.new    Thu Jun 11 14:38:09 2009
>> ***************
>> *** 47,66 ****
>>   IOCTLS
>>        The following ioctl(2) calls are supported:
>>   !      MNTIOC_NMNTS         Returns the count of mounted  resources
>> !                           in the current snapshot in the uint32_t
>> !                           pointed to by arg.
>>   !      MNTIOC_GETDEVLIST    Returns an array of uint32_t's that  is
>> !                           twice as long as the length returned by
>> !                           MNTIOC_NMNTS. Each pair of  numbers  is
>> !                           the  major  and minor device number for
>> !                           the file system  at  the  corresponding
>> !                           line   in   the   current   /etc/mnttab
>> !                           snapshot.  arg  points  to  the  memory
>> !                           buffer  to  receive  the  device number
>> !                           information.
>>          MNTIOC_SETTAG        Sets a tag word into the  options  list
>>                             for  a  mounted file system. A tag is a
>>                             notation  that  will  appear   in   the
>> --- 47,87 ----
>>   IOCTLS
>>        The following ioctl(2) calls are supported:
>>   !      MNTIOC_NMNTS         Obtains the upper limit on  the  number
>> !                           of  mounted  resources. arg points to a
>> !                           uint32_t; this will be set to the upper
>> !                           limit   on   the   number   of  mounted
>> !                           resources that will be identified by  a
>> !                           subsequent MNTIOC_GETDEVLIST.
>>   !      MNTIOC_GETDEVLIST    Obtains the actual  number  of  mounted
>> !                           resources,  together  with  their major
>> !                           and minor numbers.  arg  points  to  an
>> !                           array  of uint_ts that must be at least
>> !                           twice as long as the length obtained by
>> !                           MNTIOC_NMNTS.  The array will contain a
>> !                           pair  of  numbers  for   each   mounted
>> !                           resource,   comprising  its  major  and
>> !                           minor numbers.
>>   +                           A resource will not be  represented  in
>> +                           the  array  if it was mounted after the
>> +                           preceding MNTIOC_NMNTS command.  It  is
>> +                           an   error   to  use  MNTIOC_GETDEVLIST
>> +                           without having first used MNTIOC_NMNTS.
>> +
>> +                           The number of mounted  resources  actually
>> +                           represented  in the array will be
>> +                           returned by the call to ioctl() itself.
>> +                           The values of any remaining elements of
>> +                           the array are undefined.
>> +
>> +                           A  process   that   has   used   either
>> +                           MNTIOC_NMNTS  or MNTIOC_GETDEVLIST must
>> +                           call       resetmnttab(3C)       before
>> +                           getmntent(3C),    getextmntent(3C)   or
>> +                           getmntany(3C).
>> +        MNTIOC_SETTAG        Sets a tag word into the  options  list
>>                             for  a  mounted file system. A tag is a
>>                             notation  that  will  appear   in   the
>> ***************
>> *** 101,109 ****
>>                        location.
>>          EINVAL          The tag specified in  a  MNTIOC_SETTAG  call
>> !                      already  exists  as a file system option, or
>> !                      the tag specified in  a  MNTIOC_CLRTAG  call
>> !                      does not exist.
>>          ENAMETOOLONG    The tag specified in a MNTIOC_SETTAG call is
>>                        too  long  or  the  tag would make the total
>> --- 122,132 ----
>>                        location.
>>          EINVAL          The tag specified in  a  MNTIOC_SETTAG  call
>> !                      already  exists as a file system option, the
>> !                      tag specified in a MNTIOC_CLRTAG  call  does
>> !                      not exist or a request for MNTIOC_GETDEVLIST
>> !                      was  made  without  a  prior   request   for
>> !                      MNTIOC_NMNTS.
>>          ENAMETOOLONG    The tag specified in a MNTIOC_SETTAG call is
>>                        too  long  or  the  tag would make the total
>> ***************
>> *** 144,156 ****
>>        ments.
>>     NOTES
>> !      The snapshot of the mnttab information is taken any  time  a
>> !      read(2)  is  performed  at  offset  0 (the beginning) of the
>> !      mnttab file. The file modification time returned by  stat(2)
>> !      for  the  mnttab  file  is  the  time  of the last change to
>> !      mounted file  system  information.  A  poll(2)  system  call
>> !      requesting  a POLLRDBAND event can be used to block and wait
>> !      for the system's mounted file system information to be  dif-
>> !      ferent  from  the most recent snapshot since the mnttab file
>> !      was opened.
>>   --- 167,204 ----
>>        ments.
>>     NOTES
>> !      During a call to read(2) of /etc/mnttab,  the  corresponding
>> !      in-kernel  information cannot change. However, it will do so
>> !      between  successive  calls  to  read(2)  if,  for   example,
>> !      resources  are unmounted. The underlying file system, mntfs,
>> !      implements two features to ensure that /etc/mnttab will con-
>> !      tain  sensible  data  even  if  there are changes to the in-
>> !      kernel table of mounted resources.
>> !
>> !      Firstly, if a call to read(2) terminates only  part  of  the
>> !      way through a line, then the next call to read(2) will start
>> !      by reading the remainder of the interrupted  line,  even  if
>> !      the  corresponding resource has been unmounted in the inter-
>> !      vening time.
>> !
>> !      Secondly, successive calls to read(2) will  return  0  after
>> !      reading  the newest resource that was mounted at the time of
>> !      the first call to read(2), even if, in the intervening time,
>> !      additional   resources  have  been  mounted  and  are  still
>> !      present.
>> !
>> !      Following  a  rewind(3C)  of  /etc/mnttab,  or  a  call   to
>> !      resetmnttab(3C), the next call to read(2) will be considered
>> !      the first: any saved remainder will  be  discarded  and  all
>> !      resources  mounted  at  that time are eligible to be read by
>> !      subsequent calls to read(2). /etc/mnttab  does  not  support
>> !      the  use of a file offset for any purpose other than rewind-
>> !      ing the file.
>> !
>> !      The file modification  time  returned  by  stat(2)  for  the
>> !      mnttab  file  is the time of the last change to mounted file
>> !      system information.  A  poll(2)  system  call  requesting  a
>> !      POLLRDBAND  event  can  be  used  to  block and wait for the
>> !      system's mounted file system  information  to  be  different
>> !      from that at the time of the first read(2) of mnttab.
>>       
>> *** getmntent.old    Thu Jun 11 14:37:35 2009
>> --- getmntent.new    Thu Jun 11 14:41:24 2009
>> ***************
>> *** 40,51 ****
>>          Each getmntent() call causes a new line to be read from  the
>>        mnttab  file.  Successive  calls  can  be used to search the
>> !      entire list. The  getmntany()  function  searches  the  file
>> !      referenced  by  fp  until a match is found between a line in
>> !      the file and mpref. A match occurs if all  non-null  entries
>> !      in  mpref  match the corresponding fields in the file. These
>> !      functions do not open, close, or rewind the file.
>>       getextmntent()
>>        The getextmntent() function is an extended  version  of  the
>>        getmntent() function that returns, in addition to the infor-
>> --- 40,58 ----
>>          Each getmntent() call causes a new line to be read from  the
>>        mnttab  file.  Successive  calls  can  be used to search the
>> !      entire list, although mnttab entries  added  by  the  kernel
>> !      after the first call to getmntent() will be ignored. Follow-
>> !      ing a call to resetmnttab(), the next  call  to  getmntent()
>> !      will  be considered the first: all resources mounted at that
>> !      time will be eligible to be  read  by  subsequent  calls  to
>> !      getmntent().
>>   +      The getmntany() function searches the file referenced by  fp
>> +      until a match is found between a line in the file and mpref.
>> +      A match occurs if all non-null entries in  mpref  match  the
>> +      corresponding  fields  in  the  file. These functions do not
>> +      open, close, or rewind the file.
>> +     getextmntent()
>>        The getextmntent() function is an extended  version  of  the
>>        getmntent() function that returns, in addition to the infor-
>> ***************
>> *** 53,63 ****
>>        of  the  mounted  resource  to  which  the  line  in  mnttab
>>        corresponds. The getextmntent() function also fills  in  the
>>        extmntent  structure  defined  in the <sys/mnttab.h> header.
>> !      For getextmntent() to function properly, it must be notified
>> !      when  the  mnttab  file has been reopened or rewound since a
>> !      previous getextmntent() call.  This notification  is  accom-
>> !      plished  by  calling  resetmnttab().  Otherwise,  it behaves
>> !      exactly as getmntent() described above.
>>          The data pointed to by  the  mnttab  structure  members  are
>>        stored  in  a  static  area  and  must be copied to be saved
>> --- 60,67 ----
>>        of  the  mounted  resource  to  which  the  line  in  mnttab
>>        corresponds. The getextmntent() function also fills  in  the
>>        extmntent  structure  defined  in the <sys/mnttab.h> header.
>> !      Otherwise,  it  behaves  exactly  as  getmntent()  described
>> !      above.
>>          The data pointed to by  the  mnttab  structure  members  are
>>        stored  in  a  static  area  and  must be copied to be saved
>> ***************
>> *** 77,89 ****
>>        sition purposes.
>>       resetmnttab()
>> !      The resetmnttab() function notifies getextmntent() to reload
>> !      from  the  kernel the device information that corresponds to
>> !      the new snapshot of the mnttab information (see  mnttab(4)).
>> !      Subsequent   getextmntent()   calls   then   return  correct
>> !      extmnttab information. This function should be called  when-
>> !      ever  the  mnttab  file is either rewound or closed and reo-
>> !      pened before any calls are made to getextmntent().
>>     RETURN VALUES
>>     getmntent() and getmntany()
>> --- 81,91 ----
>>        sition purposes.
>>       resetmnttab()
>> !      The  resetmnttab()  function  causes  the   next   call   to
>> !      getmntent(),  getextmntent()  or  getmntany()  to  behave as
>> !      though /etc/mnttab had just been opened. In  addition,  this
>> !      function   will  have  a  similar  effect  on  read(2);  see
>> !      mnttab(4) for more details.
>>     RETURN VALUES
>>     getmntent() and getmntany()
>>     
>>
>> 7. References:
>>
>> 1. CR 6394241 mntfs is not exec safe
>>
>> 2. CR 6813502 mntfs will leak mappings when called from a forking MT
>> program.
>>
>> 3. CR 6814666 MNTIOC_GETDEVLIST produces nonsense
>>
>> 6. Resources and Schedule
>>     6.4. Steering Committee requested information
>>        6.4.1. Consolidation C-team Name:
>>         ON
>>     6.5. ARC review type: FastTrack
>>     6.6. ARC Exposure: open
>>
>>   
> 

-- 
blu

It's bad civic hygiene to build technologies that could someday be
used to facilitate a police state. - Bruce Schneier
----------------------------------------------------------------------
Brian Utterback - Solaris RPE, Sun Microsystems, Inc.
Ph:877-259-7345, Em:brian.utterback-at-ess-you-enn-dot-kom
