+1.

    -- Garrett

Brian Utterback wrote:
> I am sponsoring this fasttrack on behalf of Robert Harris. The
> timeout is set to 06/19/2009. Requested binding is patch. 
>
> Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
> This information is Copyright 2009 Sun Microsystems
> 1. Introduction
>     1.1. Project/Component Working Name:
>        Abandon the use of snapshots in mntfs.
>     1.2. Name of Document Author/Supplier:
>        Author:  Robert Harris
>     1.3  Date of This Document:
>       12 June, 2009
> 4. Technical Description
> 1. Proposal:
>
>       Abandon the use of snapshots in mntfs.
>
>
> 2. The Problem:
>
>       The contents of /etc/mnttab are created by mntfs on demand.
>       mntfs parses the in-kernel mnttab structures to create a
>       snapshot that can be used to satisfy subsequent calls to
>       read() or ioctl(). The snapshot is stored by the kernel
>       within the address space of the process that made the first
>       call to read() or ioctl(). The enclosing mapping is removed
>       from the calling process's address space by mntfs on last
>       close().
>
>       The snapshot-in-userland design has a flaw: the kernel cannot
>       determine whether or not a close() is a specific process's
>       last if the vnode count is greater than 1. This is because
>       there is no way to determine whether a count that is greater
>       than one has originated from dup(), from fork() or from
>       both.
>
>       This means that mntfs is unable to ensure that every
>       insertion of a mapping into a process's address space is
>       paired with a corresponding deletion. Two specific
>       manifestations are 6394241, in which a newly-execed process
>       has an arbitrary range of its address space unmapped by
>       mntfs, and 6813502, in which a process address space is
>       entirely consumed by orphaned mappings left behind by mntfs.
>
>
> 3. Solutions:
>
>       The most obvious solution seemed, at first, to involve
>       storing the snapshot data within the corresponding vnode,
>       thereby allowing the existing file system infrastructure to
>       free the resources when no longer required. This, however,
>       was rejected on account of complications inherent in the
>       unprivileged user's resulting ability to allocate and retain
>       kernel memory.
>
>       The only choice left has been to abandon the use of snapshots
>       in their current form. This necessitates some minor changes
>       to the behaviour of /etc/mnttab and its API, described in
>       mnttab(4) and getmntent(3C).
>
>       The current snapshot implementation means that, until a call
>       to close() or resetmnttab(), clients reading /etc/mnttab will
>       see those resources that were mounted at the time the
>       snapshot was created, i.e. at the first read() or ioctl().
>       Thus resources that have been unmounted in the intervening
>       time will still appear to be present.
>       
>       With the proposed changes, a process will not see any
>       resources that have been unmounted since the first call to
>       read() or ioctl(), with one exception: if a call to read()
>       terminates in the middle of a line, then the next read() will
>       be obliged to consume the remainder of that line, even if the
>       corresponding resource has been unmounted in the intervening
>       time. This prevents the possibility of seemingly-garbled
>       text.
>
>       Note that where the remainder of a line is stored for
>       possible later consumption, it is kept on the corresponding
>       vnode's private structure.
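
To make the saved-remainder rule concrete, here is a small user-space model of it in portable C. It is only a sketch under stated assumptions: fake_entry, fake_reader and fake_read() are hypothetical names invented for illustration, not mntfs code, and the model covers only the "finish the interrupted line" rule, not the cut-off at the newest resource mounted at first read.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define	MAXLINE	128

/* Hypothetical stand-in for one in-kernel mnttab entry. */
struct fake_entry {
	const char *line;	/* text of the mnttab line */
	int mounted;		/* zero once the resource is unmounted */
};

/* Hypothetical per-open state, analogous to the vnode's private data. */
struct fake_reader {
	size_t next;		/* index of the next entry to consider */
	char rem[MAXLINE];	/* saved tail of a partially read line */
	size_t remlen;		/* number of bytes held in rem */
};

/* Copy up to len bytes into buf; returns the number of bytes copied. */
static size_t
fake_read(struct fake_reader *r, const struct fake_entry *tab, size_t n,
    char *buf, size_t len)
{
	size_t done = 0;

	/* Drain any saved remainder first, whatever the mount state is. */
	if (r->remlen > 0) {
		size_t k = r->remlen < len ? r->remlen : len;

		memcpy(buf, r->rem, k);
		memmove(r->rem, r->rem + k, r->remlen - k);
		r->remlen -= k;
		done = k;
	}

	/* Then emit whole lines for entries that are still mounted. */
	while (done < len && r->next < n) {
		const struct fake_entry *e = &tab[r->next++];
		size_t l, k;

		if (!e->mounted)
			continue;
		l = strlen(e->line);
		assert(l < MAXLINE);
		k = l < len - done ? l : len - done;
		memcpy(buf + done, e->line, k);
		done += k;
		if (k < l) {
			/* Buffer full mid-line: keep the tail for later. */
			r->remlen = l - k;
			memcpy(r->rem, e->line + k, r->remlen);
		}
	}
	return (done);
}
```

In this model, a 4-byte read of a 10-byte line followed by an unmount still yields the remaining 6 bytes on the next read, and nothing more is returned for entries that have since gone away.
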
>
>
> 4. Impact:
>
> 4.1 Overview:
>
>       The current API includes an ioctl for obtaining the number of
>       mounted resources within the snapshot (MNTIOC_NMNTS) and
>       another ioctl for obtaining the major and minor numbers for
>       these resources (MNTIOC_GETDEVLIST). The first ioctl is used
>       to obtain the size of an array to pass to the second ioctl.
>       
>       Following the proposed changes, MNTIOC_NMNTS will return the
>       number of resources currently mounted by the kernel.
>       However, many of the mounted resources are usually hidden;
>       they never appear during a read() of /etc/mnttab, and are
>       visible to ioctl() only when specifically requested.  The
>       value returned by MNTIOC_NMNTS will therefore be viewed by
>       the majority of consumers as an over-estimate of the number
>       of mounted resources. In reality, the value obtained by
>       MNTIOC_NMNTS will be defined as the upper-limit on the number
>       of mounted resources, and should be used only to determine
>       the length of the array passed to MNTIOC_GETDEVLIST.
>       
>       MNTIOC_GETDEVLIST will, following the proposed changes,
>       populate the supplied array with the major and minor
>       numbers of only those mounted resources that are
>       visible to the user. Typically, therefore, many of the
>       last entries in the supplied array will be left
>       undefined. With the proposed changes, the
>       MNTIOC_GETDEVLIST ioctl() itself will return the number
>       of visible mounted resources, and hence the number of
>       meaningful entries in the supplied array. In the current
>       mntfs implementation, an ioctl() for MNTIOC_GETDEVLIST
>       does not employ its return value for anything other than
>       to indicate an error.
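
The consumer-side pattern this implies can be sketched in portable C as follows. This is an illustration only: fake_mount, fake_nmnts(), fake_getdevlist() and list_visible_devs() are hypothetical stand-ins that mimic the proposed semantics in user space, and none of it calls the real ioctl()s.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one mounted resource. */
struct fake_mount {
	uint32_t major;
	uint32_t minor;
	int hidden;	/* nonzero: never reported to ordinary consumers */
};

/* Mimics the proposed MNTIOC_NMNTS: an upper limit, hidden mounts included. */
static uint32_t
fake_nmnts(const struct fake_mount *tab, size_t n)
{
	(void) tab;
	return ((uint32_t)n);
}

/*
 * Mimics the proposed MNTIOC_GETDEVLIST: writes a major/minor pair for
 * each visible mount and returns the number of pairs written.
 */
static int
fake_getdevlist(const struct fake_mount *tab, size_t n, uint32_t *devs)
{
	int written = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (tab[i].hidden)
			continue;
		devs[2 * written] = tab[i].major;
		devs[2 * written + 1] = tab[i].minor;
		written++;
	}
	return (written);
}

/*
 * Consumer pattern: size the array from the upper limit, then trust only
 * the first (return value) pairs. Returns the pair count, or -1 on
 * allocation failure. The caller frees *out.
 */
static int
list_visible_devs(const struct fake_mount *tab, size_t n, uint32_t **out)
{
	uint32_t upper = fake_nmnts(tab, n);
	size_t slots = upper > 0 ? 2 * (size_t)upper : 1;
	uint32_t *devs = calloc(slots, sizeof (uint32_t));
	int valid;

	if (devs == NULL)
		return (-1);
	valid = fake_getdevlist(tab, n, devs);
	assert(valid >= 0 && (uint32_t)valid <= upper);
	*out = devs;
	return (valid);
}
```

The point of the sketch is that the array length comes from the upper limit, while the loop bound for reading results comes from the second call's return value; entries beyond that count must not be interpreted.
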
>       
>       In theory, then, this change introduces a backwards
>       incompatibility: existing code that uses MNTIOC_NMNTS and
>       then MNTIOC_GETDEVLIST to obtain the major and minor numbers
>       of mounted resources will find that the last entries are
>       meaningless. However, MNTIOC_GETDEVLIST has not worked since
>       S10 FCS: it now returns nonsense, as described in 6814666.
>       
>       Implementing the proposed changes calls for additions to the
>       zone_t and vfs_t structs. The zone_t will acquire a pointer
>       to an avl_tree_t, and the vfs_t will acquire a pointer to a
>       newly-defined structure. The purpose is to allow each vfs_t
>       to be stored in an AVL tree, sorted by a unique
>       high-resolution time. This is to allow rapid location of the
>       next available vfs_t in the mnttab table. If its predecessor
>       were unmounted then there would be no vfs_next pointer to
>       follow, and a linear search would otherwise be required from
>       the start of the circularly-linked list.
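
The lookup that the time-ordered tree makes cheap is "find the first vfs_t mounted after time T". A minimal model of that query in portable C is sketched below. It assumes a sorted array with binary search as a stand-in for the AVL tree, and fake_vfs and next_after() are hypothetical names, not the proposed kernel structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a vfs_t keyed by its unique mount time. */
struct fake_vfs {
	int64_t mnt_time;	/* unique high-resolution mount time */
	const char *mntpt;	/* mount point, for illustration */
};

/*
 * Return the first entry whose mount time is strictly greater than
 * 'after', or NULL if there is none. 'tab' must be sorted by ascending
 * mnt_time. The search is logarithmic, as a lookup in a balanced tree
 * would be, and it does not need the entry at 'after' to still exist.
 */
static const struct fake_vfs *
next_after(const struct fake_vfs *tab, size_t n, int64_t after)
{
	size_t lo = 0, hi = n;

	assert(tab != NULL || n == 0);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (tab[mid].mnt_time <= after)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (lo < n ? &tab[lo] : NULL);
}
```

A reader that remembers only the mount time of the last entry it returned can resume from there even when that entry has since been unmounted, which is the case the circular list handles poorly.
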
>       
> 4.2 Interface changes:
>
>       1. The MNTIOC_GETDEVLIST command is modified so that the
>          calling ioctl() returns the number of mounted resources
>          represented in the supplied array, which is the same
>          as the number of visible resources mounted on the system.
>          This interface will be Uncommitted.
>          
>       2. The vfs struct acquires a new member, vfs_mntmeta, which
>          is a pointer to a new, private structure with type
>          'struct vfs_mntmeta'. The new member and the private
>          structure will constitute a Private interface.
>          
>       3. The zone struct acquires a new member, zone_vfstree,
>          which is a pointer to an avl_tree_t. The new member
>          will constitute a Private interface.
>
>
> 5. Release binding:
>
>       Patch.
>
>
> 6. Documentation impact:
>
>       Changes to the mnttab(4) and getmntent(3C) man pages:
>       
> *** mnttab.old        Thu Jun 11 14:40:19 2009
> --- mnttab.new        Thu Jun 11 14:38:09 2009
> ***************
> *** 47,66 ****
>   IOCTLS
>        The following ioctl(2) calls are supported:
>   
> !      MNTIOC_NMNTS         Returns the count of mounted  resources
> !                           in the current snapshot in the uint32_t
> !                           pointed to by arg.
>   
> !      MNTIOC_GETDEVLIST    Returns an array of uint32_t's that  is
> !                           twice as long as the length returned by
> !                           MNTIOC_NMNTS. Each pair of  numbers  is
> !                           the  major  and minor device number for
> !                           the file system  at  the  corresponding
> !                           line   in   the   current   /etc/mnttab
> !                           snapshot.  arg  points  to  the  memory
> !                           buffer  to  receive  the  device number
> !                           information.
>   
>        MNTIOC_SETTAG        Sets a tag word into the  options  list
>                             for  a  mounted file system. A tag is a
>                             notation  that  will  appear   in   the
> --- 47,87 ----
>   IOCTLS
>        The following ioctl(2) calls are supported:
>   
> !      MNTIOC_NMNTS         Obtains the upper limit on  the  number
> !                           of  mounted  resources. arg points to a
> !                           uint32_t; this will be set to the upper
> !                           limit   on   the   number   of  mounted
> !                           resources that will be identified by  a
> !                           subsequent MNTIOC_GETDEVLIST.
>   
> !      MNTIOC_GETDEVLIST    Obtains the actual  number  of  mounted
> !                           resources,  together  with  their major
> !                           and minor numbers.  arg  points  to  an
> !                           array  of uint_ts that must be at least
> !                           twice as long as the length obtained by
> !                           MNTIOC_NMNTS.  The array will contain a
> !                           pair  of  numbers  for   each   mounted
> !                           resource,   comprising  its  major  and
> !                           minor numbers.
>   
> +                           A resource will not be  represented  in
> +                           the  array  if it was mounted after the
> +                           preceding MNTIOC_NMNTS command.  It  is
> +                           an   error   to  use  MNTIOC_GETDEVLIST
> +                           without having first used MNTIOC_NMNTS.
> + 
> +                           The number of mounted  resources  actu-
> +                           ally  represented  in the array will be
> +                           returned by the call to ioctl() itself.
> +                           The values of any remaining elements of
> +                           the array are undefined.
> + 
> +                           A  process   that   has   used   either
> +                           MNTIOC_NMNTS  or MNTIOC_GETDEVLIST must
> +                           call       resetmnttab(3C)       before
> +                           getmntent(3C),    getextmntent(3C)   or
> +                           getmntany(3C).
> + 
>        MNTIOC_SETTAG        Sets a tag word into the  options  list
>                             for  a  mounted file system. A tag is a
>                             notation  that  will  appear   in   the
> ***************
> *** 101,109 ****
>                        location.
>   
>        EINVAL          The tag specified in  a  MNTIOC_SETTAG  call
> !                      already  exists  as a file system option, or
> !                      the tag specified in  a  MNTIOC_CLRTAG  call
> !                      does not exist.
>   
>        ENAMETOOLONG    The tag specified in a MNTIOC_SETTAG call is
>                        too  long  or  the  tag would make the total
> --- 122,132 ----
>                        location.
>   
>        EINVAL          The tag specified in  a  MNTIOC_SETTAG  call
> !                      already  exists as a file system option, the
> !                      tag specified in a MNTIOC_CLRTAG  call  does
> !                      not exist or a request for MNTIOC_GETDEVLIST
> !                      was  made  without  a  prior   request   for
> !                      MNTIOC_NMNTS.
>   
>        ENAMETOOLONG    The tag specified in a MNTIOC_SETTAG call is
>                        too  long  or  the  tag would make the total
> ***************
> *** 144,156 ****
>        ments.
>   
>   NOTES
> !      The snapshot of the mnttab information is taken any  time  a
> !      read(2)  is  performed  at  offset  0 (the beginning) of the
> !      mnttab file. The file modification time returned by  stat(2)
> !      for  the  mnttab  file  is  the  time  of the last change to
> !      mounted file  system  information.  A  poll(2)  system  call
> !      requesting  a POLLRDBAND event can be used to block and wait
> !      for the system's mounted file system information to be  dif-
> !      ferent  from  the most recent snapshot since the mnttab file
> !      was opened.
>   
> --- 167,204 ----
>        ments.
>   
>   NOTES
> !      During a call to read(2) of /etc/mnttab,  the  corresponding
> !      in-kernel  information cannot change. However, it will do so
> !      between  successive  calls  to  read(2)  if,  for   example,
> !      resources  are unmounted. The underlying file system, mntfs,
> !      implements two features to ensure that /etc/mnttab will con-
> !      tain  sensible  data  even  if  there are changes to the in-
> !      kernel table of mounted resources.
> ! 
> !      Firstly, if a call to read(2) terminates only  part  of  the
> !      way through a line, then the next call to read(2) will start
> !      by reading the remainder of the interrupted  line,  even  if
> !      the  corresponding resource has been unmounted in the inter-
> !      vening time.
> ! 
> !      Secondly, successive calls to read(2) will  return  0  after
> !      reading  the newest resource that was mounted at the time of
> !      the first call to read(2), even if, in the intervening time,
> !      additional   resources  have  been  mounted  and  are  still
> !      present.
> ! 
> !      Following  a  rewind(3C)  of  /etc/mnttab,  or  a  call   to
> !      resetmnttab(3C), the next call to read(2) will be considered
> !      the first: any saved remainder will  be  discarded  and  all
> !      resources  mounted  at  that time are eligible to be read by
> !      subsequent calls to read(2). /etc/mnttab  does  not  support
> !      the  use of a file offset for any purpose other than rewind-
> !      ing the file.
> ! 
> !      The file modification  time  returned  by  stat(2)  for  the
> !      mnttab  file  is the time of the last change to mounted file
> !      system information.  A  poll(2)  system  call  requesting  a
> !      POLLRDBAND  event  can  be  used  to  block and wait for the
> !      system's mounted file system  information  to  be  different
> !      from that at the time of the first read(2) of mnttab.
>   
>       
> *** getmntent.old     Thu Jun 11 14:37:35 2009
> --- getmntent.new     Thu Jun 11 14:41:24 2009
> ***************
> *** 40,51 ****
>   
>        Each getmntent() call causes a new line to be read from  the
>        mnttab  file.  Successive  calls  can  be used to search the
> !      entire list. The  getmntany()  function  searches  the  file
> !      referenced  by  fp  until a match is found between a line in
> !      the file and mpref. A match occurs if all  non-null  entries
> !      in  mpref  match the corresponding fields in the file. These
> !      functions do not open, close, or rewind the file.
>   
>     getextmntent()
>        The getextmntent() function is an extended  version  of  the
>        getmntent() function that returns, in addition to the infor-
> --- 40,58 ----
>   
>        Each getmntent() call causes a new line to be read from  the
>        mnttab  file.  Successive  calls  can  be used to search the
> !      entire list, although mnttab entries  added  by  the  kernel
> !      after the first call to getmntent() will be ignored. Follow-
> !      ing a call to resetmnttab(), the next  call  to  getmntent()
> !      will  be considered the first: all resources mounted at that
> !      time will be eligible to be  read  by  subsequent  calls  to
> !      getmntent().
>   
> +      The getmntany() function searches the file referenced by  fp
> +      until a match is found between a line in the file and mpref.
> +      A match occurs if all non-null entries in  mpref  match  the
> +      corresponding  fields  in  the  file. These functions do not
> +      open, close, or rewind the file.
> + 
>     getextmntent()
>        The getextmntent() function is an extended  version  of  the
>        getmntent() function that returns, in addition to the infor-
> ***************
> *** 53,63 ****
>        of  the  mounted  resource  to  which  the  line  in  mnttab
>        corresponds. The getextmntent() function also fills  in  the
>        extmntent  structure  defined  in the <sys/mnttab.h> header.
> !      For getextmntent() to function properly, it must be notified
> !      when  the  mnttab  file has been reopened or rewound since a
> !      previous getextmntent() call.  This notification  is  accom-
> !      plished  by  calling  resetmnttab().  Otherwise,  it behaves
> !      exactly as getmntent() described above.
>   
>        The data pointed to by  the  mnttab  structure  members  are
>        stored  in  a  static  area  and  must be copied to be saved
> --- 60,67 ----
>        of  the  mounted  resource  to  which  the  line  in  mnttab
>        corresponds. The getextmntent() function also fills  in  the
>        extmntent  structure  defined  in the <sys/mnttab.h> header.
> !      Otherwise,  it  behaves  exactly  as  getmntent()  described
> !      above.
>   
>        The data pointed to by  the  mnttab  structure  members  are
>        stored  in  a  static  area  and  must be copied to be saved
> ***************
> *** 77,89 ****
>        sition purposes.
>   
>     resetmnttab()
> !      The resetmnttab() function notifies getextmntent() to reload
> !      from  the  kernel the device information that corresponds to
> !      the new snapshot of the mnttab information (see  mnttab(4)).
> !      Subsequent   getextmntent()   calls   then   return  correct
> !      extmnttab information. This function should be called  when-
> !      ever  the  mnttab  file is either rewound or closed and reo-
> !      pened before any calls are made to getextmntent().
>   
>   RETURN VALUES
>     getmntent() and getmntany()
> --- 81,91 ----
>        sition purposes.
>   
>     resetmnttab()
> !      The  resetmnttab()  function  causes  the   next   call   to
> !      getmntent(),  getextmntent()  or  getmntany()  to  behave as
> !      though /etc/mnttab had just been opened. In  addition,  this
> !      function   will  have  a  similar  effect  on  read(2);  see
> !      mnttab(4) for more details.
>   
>   RETURN VALUES
>     getmntent() and getmntany()
>       
>
> 7. References:
>
> 1. CR 6394241 mntfs is not exec safe
>
> 2. CR 6813502 mntfs will leak mappings when called from a forking MT program.
>
> 3. CR 6814666 MNTIOC_GETDEVLIST produces nonsense
>
> 6. Resources and Schedule
>     6.4. Steering Committee requested information
>       6.4.1. Consolidation C-team Name:
>               ON
>     6.5. ARC review type: FastTrack
>     6.6. ARC Exposure: open
>
>   

