I am sponsoring this fasttrack on behalf of Robert Harris. The
timeout is set to 06/19/2009. Requested binding is patch. 

Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
This information is Copyright 2009 Sun Microsystems
1. Introduction
    1.1. Project/Component Working Name:
         Abandon the use of snapshots in mntfs.
    1.2. Name of Document Author/Supplier:
         Author:  Robert Harris
    1.3  Date of This Document:
        12 June, 2009
4. Technical Description
1. Proposal:

        Abandon the use of snapshots in mntfs.


2. The Problem:

        The contents of /etc/mnttab are created by mntfs on demand.
        mntfs parses the in-kernel mnttab structures to create a
        snapshot that can be used to satisfy subsequent calls to
        read() or ioctl(). The snapshot is stored by the kernel
        within the address space of the process that made the first
        call to read() or ioctl(). The enclosing mapping is removed
        from the calling process's address space by mntfs on last
        close().

        The snapshot-in-userland design has a flaw: the kernel cannot
        determine whether or not a close() is a specific process's
        last if the vnode count is greater than 1. This is because
        there is no way to determine whether a count that is greater
        than one has originated from dup(), from fork() or from
        both.
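
        As an illustration only, the following C sketch models the
        ambiguity. The names model_file, add_ref, after_dup,
        after_fork and is_last_close_for are hypothetical and do not
        correspond to any kernel structure or function. The owner
        array holds exactly the information that mntfs does not
        have: the file system sees only the shared count.

```c
#include <assert.h>

#define MAXREF 8

/* Hypothetical model: the shared count is all that the file system sees. */
struct model_file {
    int count;          /* references from dup() and fork() combined */
    int owner[MAXREF];  /* pid behind each reference; hidden from the fs */
};

static void
add_ref(struct model_file *f, int pid)
{
    assert(f->count < MAXREF);
    f->owner[f->count++] = pid;
}

/* Process 100 opens the file and then calls dup(): two refs, one process. */
struct model_file
after_dup(void)
{
    struct model_file f = {0};
    add_ref(&f, 100);
    add_ref(&f, 100);
    return f;
}

/* Process 100 opens the file and then calls fork(): two refs, two processes. */
struct model_file
after_fork(void)
{
    struct model_file f = {0};
    add_ref(&f, 100);
    add_ref(&f, 101);
    return f;
}

/* Ground truth that needs the hidden owner data: is this pid's close its last? */
int
is_last_close_for(const struct model_file *f, int pid)
{
    int i, held = 0;

    for (i = 0; i < f->count; i++) {
        if (f->owner[i] == pid)
            held++;
    }
    return held == 1;
}
```

        In both scenarios the count is 2, yet a close() by process
        100 is its last only in the fork() case.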

        This means that mntfs is unable to ensure that every
        insertion of a mapping into a process's address space is
        paired with a corresponding deletion. Two specific
        manifestations are 6394241, in which a newly-execed process
        has an arbitrary range of its address space unmapped by
        mntfs, and 6813502, in which a process address space is
        entirely consumed by orphaned mappings left behind by mntfs.


3. Solutions:

        The most obvious solution seemed, at first, to involve
        storing the snapshot data within the corresponding vnode,
        thereby allowing the existing file system infrastructure to
        free the resources when no longer required. This, however,
        was rejected on account of complications inherent in the
        unprivileged user's resulting ability to allocate and retain
        kernel memory.

        The only choice left has been to abandon the use of snapshots
        in their current form. This necessitates some minor changes
        to the behaviour of /etc/mnttab and its API, described in
        mnttab(4) and getmntent(3C).

        The current snapshot implementation means that, until a call
        to close() or resetmnttab(), clients reading /etc/mnttab will
        see those resources that were mounted at the time the
        snapshot was created, i.e. at the first read() or ioctl().
        Thus resources that have been unmounted in the intervening
        time will still appear to be present.
        
        With the proposed changes, a process will not see any
        resources that have been unmounted since the first call to
        read() or ioctl(), with one exception: if a call to read()
        terminates in the middle of a line, then the next read() will
        be obliged to consume the remainder of that line, even if the
        corresponding resource has been unmounted in the intervening
        time. This prevents the possibility of seemingly-garbled
        text.

        Note that where the remainder of a line is stored for
        possible later consumption, it is kept on the corresponding
        vnode's private structure.
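
        The proposed read() behaviour can be sketched in userland C
        as follows. This is a simplified model under stated
        assumptions, not the mntfs implementation: mnt_entry,
        mnt_reader and model_read are hypothetical names, the table
        is a plain array, and the rem buffer stands in for the
        remainder kept on the vnode's private structure. The model
        ignores resources mounted after the first read.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one mounted resource; not the kernel's vfs_t. */
struct mnt_entry {
    const char *line;   /* full mnttab line, ending in '\n' */
    int mounted;        /* cleared once the resource is unmounted */
};

/* Hypothetical per-open state, standing in for the vnode's private data. */
struct mnt_reader {
    size_t next;        /* index of the next table entry to consider */
    char rem[128];      /* saved remainder of a partly-read line */
    size_t rem_len;     /* number of valid bytes in rem */
};

/* Model of read(): copy up to len bytes into buf; return the bytes copied. */
size_t
model_read(struct mnt_reader *r, const struct mnt_entry *tab, size_t ntab,
    char *buf, size_t len)
{
    size_t done = 0;

    /* Consume any saved remainder first, even if its resource has gone. */
    if (r->rem_len > 0) {
        size_t n = r->rem_len < len ? r->rem_len : len;

        memcpy(buf, r->rem, n);
        memmove(r->rem, r->rem + n, r->rem_len - n);
        r->rem_len -= n;
        done = n;
    }

    while (done < len && r->next < ntab) {
        const struct mnt_entry *e = &tab[r->next++];
        size_t l, n;

        if (!e->mounted)
            continue;   /* unmounted in the intervening time: skip */

        l = strlen(e->line);
        n = l < len - done ? l : len - done;
        memcpy(buf + done, e->line, n);
        done += n;

        if (n < l) {
            /* The read ended mid-line: keep the rest for the next call. */
            assert(l - n <= sizeof (r->rem));
            memcpy(r->rem, e->line + n, l - n);
            r->rem_len = l - n;
        }
    }
    return done;
}
```

        If a read stops part of the way through a line and that
        resource is then unmounted, the next call in this model
        still returns the rest of the line before moving on to the
        resources that remain mounted.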


4. Impact:

4.1 Overview:

        The current API includes an ioctl for obtaining the number of
        mounted resources within the snapshot (MNTIOC_NMNTS) and
        another ioctl for obtaining the major and minor numbers for
        these resources (MNTIOC_GETDEVLIST). The first ioctl is used
        to obtain the size of an array to pass to the second ioctl.
        
        Following the proposed changes, MNTIOC_NMNTS will return the
        number of resources currently mounted by the kernel.
        However, many of the mounted resources are usually hidden;
        they never appear during a read() of /etc/mnttab, and are
        visible to ioctl() only when specifically requested.  Most
        consumers will therefore see the value returned by
        MNTIOC_NMNTS as an over-estimate of the number of mounted
        resources. The value will accordingly be defined as an
        upper limit on the number of mounted resources, and should
        be used only to determine the length of the array passed
        to MNTIOC_GETDEVLIST.
        
        Following the proposed changes, MNTIOC_GETDEVLIST will
        populate the supplied array with the major and minor
        numbers of only those mounted resources that are visible
        to the user. This will typically leave many entries in
        the supplied array undefined. The MNTIOC_GETDEVLIST
        ioctl() itself will therefore return the number of
        visible mounted resources, and hence the number of
        meaningful entries in the supplied array. In the current
        mntfs implementation, the return value of an ioctl() for
        MNTIOC_GETDEVLIST is used only to indicate an error.
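
        A consumer of the proposed semantics might follow the
        pattern sketched below. So that the sketch can be exercised
        without mntfs, the ioctl is reached through a hypothetical
        function pointer (mnt_ioctl_fn), and SKETCH_NMNTS and
        SKETCH_GETDEVLIST are stand-ins for MNTIOC_NMNTS and
        MNTIOC_GETDEVLIST; real code would pass the constants from
        <sys/mntio.h> to ioctl(2). The element type uint32_t, the
        helper get_visible_devs and the stub's values are likewise
        assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-ins for MNTIOC_NMNTS and MNTIOC_GETDEVLIST (see <sys/mntio.h>). */
enum { SKETCH_NMNTS = 1, SKETCH_GETDEVLIST = 2 };

/* Hypothetical hook so that the sketch can run against a stub. */
typedef int (*mnt_ioctl_fn)(int fd, int cmd, void *arg);

/*
 * Obtain the major/minor pairs of the visible mounted resources.
 * On success, returns the number of pairs; *devs then holds that many
 * pairs (2 * n values) and must be freed by the caller. Returns -1 on error.
 */
int
get_visible_devs(mnt_ioctl_fn do_ioctl, int fd, uint32_t **devs)
{
    uint32_t upper = 0;
    uint32_t *raw;
    int n;

    assert(devs != NULL);
    *devs = NULL;

    /* Step 1: the upper limit, used only to size the array. */
    if (do_ioctl(fd, SKETCH_NMNTS, &upper) < 0)
        return -1;
    if (upper == 0)
        return 0;

    /* Two values (major, minor) per possible resource. */
    raw = calloc(2 * (size_t)upper, sizeof (uint32_t));
    if (raw == NULL)
        return -1;

    /* Step 2: the return value is the number of meaningful pairs. */
    n = do_ioctl(fd, SKETCH_GETDEVLIST, raw);
    if (n < 0 || (uint32_t)n > upper) {
        free(raw);
        return -1;
    }

    *devs = raw;    /* elements beyond 2 * n are not meaningful */
    return n;
}

/* Stub modelling the proposal: five resources mounted, two visible. */
int
stub_ioctl(int fd, int cmd, void *arg)
{
    uint32_t *p = arg;

    (void) fd;
    if (cmd == SKETCH_NMNTS) {
        *p = 5;
        return 0;
    }
    if (cmd == SKETCH_GETDEVLIST) {
        p[0] = 102; p[1] = 0;   /* first visible resource */
        p[2] = 102; p[3] = 7;   /* second visible resource */
        return 2;
    }
    return -1;
}
```

        The caller iterates over only the first n pairs, where n is
        the value returned by the second ioctl, rather than over the
        upper limit obtained from the first.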
        
        In theory, then, this change introduces a backwards
        incompatibility: existing code that uses MNTIOC_NMNTS and
        then MNTIOC_GETDEVLIST to obtain the major and minor numbers
        of mounted resources will find that the last entries are
        meaningless. However, MNTIOC_GETDEVLIST has not worked since
        S10 FCS: it now returns nonsense, as described in 6814666.
        
        Implementing the proposed changes calls for additions to the
        zone_t and vfs_t structs. The zone_t will acquire a pointer
        to an avl_tree_t, and the vfs_t will acquire a pointer to a
        newly-defined structure. The purpose is to allow each vfs_t
        to be stored in an AVL tree, sorted by a unique
        high-resolution time. This is to allow rapid location of the
        next available vfs_t in the mnttab table. If its predecessor
        were unmounted then there would be no vfs_next pointer to
        follow, and a linear search would otherwise be required from
        the start of the circularly-linked list.
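
        The lookup that the tree makes possible is a successor
        search: given the mount time of the last entry read, find
        the entry with the next greater time, whether or not the
        last entry still exists. The sketch below shows the idea
        with a sorted array and a binary search standing in for the
        AVL tree; mnt_node and next_after are hypothetical names,
        and the kernel would use its AVL routines instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a tree node keyed by a unique mount time. */
struct mnt_node {
    int64_t mtime;      /* unique high-resolution mount time */
    const char *mntpt;  /* mount point, for illustration only */
};

/*
 * Return the first node whose mount time is greater than 'after',
 * or NULL if there is none. 'nodes' must be sorted by ascending mtime.
 * The node with time 'after' need not be present any longer.
 */
const struct mnt_node *
next_after(const struct mnt_node *nodes, size_t n, int64_t after)
{
    size_t lo = 0, hi = n;

    assert(nodes != NULL || n == 0);

    /* Binary search for the smallest index with mtime > after. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (nodes[mid].mtime <= after)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo < n ? &nodes[lo] : NULL);
}
```

        Because the search is keyed on time rather than on a pointer
        to the predecessor, it gives the same answer whether or not
        the predecessor has been unmounted, and costs O(log n)
        rather than a walk from the start of the list.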
        
4.2 Interface changes:

        1. The MNTIOC_GETDEVLIST command is modified so that the
           calling ioctl() returns the number of mounted resources
           represented in the supplied array, which is the same
           as the number of visible resources mounted on the system.
           This interface will be Uncommitted.
           
        2. The vfs struct acquires a new member, vfs_mntmeta, which
           is a pointer to a new, private structure with type
           'struct vfs_mntmeta'. The new member and the private
           structure will constitute a Private interface.
           
        3. The zone struct acquires a new member, zone_vfstree,
           which is a pointer to an avl_tree_t. The new member
           will constitute a Private interface.


5. Release binding:

        Patch.


6. Documentation impact:

        Changes to the mnttab(4) and getmntent(3C) man pages:
        
*** mnttab.old  Thu Jun 11 14:40:19 2009
--- mnttab.new  Thu Jun 11 14:38:09 2009
***************
*** 47,66 ****
  IOCTLS
       The following ioctl(2) calls are supported:
  
!      MNTIOC_NMNTS         Returns the count of mounted  resources
!                           in the current snapshot in the uint32_t
!                           pointed to by arg.
  
!      MNTIOC_GETDEVLIST    Returns an array of uint32_t's that  is
!                           twice as long as the length returned by
!                           MNTIOC_NMNTS. Each pair of  numbers  is
!                           the  major  and minor device number for
!                           the file system  at  the  corresponding
!                           line   in   the   current   /etc/mnttab
!                           snapshot.  arg  points  to  the  memory
!                           buffer  to  receive  the  device number
!                           information.
  
       MNTIOC_SETTAG        Sets a tag word into the  options  list
                            for  a  mounted file system. A tag is a
                            notation  that  will  appear   in   the
--- 47,87 ----
  IOCTLS
       The following ioctl(2) calls are supported:
  
!      MNTIOC_NMNTS         Obtains the upper limit on  the  number
!                           of  mounted  resources. arg points to a
!                           uint32_t; this will be set to the upper
!                           limit   on   the   number   of  mounted
!                           resources that will be identified by  a
!                           subsequent MNTIOC_GETDEVLIST.
  
!      MNTIOC_GETDEVLIST    Obtains the actual  number  of  mounted
!                           resources,  together  with  their major
!                           and minor numbers.  arg  points  to  an
!                           array  of uint_ts that must be at least
!                           twice as long as the length obtained by
!                           MNTIOC_NMNTS.  The array will contain a
!                           pair  of  numbers  for   each   mounted
!                           resource,   comprising  its  major  and
!                           minor numbers.
  
+                           A resource will not be  represented  in
+                           the  array  if it was mounted after the
+                           preceding MNTIOC_NMNTS command.  It  is
+                           an   error   to  use  MNTIOC_GETDEVLIST
+                           without having first used MNTIOC_NMNTS.
+ 
+                           The number of mounted  resources  actu-
+                           ally  represented  in the array will be
+                           returned by the call to ioctl() itself.
+                           The values of any remaining elements of
+                           the array are undefined.
+ 
+                           A  process   that   has   used   either
+                           MNTIOC_NMNTS  or MNTIOC_GETDEVLIST must
+                           call       resetmnttab(3C)       before
+                           getmntent(3C),    getextmntent(3C)   or
+                           getmntany(3C).
+ 
       MNTIOC_SETTAG        Sets a tag word into the  options  list
                            for  a  mounted file system. A tag is a
                            notation  that  will  appear   in   the
***************
*** 101,109 ****
                       location.
  
       EINVAL          The tag specified in  a  MNTIOC_SETTAG  call
!                      already  exists  as a file system option, or
!                      the tag specified in  a  MNTIOC_CLRTAG  call
!                      does not exist.
  
       ENAMETOOLONG    The tag specified in a MNTIOC_SETTAG call is
                       too  long  or  the  tag would make the total
--- 122,132 ----
                       location.
  
       EINVAL          The tag specified in  a  MNTIOC_SETTAG  call
!                      already  exists as a file system option, the
!                      tag specified in a MNTIOC_CLRTAG  call  does
!                      not exist or a request for MNTIOC_GETDEVLIST
!                      was  made  without  a  prior   request   for
!                      MNTIOC_NMNTS.
  
       ENAMETOOLONG    The tag specified in a MNTIOC_SETTAG call is
                       too  long  or  the  tag would make the total
***************
*** 144,156 ****
       ments.
  
  NOTES
!      The snapshot of the mnttab information is taken any  time  a
!      read(2)  is  performed  at  offset  0 (the beginning) of the
!      mnttab file. The file modification time returned by  stat(2)
!      for  the  mnttab  file  is  the  time  of the last change to
!      mounted file  system  information.  A  poll(2)  system  call
!      requesting  a POLLRDBAND event can be used to block and wait
!      for the system's mounted file system information to be  dif-
!      ferent  from  the most recent snapshot since the mnttab file
!      was opened.
  
--- 167,204 ----
       ments.
  
  NOTES
!      During a call to read(2) of /etc/mnttab,  the  corresponding
!      in-kernel  information cannot change. However, it will do so
!      between  successive  calls  to  read(2)  if,  for   example,
!      resources  are unmounted. The underlying file system, mntfs,
!      implements two features to ensure that /etc/mnttab will con-
!      tain  sensible  data  even  if  there are changes to the in-
!      kernel table of mounted resources.
! 
!      Firstly, if a call to read(2) terminates only  part  of  the
!      way through a line, then the next call to read(2) will start
!      by reading the remainder of the interrupted  line,  even  if
!      the  corresponding resource has been unmounted in the inter-
!      vening time.
! 
!      Secondly, successive calls to read(2) will  return  0  after
!      reading  the newest resource that was mounted at the time of
!      the first call to read(2), even if, in the intervening time,
!      additional   resources  have  been  mounted  and  are  still
!      present.
! 
!      Following  a  rewind(3C)  of  /etc/mnttab,  or  a  call   to
!      resetmnttab(3C), the next call to read(2) will be considered
!      the first: any saved remainder will  be  discarded  and  all
!      resources  mounted  at  that time are eligible to be read by
!      subsequent calls to read(2). /etc/mnttab  does  not  support
!      the  use of a file offset for any purpose other than rewind-
!      ing the file.
! 
!      The file modification  time  returned  by  stat(2)  for  the
!      mnttab  file  is the time of the last change to mounted file
!      system information.  A  poll(2)  system  call  requesting  a
!      POLLRDBAND  event  can  be  used  to  block and wait for the
!      system's mounted file system  information  to  be  different
!      from that at the time of the first read(2) of mnttab.
  
        
*** getmntent.old       Thu Jun 11 14:37:35 2009
--- getmntent.new       Thu Jun 11 14:41:24 2009
***************
*** 40,51 ****
  
       Each getmntent() call causes a new line to be read from  the
       mnttab  file.  Successive  calls  can  be used to search the
!      entire list. The  getmntany()  function  searches  the  file
!      referenced  by  fp  until a match is found between a line in
!      the file and mpref. A match occurs if all  non-null  entries
!      in  mpref  match the corresponding fields in the file. These
!      functions do not open, close, or rewind the file.
  
    getextmntent()
       The getextmntent() function is an extended  version  of  the
       getmntent() function that returns, in addition to the infor-
--- 40,58 ----
  
       Each getmntent() call causes a new line to be read from  the
       mnttab  file.  Successive  calls  can  be used to search the
!      entire list, although mnttab entries  added  by  the  kernel
!      after the first call to getmntent() will be ignored. Follow-
!      ing a call to resetmnttab(), the next  call  to  getmntent()
!      will  be considered the first: all resources mounted at that
!      time will be eligible to be  read  by  subsequent  calls  to
!      getmntent().
  
+      The getmntany() function searches the file referenced by  fp
+      until a match is found between a line in the file and mpref.
+      A match occurs if all non-null entries in  mpref  match  the
+      corresponding  fields  in  the  file. These functions do not
+      open, close, or rewind the file.
+ 
    getextmntent()
       The getextmntent() function is an extended  version  of  the
       getmntent() function that returns, in addition to the infor-
***************
*** 53,63 ****
       of  the  mounted  resource  to  which  the  line  in  mnttab
       corresponds. The getextmntent() function also fills  in  the
       extmntent  structure  defined  in the <sys/mnttab.h> header.
!      For getextmntent() to function properly, it must be notified
!      when  the  mnttab  file has been reopened or rewound since a
!      previous getextmntent() call.  This notification  is  accom-
!      plished  by  calling  resetmnttab().  Otherwise,  it behaves
!      exactly as getmntent() described above.
  
       The data pointed to by  the  mnttab  structure  members  are
       stored  in  a  static  area  and  must be copied to be saved
--- 60,67 ----
       of  the  mounted  resource  to  which  the  line  in  mnttab
       corresponds. The getextmntent() function also fills  in  the
       extmntent  structure  defined  in the <sys/mnttab.h> header.
!      Otherwise,  it  behaves  exactly  as  getmntent()  described
!      above.
  
       The data pointed to by  the  mnttab  structure  members  are
       stored  in  a  static  area  and  must be copied to be saved
***************
*** 77,89 ****
       sition purposes.
  
    resetmnttab()
!      The resetmnttab() function notifies getextmntent() to reload
!      from  the  kernel the device information that corresponds to
!      the new snapshot of the mnttab information (see  mnttab(4)).
!      Subsequent   getextmntent()   calls   then   return  correct
!      extmnttab information. This function should be called  when-
!      ever  the  mnttab  file is either rewound or closed and reo-
!      pened before any calls are made to getextmntent().
  
  RETURN VALUES
    getmntent() and getmntany()
--- 81,91 ----
       sition purposes.
  
    resetmnttab()
!      The  resetmnttab()  function  causes  the   next   call   to
!      getmntent(),  getextmntent()  or  getmntany()  to  behave as
!      though /etc/mnttab had just been opened. In  addition,  this
!      function   will  have  a  similar  effect  on  read(2);  see
!      mnttab(4) for more details.
  
  RETURN VALUES
    getmntent() and getmntany()
        

7. References:

1. CR 6394241 mntfs is not exec safe

2. CR 6813502 mntfs will leak mappings when called from a forking MT program.

3. CR 6814666 MNTIOC_GETDEVLIST produces nonsense

6. Resources and Schedule
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
                ON
    6.5. ARC review type: FastTrack
    6.6. ARC Exposure: open

