Michael Corcoran wrote:
> On Mon, 2008-03-31 at 13:37 +0200, Roland Mainz wrote:
> > Rod Evans wrote:
> > > I'm sponsoring the following case for Mike Corcoran.  Time out 04/07/08.
> > >
> > > The case introduces a new system call, mmapfd(2).  This call is primarily
> > > targeted for use by ld.so.1(1), and provides for the efficient mapping of
> > > ELF files (and 4.x AOUT files).
> > >
> > > Release Binding:                 Patch/Micro
> > > mmapfd(2):                       Consolidation Private
> > >
> > > --------------------------------------------------------------------------
> > >
> > > 1. Introduction
> > >     1.1. Project/Component Working Name:
> > >          mmapfd: mmap file descriptor
> > [snip]
> > > 4. Technical Description:
> > >      4.1. Details:
> > >          mmapfd is a new system call which can interpret and map ELF and
> > >          AOUT (4.x) objects.  This system call allows the interpretation
> > >          and mapping of ELF and AOUT files to be carried out completely
> > >          by the kernel rather than by ld.so.1.
> >
> > Erm... the call seems to be ELF+AOUT-specific - why does it have such a
> > generic name ?
> 
> It has a generic name for future expansion.

Yeah... but |mmapfd()| sounds like it's for generic usage, and what's
described here is IMO very specific, so a longer (and more
specific) name may be nice...

> > The call can't handle other types of executables (e.g.
> > "javaexec" or "shbinexec")
> 
> Not yet, but that doesn't mean that it won't.  If there is a file format
> that needs to be interpreted in order to be mapped, this new syscall is
> a good way of going about doing that.

What is the exact definition of "interpreted" in this case - and why is
the kernel needed for this ? Just for sharing the interpreted data
between processes, e.g. so that multiple processes don't waste memory
on multiple copies of the same (interpreted) data ?

[snip]
> > ... what about renaming the call to |mmapexecfd()| (= "memory map of
> > executable fd") ?
>
> I like Darren's suggestion of mmapv(2) better so far.  Why limit us to
> only being able to interpret executable files?  What if there's some
> other non-executable file type that would naturally use this interface?

Mhhh... Ok... but |mmapv()| somehow sounds like a vector version of
|mmap()| (e.g. something which accepts an array of |mmap()| options and
handles them all in one single syscall) ...
... what about |mmapintp()| ("mmap interpreted") ?

> > >          mmapfd also provides for mapping a whole file, without
> > >          interpretation in a read only mode.
> >
> > What does that mean ? Can these data+MMU mappings be shared between
> > processes ?
>
> It means that you can map a file read-only without doing any
> interpretation of the file.  Thus if you passed in an ELF file without
> the MMFD_INTERPRET flag, the file would get mapped as a single read-only
> segment.

Ok...
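
For comparison, a minimal sketch of what I understand the non-interpreted
mode to amount to when written with plain |mmap()|: the whole file as one
private read-only segment. The helper name |map_file_readonly()| is made
up for illustration, and the equivalence itself is my assumption, not
something the case states.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the case): map the whole file at
 * "path" as a single private read-only segment using plain mmap(2).
 * On success returns the mapping address and stores its length in
 * *lenp; on failure returns MAP_FAILED.
 */
static void *
map_file_readonly(const char *path, size_t *lenp)
{
	struct stat st;
	void *addr = MAP_FAILED;
	int fd;

	assert(path != NULL && lenp != NULL);
	if ((fd = open(path, O_RDONLY)) < 0)
		return (MAP_FAILED);

	/* mmap() rejects a zero length, so treat an empty file as an error. */
	if (fstat(fd, &st) == 0 && st.st_size > 0) {
		*lenp = (size_t)st.st_size;
		addr = mmap(NULL, *lenp, PROT_READ, MAP_PRIVATE, fd, 0);
	}

	(void) close(fd);	/* the mapping stays valid after close() */
	return (addr);
}
```

The caller gets one address and one length back and releases the mapping
with a single |munmap()| call.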

> The data+MMU mappings would not be shared between processes.

Why ?

> > [snip]
> > > System Calls                                                     mmapfd(2)
> > >
> > >   NAME
> > >
> > >    mmapfd - map a file descriptor in the appropriate manner.
> > >
> > >   SYNOPSIS
> > >
> > >   #include <sys/mman.h>
> > >
> > >   int
> > >   mmapfd(int fd, uint_t flags, mmapfd_result_t *storage,
> > >         uint_t *elements, void *arg)
> >
> > Uhm... how do I unmap the mapping done by |mmapfd()| ?
> >
> Nico answered this in a subsequent post and you would use munmap(2) on
> each element in the "storage" array to unmap that segment.  I agree that
> this should be explicitly pointed out somewhere since it's not clear.

What about having a helper function to unmap the storage array in one
step ? IMO it would be nice, and it would maintain the Unix "tradition"
of symmetric function call pairs (e.g. |open()|+|close()|,
|malloc()|+|free()| etc.).
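
A rough sketch of such a helper, assuming it only has to call |munmap()|
on each element of the storage array as described above. The name
|munmapfd()| is invented here, the struct is a local stand-in with just
the two fields the loop needs, and the self-check maps /dev/zero because
|mmapfd()| itself does not exist yet.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Local stand-in for the proposed mmapfd_result_t (two fields only). */
typedef struct {
	void	*mr_addr;	/* mapping address (caddr_t in the case) */
	size_t	mr_msize;	/* mapping size */
} mmapfd_result_t;

/*
 * Hypothetical helper: munmap(2) every element of the storage array.
 * Keeps going after a failure; returns 0 if all calls succeeded and
 * -1 if at least one munmap() failed.
 */
static int
munmapfd(mmapfd_result_t *storage, unsigned int elements)
{
	unsigned int i;
	int rc = 0;

	assert(storage != NULL || elements == 0);
	for (i = 0; i < elements; i++) {
		if (munmap(storage[i].mr_addr, storage[i].mr_msize) != 0)
			rc = -1;
	}
	return (rc);
}

/* Usage example: build a two-element array by hand, then unmap it. */
static int
munmapfd_selftest(void)
{
	mmapfd_result_t storage[2];
	size_t pgsz = (size_t)sysconf(_SC_PAGESIZE);
	int fd = open("/dev/zero", O_RDWR);
	unsigned int i;

	if (fd < 0)
		return (-1);
	for (i = 0; i < 2; i++) {
		storage[i].mr_addr = mmap(NULL, pgsz, PROT_READ,
		    MAP_PRIVATE, fd, 0);
		storage[i].mr_msize = pgsz;
		if (storage[i].mr_addr == MAP_FAILED) {
			(void) close(fd);
			(void) munmapfd(storage, i);	/* undo earlier maps */
			return (-1);
		}
	}
	(void) close(fd);
	return (munmapfd(storage, 2));
}
```

Whether such a helper should stop at the first failure or continue, as it
does here, is a design choice the case would have to make.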

[snip]
> > >   TYPES USED
> > >
> > >    typedef struct {
> > >          caddr_t         mr_addr;         /* mapping address */
> > >          size_t          mr_msize;        /* mapping size */
> > >          size_t          mr_fsize;        /* file size */
> > >          size_t          mr_offset;       /* offset into file */
> > >          int             mr_prot;         /* the protections provided */
> >
> > Why is this |signed int| ?
>
> I agree it should be uint_t or even uchar_t.  Both appear to be commonly
> used throughout the kernel. uint_t is my preference here.

Ok...
 
> > >          uint_t          mr_flags;        /* info on the mapping */
> >
> > Please change this to |uint64_t| (e.g. it may be nice to have more flags
> > available by default).
> >
> I go back and forth on this.  uint64_t implies that 32 flags will not be
> enough.  32 seems like a lot

Yeah... but after some time the available flag bits get used up...

> but since this interface should be around
> for a long time, maybe 64 would be better since it will take a long time
> to use all 64 :)

IMO 64 is better (and IMO each system/library call used to create a
resource should take a "flags" option so that the call can be
changed/extended easily) ...

> > >    } mmapfd_result_t;
> > >
> > >    Values for mr_flags include:
> > >
> > >    MFD_ELF_HDR           0x1     /* the ELF header is mapped at mr_addr */
> > >    MFD_AOUT_HDR          0x2     /* the AOUT header is mapped at mr_addr */
> >
> > What about reserving the first four bits for |MFD_*_HDR| flags
> > (|MFD_ELF_HDR|, |MFD_AOUT_HDR| and two bits reserved for future
> > |MFD_*_HDR| flags) ?
>
> I can see the vanity reason for doing this, it's nice to clump things in
> groups of 4 with everything in the group being related, but at the same
> time, why the first 4 flags and not the first 8 or 16 or ...  I think
> densely packing the bits is easiest.
> I'm willing to move MFD_PADDING
> first in the list aka 0x1 so that future header types will follow
> numerically from the current header types.  I'll make this change since
> it is aesthetically more pleasing.

Ok...
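
Just to make the layout concrete, a small sketch of how the flags could
look with |MFD_PADDING| moved to 0x1, the header types following it and
|mr_flags| widened to 64 bits. The numeric values, |MFD_HDR_MASK| and
|mfd_has_header()| are my own illustration and not taken from the case.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed layout, for illustration only: MFD_PADDING first, header
 * types after it, all as 64-bit constants.
 */
#define	MFD_PADDING	UINT64_C(0x1)
#define	MFD_ELF_HDR	UINT64_C(0x2)
#define	MFD_AOUT_HDR	UINT64_C(0x4)

/* Hypothetical mask covering every header-type bit defined so far. */
#define	MFD_HDR_MASK	(MFD_ELF_HDR | MFD_AOUT_HDR)

/* Return nonzero if any header-type bit is set in flags. */
static int
mfd_has_header(uint64_t flags)
{
	/* The padding bit must stay outside the header mask. */
	assert((MFD_HDR_MASK & MFD_PADDING) == 0);
	return ((flags & MFD_HDR_MASK) != 0);
}
```

With a mask like this, a future header type only needs a new bit plus one
more term in |MFD_HDR_MASK|, regardless of where the bit sits.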

> > Finally... how can I _force_ the mapping to use something like 64k pages
> > by default ?
> 
> Using mpss.so.1 or ppgsz are ways to get the heap segment to be a
> specific size.  Other than that, there is no control for the page size
> used by the other segments and the kernel will pick the best size for
> the given platform.

Unfortunately (at least as tested with B72) the kernel seems to do a bad
job in this case (e.g. the stack won't be mapped with 64k pages unless
its size significantly exceeds 64k, and code doesn't get mapped
with 64k pages at all), even on platforms (like the T2000 we had at the
Oct 2007 Summit) where I would expect extensive use of 64k pages... ;-(
... IMO a manual control would be nice...

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) roland.mainz at nrubsig.org
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 7950090
 (;O/ \/ \O;)
