Since some confusion was expressed to me about the recent lib/csu/*/crt0.c 
changes, I figured I should explain more broadly what that code does (or 
is supposed to do) and why.

It's a bit obscure, but we can use more people understanding this stuff.  
For example, it would be nice if someone looked at teaching crt0 to 
relocate static PIE executables.  Anyway...



When we execve() an executable, the kernel has to do a bunch of things.  
Many of them deal purely with kernel data structures and can be
described in nice MI (machine-independent) ways, such as the closing
of FD_CLOEXEC fds in the process.  Even the setup of the process's
new address space is mostly MI, with the
layout of the stack being the one big exception.  For most archs, the 
stack grows down and we lay it out like this:

        struct ps_strings
        stack gap (random size + alignment)
        string of environment assignment n
        ...
        string of environment assignment 1
        string of environment assignment 0
        string of argument n
        ...
        string of argument 1
        string of argument 0
        auxv[n] (AUX_null)
        ...
        auxv[1] (AUX_phent)
        auxv[0] (AUX_phdr)
        NULL
        environ[n]
        ...
        environ[1]
        environ[0]
        NULL
        argv[n]
        ...
        argv[1]
        argv[0]
        word holding argc

For hppa, where the stack grows upwards, ps_strings and the gap are
at the bottom, but the rest is mostly unchanged.  The kernel
(kern/kern_exec.c and kern/exec_elf.c) handles the writing out of
all that information.
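
To make the layout concrete, here's a minimal sketch of how startup
code could recover argc, argv, the environment and the start of the
aux vector from a pointer to the "word holding argc".  It assumes the
downward-growing layout drawn above; the struct and function names
(start_args, parse_initial_stack) are made up for illustration and
are not what the tree uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: what a startup routine digs out of the initial stack. */
struct start_args {
	long	  argc;
	char	**argv;
	char	**envp;
	void	 *auxv;		/* first aux entry, left opaque here */
};

/*
 * sp points at the word holding argc, i.e. the lowest address in the
 * diagram.  Everything else is found by walking upwards from there.
 */
struct start_args
parse_initial_stack(char **sp)
{
	struct start_args a;
	char **p;

	assert(sp != NULL);
	a.argc = (long)(intptr_t)sp[0];	/* argc is stored as a plain word */
	a.argv = sp + 1;
	a.envp = a.argv + a.argc + 1;	/* skip argv[] and its NULL */

	/* The aux vector starts right after the NULL that ends environ. */
	for (p = a.envp; *p != NULL; p++)
		continue;
	a.auxv = p + 1;
	return a;
}
```

Handing this a fake "stack" array laid out as argc, argv[], NULL,
environ[], NULL, aux words is enough to try it out in userland.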

Okay, we've got the memory image; what should the registers be?  How
do we set the registers so that, from the C perspective, we end up
with a call to main(argc, argv, env)?  Note that the location of
main() varies from program to program and from invocation to
invocation.  And how do we get that to work in both the static and
dynamically linked cases?

The good news is that the ELF people have specs for this, but it's
still a bunch of work.  For each arch, the ELF "processor-specific
ABI" specifies some of the initial register values for the process,
such as the floating point state and the stack pointer.  But what's
the initial program counter / instruction pointer?

For static executables, that comes from the ELF header for the
process: the e_entry member is the address of the entry point of
the executable.  The kernel's MD setregs() routine initializes the
process's initial thread to start with that as its program counter.

The e_entry value is set by the linker to the value of the symbol
passed to it via the -e option, or via the ENTRY() declaration in
the linker script.  For almost all our archs that's "__start", with
vax and *ppc using "_start" instead.  (Yes, we would like to use
"__start" everywhere.)  The __start (or _start) routine is defined
by /usr/lib/crt0.o, which is automatically included in the link by
gcc.  That's currently compiled from /usr/src/lib/csu/${ARCH}/crt0.c
and generally consists of three chunks:
3) the environ and __progname globals, plus storage for __progname
   to point to
2) a C routine that
   - sets the environ and __progname globals from environment and
     arguments on the stack
   - optionally registers a cleanup function with atexit
   - in profiling builds:
       - registers a call to _mcleanup with atexit to finish profiling
       - enables profiling
   - calls the executable's own initializer (constructor) functions
   - invokes main() and passes the return value to exit()
1) on most platforms, an ASM stub for __start that maps the registers
   specified by the ELF processor-specific ABI doc into arguments that
   can be handled by the C routine above in (2).
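
As a rough illustration of chunks (3) and (2), here's a userland
sketch of the C routine.  It is not the code in lib/csu: the names
(sketch_start, sketch_environ, sketch_progname, demo_main) are
invented so they can't clash with libc, the profiling and constructor
steps are left out, and it returns main()'s value instead of passing
it to exit() so that it can be called from an ordinary program.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Chunk 3: the globals, plus storage for the program name. */
char **sketch_environ;
static char progname_storage[256];
char *sketch_progname = "";

typedef int (*main_fn)(int, char **, char **);

/*
 * Chunk 2: the C routine.  "cleanup" stands in for the pointer the
 * ABI says may be handed to the entry point; NULL means "none".
 */
int
sketch_start(int argc, char **argv, char **envp, void (*cleanup)(void),
    main_fn mainp)
{
	const char *s, *base;

	assert(mainp != NULL);
	sketch_environ = envp;

	/* Program name is the last path component of argv[0]. */
	if (argc > 0 && argv[0] != NULL) {
		base = argv[0];
		for (s = argv[0]; *s != '\0'; s++)
			if (*s == '/')
				base = s + 1;
		strncpy(progname_storage, base, sizeof(progname_storage) - 1);
		progname_storage[sizeof(progname_storage) - 1] = '\0';
		sketch_progname = progname_storage;
	}

	/* Optionally register the cleanup function with atexit. */
	if (cleanup != NULL)
		atexit(cleanup);

	/* The real routine would do exit(main(argc, argv, envp)). */
	return mainp(argc, argv, envp);
}

/* Stand-in for the program's main(), only here to exercise the sketch. */
int
demo_main(int argc, char **argv, char **envp)
{
	(void)argv;
	(void)envp;
	return argc + 40;
}
```

The ASM stub in (1) would be whatever per-arch glue is needed to turn
the initial registers and stack pointer into the arguments of a
routine like this one.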


For dynamic executables, it's a bit more complex: the executable
specifies an "interpreter" by including a PT_INTERP segment.  The
kernel sees that and *also* loads the interpreter into memory, and
then instead of starting the process at the e_entry of the executable,
it starts it at the e_entry *of the interpreter*.  The interpreter
then looks at the auxinfo entries on the stack to find the real
program and after doing whatever setup was requested, jumps to the
e_entry of the executable.
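
For a feel of the "looks at the auxinfo entries" step, here's a small
sketch of scanning an aux vector for one entry, such as the entry
point of the real program.  The types and ids below (sketch_auxinfo,
SK_AUX_*) are stand-ins I made up for the example, modeled on the
AUX_* names in the stack diagram; the real definitions and values
live in the kernel headers and may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative ids only; not the kernel's actual numbering. */
enum sketch_aux_id {
	SK_AUX_null = 0,	/* terminates the vector */
	SK_AUX_phdr,		/* address of the program headers */
	SK_AUX_phent,		/* size of one program header */
	SK_AUX_entry		/* entry point of the executable */
};

struct sketch_auxinfo {
	long		au_id;
	unsigned long	au_v;
};

/*
 * Walk the vector until the terminating null entry.  Returns 1 and
 * stores the value through valp if id is found, 0 otherwise.
 */
int
sketch_aux_lookup(const struct sketch_auxinfo *auxv, long id,
    unsigned long *valp)
{
	assert(auxv != NULL && valp != NULL);
	for (; auxv->au_id != SK_AUX_null; auxv++) {
		if (auxv->au_id == id) {
			*valp = auxv->au_v;
			return 1;
		}
	}
	return 0;
}
```

An interpreter would do a lookup like this for the entry point and,
once its own setup is done, jump there with the registers arranged
as the ABI requires.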

Now comes the tricky part: what if the interpreter wants to do
something on process *exit*?  For example, it may need to call
shared library destructors.  How can it get back control then?  The
good answer is that it should pass the e_entry routine of the
executable the function pointer of a callback to be invoked on
process exit.  And indeed, that's what the ELF processor ABIs
specify.  For example, the amd64 ABI says that on process startup,
register %rdx contains either zero or the address of a function to
be invoked at process exit.  It's the responsibility of the code
in crt0.c to pass that pointer to atexit() when non-zero, hence the
line item above saying that the code in crt0.c should "optionally
register a cleanup function with atexit".


Unfortunately, we don't actually do all that right now.  Our crt0's
didn't follow the ABI by registering the cleanup pointer passed to
them, and so our ld.so didn't expect to pass one.  Instead, our ld.so
directly peeks into the process's link map and looks for an atexit()
function, and then itself calls that function with its callback for
invoking the shared library destructors.  That has three downsides:
 1) it's not ABI compliant
 2) it fails if you build a dynamic executable that links libc
    statically and the atexit() function isn't pulled in during the
    link
 3) it's fucking gross

So, last year, kettenis fixed ld.so on each arch to set up the
registers when calling the executable e_entry to be as specified
by the ELF ABI docs, but passing NULL for the cleanup pointer.
(alpha got missed, but that's been fixed.)

So now that ld.so and the kernel are both passing NULL for the cleanup
pointer, we've been fixing crt0 to follow the ABI and pass that
pointer to atexit() when it's non-NULL (i.e., never).

Once all the supported binaries are built with a crt0 supporting
that (remember that crt0.o is statically linked into every executable),
we can change ld.so to stop calling atexit() itself and instead
pass its cleanup function to the executable's e_entry routine as
specified by the ABI.


Philip Guenther
