On Thu, Jan 4, 2024 at 4:30 AM Rob Landley <r...@landley.net> wrote:
> I note that I've written over a hundred lines of rant in response to his
> previous email already. I should dig back through this and turn it into proper
> documentation at some point. (Especially since Elliott knows more of this stuff
> than I do so I'm likely to get corrected a lot here...)
> On 1/2/24 20:54, enh wrote:
> >> You can look at /proc/self/maps (and /proc/self/smaps, and
> >> /proc/self/smaps_rollup) to see them for a running process (replace "self"
> >> with any running PID, self is a symlink to your current PID). The six
> >> sections are:
> >>
> >>   text - the executable functions: mmap(MAP_PRIVATE, PROT_READ|PROT_EXEC)
> >>   rodata - const globals, string constants, etc: mmap(MAP_PRIVATE, 
> >>   data - writeable data initialized to nonzero: mmap(MAP_PRIVATE, 
> >>   bss - writeable data initialized to zero: mmap(MAP_ANON, PROT_WRITE)
> >>   stack - function call stack, also contains environment data
> >>   heap - backing store for malloc() and free()
> >
> > (Android and modern linux distros require the relro section too.
> I thought that was only needed for dynamic linking? Then again you don't allow
> a lot of static stuff to run on the final system anyway...
> (The line between PIE and dynamic linking confuses even me. How does static
> relocate itself? I _think_ I looked it up once, but "it statically links in a
> dynamic linker in the pile of crt1.o and begin.o files" _can't_ be right...)
> > interestingly, there _is_ an elf program header for the stack, to
> > signal that you don't want an executable stack. iirc Android and [very
> > very recently] modern linux distros won't let you start a process with
> > an executable main stack, but afaik the code for the option no-one has
> > wanted/needed for a very long time is still in the kernel.)
> Cool.
> These days there's also vdso and vvar, which are provided by the kernel at
> runtime. The first is a .text section with magic functions you can call as an
> alternative to syscalls, and the second is a magic .rodata section that
> provides volatile variables the OS updates which you can just reach out and
> look at. Between the two of them you can do things like check the current
> timestamp without a system call. What they actually provide varies by OS (and
> then your libc has to be taught to use each new capability out of there instead
> of making the syscalls).
> "cat /proc/self/maps" and they're the last two entries if present.
> There is a "man 7 vdso" but I dunno how up to date it is. (Which gets us back
> to Michael Kerrisk's retirement and the new guy NOT MAINTAINING A WEB COPY.
> Grrr.)
> Maintaining backwards compatibility means keeping a lot of old stuff. I had a
> talk with Rich Felker last night on IRC about what musl-libc's syscall
> requirements actually _are_, and what it would take to repot it on top of a
> posix-ish RTOS du jour. (Makes the trusting trust cleansing cycle smaller if
> you can cross compile Linux from an RTOS...)

I did the "run linux-musl binaries on an RTOS" part a few years ago
and ended up with this list:


It's by no means exhaustive, but it was enough to run a useful set of
toybox toys, busybox's ash and enough other stuff to build a
commercial product running on an armv7-m (nommu) chip on top of it. I
had a risc-v port working and was in the middle of getting powerpc
(mmu) stuff running when circumstances changed and I had to move on.

I'm not sure how many more syscalls would be required to be able to
compile Linux, but probably not a whole lot.


> We didn't come to a conclusion, but I _did_ get permission from skarnet to use
> his git://git.skarnet.org/mdevd under 0BSD. (Porting that to toybox seems
> easier than bringing my old mdev code up to speed for all the
> https://github.com/slashbeast/mdev-like-a-boss stuff it's grown since I handed
> it off.)
> >> The first three of those literally exist in the ELF file, as in it mmap()s
> >> a block of data out of the file at a starting offset, and the memory is thus
> >> automatically populated with data from the file. The text and rodata ones
> >> don't really care if it's MAP_PRIVATE or MAP_SHARED because they can never
> >> write anything back to the file, but the data one cares that it's
> >> MAP_PRIVATE: any changes stay local and do NOT get written back to the file.
> >> And the bss is an anonymous mapping so starts zeroed, the file doesn't bother
> >> wasting space on a run of zeroes when the OS can just provide that on
> >> request. (It stands for Block Starting Symbol which I assume meant something
> >> useful 40 years ago on DEC hardware.)
> >
> > (close, but it was IBM and the name was slightly different:
> > https://en.wikipedia.org/wiki/.bss#Origin)
> That says United Aircraft Corporation named it using IBM 704 hardware in an
> assembler and then in fortran. (I only give wikipedia[citation needed] about an
> 80% chance to be accurate about any given fact, but am not root causing it
> right now. :)
> I like to track down magic acronyms, ala grep coming from the ed command
> g/re/p ("global regular expression print"). I once emailed Dennis Ritchie to
> ask what "inode" meant:
> https://lkml.iu.edu/hypermail/linux/kernel/0207.2/1182.html
> But in this case I stopped paying attention once I confirmed it doesn't mean
> anything of modern relevance.
> The interesting part (to me) is that the name predates unix by almost 20 years
> (mainframe legacy predating even the PDP-1), and predating ELF by 40 years.
> (The first OS with ELF binaries was Solaris 2.0 released in 1992. Linux
> switched over 3-4 years later.)
> If it wasn't a legacy acronym from shortly after world war II, it would
> probably be called something like the "zero section" and we wouldn't have to
> memorize what it means. :)
> >> The stack is also set up by the kernel, and is funny in three ways:
> >>
> >> 1) it has environment data at the end (so all your inherited environment
> >> variables, and your argv[] arguments, plus an array of pointers to the start
> >> of each string which is what char *argv[] and char *environ[] actually point
> >> to). The kernel's task struct also used to live there, but these days there's
> >> a separate "kernel stack" and I'd have to look up where things physically are
> >> now and what's user visible.
> >
> > (plus the confusingly named "ELF aux values", which come from the
> > kernel, and aren't really anything to do with ELF --- almost by
> > definition, since they're things that the binary _can't_ know like
> > "what's the actual page size of the system i'm _running_ on?" or
> > "what's the l1d cache size of the system i'm _running_ on?".)
> Are they in the stack? I know the pointer is passed to _start() (often not in a
> proper argument, in a REGISTER), but hadn't tracked down where it actually
> lived. Stack makes sense...
> Sadly, I have had to care about the auxiliary vector on far too many occasions:
> man 3 getauxval
> >> 3) The stack generally has _two_ pointers, a "stack pointer" and a "base
> >> pointer" which I always get confused. One of them points to the start of the
> >> mapping (kinda important to keep track of where your mappings are), and the
> >> other one moves (gets subtracted from and added to and offset to access local
> >> variables).
> >
> > (s/base pointer/frame pointer/ for everything except x86. and actually
> > _both_ change. it's the "base" of the current stack _frame_, not the
> > whole stack. for a concrete example: alloca() changes the stack
> > pointer, but not the frame pointer. so local variable offsets
> > relative to fp will be constant throughout the function, whereas
> > offsets relative to sp can change. [stacked values of] fp is also what
> > you're using when you're unwinding.)
> I only implemented alloca() for my tinycc fork on 32-bit x86, and that was back
> in 2008.
> I'm hoping to sit in on tonight's https://meet.jit.si/golug at 6pm about
> creating a compiler with a recursive descent parser, and someday hope to read
> https://norasandler.com/2017/11/29/Write-a-Compiler.html and the corresponding
> https://nostarch.com/writing-c-compiler and https://github.com/nlsandler/nqcc
> but right now restarting my https://landley.net/code/qcc is not even on the
> back burner...
> >> All this is ignoring dynamic linking, in which case EACH library has those
> >> first four sections (plus a PLT and GOT which have to nest since the shared
> >> libraries are THEMSELVES dynamically linked, which is why you need to run ldd
> >> recursively when harvesting binaries, although what it does to them at
> >> runtime I try not to examine too closely after eating). There should still
> >> only be one stack and heap shared by each process though.
> >
> > (one stack _per thread_ in the process. and the main thread stack is
> > very different from thread stacks.)
> A thread is a process with brain damage inherited from solaris' limitations,
> but you're right. I just mentally gloss over threads as "process with training
> wheels and 5x the debugging effort".
> Even before the ~7 year period where I thought java was a good idea, I had to
> use threading VERY EXTENSIVELY on OS/2. The "workplace shell" desktop was a
> single process with many, many threads, so any desktop programming there meant
> creating a shared library the workplace shell process would dlopen() and launch
> threads for. I got very, very good at debugging thread issues, once upon a
> time.
> (And I've debugged a lot of OTHER people's threading issues as a consultant.
> The oil exploration company that bought three different programs and mushed
> them together into a single highly threaded process that leaked like a sieve
> and segfaulted randomly. The 2018 project that replaced WinCE with Linux when
> microsoft end-of-lifed wince, resulting in an 80 thread application process,
> half of which were C# code running in mono and the other half were linux native
> code sharing the same address space, and the PROBLEM was on the ~200 mhz
> deployment hardware they had a warehouse full of and wanted to keep selling,
> fork() caused a 75 millisecond latency spike in EVERY OTHER THREAD because the
> kernel took one look at that mess and locked the whole vma until fork() had
> finished copying everything, which meant a thread spawning a child process
> caused the token-ring-like bus to timeout and drop connection. Which meant I
> got to do a real world use of vfork() on a system with an MMU, because that
> only suspends the PARENT thread, not all the other threads in the process, and
> vfork()/exec() isn't that much harder to program around than fork()/exec().)
> My modern reaction to dealing with threads is...
> https://www.youtube.com/watch?v=hlVwbpm4eHI
> They're SOMETIMES the right tool for the job? Occasionally? Maybe?
> >> If you launch dozens of instances of the same program, the read only
> >> sections (text and rodata) are shared between all the instances. (This is why
> >> nommu systems needed to invent fdpic: in conventional ELF everything uses
> >> absolute addresses, which is fine when you've got an MMU because each process
> >> has its own virtual address range starting at zero. (Generally libc or
> >> something will mmap() about 64k of "cannot read, cannot write, cannot
> >> execute" memory there so any attempt to dereference a NULL pointer segfaults,
> >> but other than that...)
> >>
> >> But shared libraries need to move so they can fit around stuff. Back in the
> >> a.out days each shared library was also linked at an absolute address (just
> >> one well above zero, out of the way of most programs), meaning when putting
> >> together a system you needed a registry of what addresses were used by each
> >> library, and you'd have to supply an address range to each library you were
> >> building as part of the compiler options (or linker script or however that
> >> build did it). This sucked tremendously.
> >
> > (funnily enough, this gets reinvented as an optimization every couple
> > of decades. iirc macOS has "prelinking" again, but Android is
> > currently in the no-prelinking phase of the cycle.)
> The old line about how there are two hard problems in computer science: naming
> things, cache invalidation, and fencepost errors. This falls under "cache
> invalidation", which more generically is "object lifetime rules".
> The really FUN one is the horrible trick people did on various embedded
> systems for fast boot, or on OpenVZ as part of the live migration, where they'd
> basically core dump a process, load it into a debugger, and resume. Thus
> skipping all the setup! (Assuming NOTHING HAS CHANGED in the context the
> resumed process expects around it. Luckily X11 has "detach and restart"
> plumbing that lets it reopen a process's network pipe without killing the
> window or the process, because network connections hanging and needing retry
> isn't a new thing.)
> Sigh, I did a whole rant about what would be involved in kernel upgrades
> without reboots way back in 2002:
> https://lkml.iu.edu/hypermail/linux/kernel/0206.2/0610.html
> https://lkml.iu.edu/hypermail/linux/kernel/0206.2/0835.html
> https://lkml.iu.edu/hypermail/linux/kernel/0206.2/1244.html
> And I was just going "this is _hard_" but people tracked me down from that and
> had me help IMPLEMENT some of that stuff over the years. The hard part was that
> processes act in GROUPS: parent/child relationships and pipelines and so on,
> and the kernel had no way to group processes. Enter "container" support, and me
> helping the parallels/OpenVZ guys explain _why_ the kernel could benefit from
> it. (The number of times I've been hired as a programmer and wound up spending
> most of my energy as a combination tech writer and marketer...)
> Sigh, I gotta go get on an airplane now, so stopping here for the moment...
> Rob
> _______________________________________________
> Toybox mailing list
> Toybox@lists.landley.net
> http://lists.landley.net/listinfo.cgi/toybox-landley.net