On Thu, Jan 4, 2024 at 4:30 AM Rob Landley <r...@landley.net> wrote:
>
> I note that I've written over a hundred lines of rant in response to his
> previous email already. I should dig back through this and turn it into proper
> documentation at some point. (Especially since Elliott knows more of this stuff
> than I do so I'm likely to get corrected a lot here...)
>
> On 1/2/24 20:54, enh wrote:
> >> You can look at /proc/self/maps (and /proc/self/smaps, and
> >> /proc/self/smaps_rollup) to see them for a running process (replace "self"
> >> with any running PID, self is a symlink to your current PID). The six
> >> sections are:
> >>
> >> text - the executable functions: mmap(MAP_PRIVATE, PROT_READ|PROT_EXEC)
> >> rodata - const globals, string constants, etc: mmap(MAP_PRIVATE, PROT_READ)
> >> data - writeable data initialized to nonzero: mmap(MAP_PRIVATE, PROT_WRITE)
> >> bss - writeable data initialized to zero: mmap(MAP_ANON, PROT_WRITE)
> >> stack - function call stack, also contains environment data
> >> heap - backing store for malloc() and free()
> >
> > (Android and modern linux distros require the relro section too.
>
> I thought that was only needed for dynamic linking? Then again you don't allow
> a lot of static stuff to run on the final system anyway...
>
> (The line between PIE and dynamic linking confuses even me. How does static PIE
> relocate itself? I _think_ I looked it up once, but "it statically links in a
> dynamic linker in the pile of crt1.o and begin.o files" _can't_ be right...)
>
> > interestingly, there _is_ an elf program header for the stack, to
> > signal that you don't want an executable stack. iirc Android and [very
> > very recently] modern linux distros won't let you start a process with
> > an executable main stack, but afaik the code for the option no-one has
> > wanted/needed for a very long time is still in the kernel.)
>
> Cool.
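
(For anyone following along: a rough sketch of pulling those mappings out of
/proc/self/maps. The names parse_maps_line() and guess_label() are made up for
this example, and the labels are only a guess from the permission bits and the
path. The kernel doesn't tag mappings as "text" or "rodata", and relro pages,
for instance, would show up here as "rodata".)

```python
import os


def parse_maps_line(line):
    """Split one /proc/<pid>/maps line into its fields."""
    # Format: start-end perms offset dev inode [path]
    fields = line.split(None, 5)
    addr, perms, offset, dev, inode = fields[:5]
    path = fields[5].strip() if len(fields) > 5 else ""
    start, end = (int(x, 16) for x in addr.split("-"))
    return {
        "start": start,
        "end": end,
        "perms": perms,
        "offset": int(offset, 16),
        "dev": dev,
        "inode": int(inode),
        "path": path,
    }


def guess_label(m):
    """Hypothetical heuristic: guess which kind of section a mapping holds."""
    path, perms = m["path"], m["perms"]
    if path.startswith("["):
        return path.strip("[]")  # [heap], [stack], [vdso], [vvar], ...
    if not path:
        return "anon"  # anonymous memory: bss, malloc arenas, thread stacks
    if "x" in perms:
        return "text"
    if "w" in perms:
        return "data"
    return "rodata"


if __name__ == "__main__" and os.path.exists("/proc/self/maps"):
    with open("/proc/self/maps") as f:
        for line in f:
            m = parse_maps_line(line)
            print(f'{m["start"]:x}-{m["end"]:x} {m["perms"]} '
                  f'{guess_label(m):8} {m["path"]}')
```

Run on Linux it prints one row per mapping of the python process itself;
replace "self" in the path with a PID to look at something else.
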
>
> These days there's also vdso and vvar, which are provided by the kernel at
> runtime. The first is a .text section with magic functions you can call as an
> alternative to syscalls, and the second is a magic .rodata section that
> provides volatile variables the OS updates which you can just reach out and
> look at.
>
> Between the two of them you can do things like check the current timestamp
> without a system call. What they actually provide varies by OS (and then your
> libc has to be taught to use each new capability out of there instead of making
> the syscalls).
>
> "cat /proc/self/maps" and they're the last two entries if present.
>
> There is a "man 7 vdso" but I dunno how up to date it is. (Which gets us back
> to Michael Kerrisk's retirement and the new guy NOT MAINTAINING A WEB COPY.
> Grrr.)
>
> Maintaining backwards compatibility means keeping a lot of old stuff. I had a
> talk with Rich Felker last night on IRC about what musl-libc's syscall
> requirements actually _are_, and what it would take to repot it on top of a
> posix-ish RTOS du jour. (Makes the trusting trust cleansing cycle smaller if
> you can cross compile Linux from an RTOS...)

I did the "run linux-musl binaries on an RTOS" part a few years ago and
ended up with this list:
https://github.com/apexrtos/apex/blob/master/sys/kern/syscall_table.c

It's by no means exhaustive, but it was enough to run a useful set of
toybox toys, busybox's ash and enough other stuff to build a commercial
product running on an armv7-m (nommu) chip on top of it.

I had a risc-v port working and was in the middle of getting powerpc
(mmu) stuff running when circumstances changed and I had to move on.

I'm not sure how many more syscalls would be required to be able to
compile Linux, but probably not a whole lot.

Patrick

> We didn't come to a conclusion, but I _did_ get permission from skarnet to use
> his git://git.skarnet.org/mdevd under 0BSD. (Porting that to toybox seems
> easier than bringing my old mdev code up to speed for all the
> https://github.com/slashbeast/mdev-like-a-boss stuff it's grown since I handed
> it off.)
>
> >> The first three of those literally exist in the ELF file, as in it mmap()s a
> >> block of data out of the file at a starting offset, and the memory is thus
> >> automatically populated with data from the file. The text and rodata ones
> >> don't really care if it's MAP_PRIVATE or MAP_SHARED because they can never
> >> write anything back to the file, but the data one cares that it's
> >> MAP_PRIVATE: any changes stay local and do NOT get written back to the file.
> >> And the bss is an anonymous mapping so starts zeroed, the file doesn't
> >> bother wasting space on a run of zeroes when the OS can just provide that on
> >> request. (It stands for Block Starting Symbol which I assume meant something
> >> useful 40 years ago on DEC hardware.)
> >
> > (close, but it was IBM and the name was slightly different:
> > https://en.wikipedia.org/wiki/.bss#Origin)
>
> That says United Aircraft Corporation named it using IBM 704 hardware in an
> assembler and then in fortran.
> (I only give wikipedia[citation needed] about an 80% chance to be accurate
> about any given fact, but am not root causing it right now. :)
>
> I like to track down magic acronyms, ala grep meaning "get regular expression".
> I once emailed Dennis Ritchie to ask what "inode" meant:
>
> https://lkml.iu.edu/hypermail/linux/kernel/0207.2/1182.html
>
> But in this case I stopped paying attention once I confirmed it doesn't mean
> anything of modern relevance.
>
> The interesting part (to me) is that the name predates unix by almost 20 years
> (mainframe legacy predating even the PDP-1), and predating ELF by 40 years.
> (The first OS with ELF binaries was Solaris 2.0 released in 1992. Linux
> switched over 3-4 years later.)
>
> If it wasn't a legacy acronym from shortly after world war II, it would
> probably be called something like the "zero section" and we wouldn't have to
> memorize what it means. :)
>
> >> The stack is also set up by the kernel, and is funny in three ways:
> >>
> >> 1) it has environment data at the end (so all your inherited environment
> >> variables, and your argv[] arguments, plus an array of pointers to the start
> >> of each string which is what char *argv[] and char *environ[] actually point
> >> to. The kernel's task struct also used to live there, but these days there's
> >> a separate "kernel stack" and I'd have to look up where things physically
> >> are now and what's user visible.
> >
> > (plus the confusingly named "ELF aux values", which come from the
> > kernel, and aren't really anything to do with ELF --- almost by
> > definition, since they're things that the binary _can't_ know like
> > "what's the actual page size of the system i'm _running_ on?" or
> > "what's the l1d cache size of the system i'm _running_ on?".)
>
> Are they in the stack? I know the pointer is passed to _start() (often not in a
> proper argument, in a REGISTER), but hadn't tracked down where it actually
> lived.
> Stack makes sense...
>
> Sadly, I have had to care about the auxiliary vector on far too many
> occasions:
>
> man 3 getauxval
>
> >> 3) The stack generally has _two_ pointers, a "stack pointer" and a "base
> >> pointer" which I always get confused. One of them points to the start of the
> >> mapping (kinda important to keep track of where your mappings are), and the
> >> other one moves (gets subtracted from and added to and offset to access
> >> local variables).
> >
> > (s/base pointer/frame pointer/ for everything except x86. and actually
> > _both_ change. it's the "base" of the current stack _frame_, not the
> > whole stack. for a concrete example: alloca() changes the stack
> > pointer, but not the frame pointer. so local variables offsets
> > relative to fp will be constant throughout the function, whereas
> > offsets relative to sp can change. [stacked values of] fp is also what
> > you're using when you're unwinding.)
>
> I only implemented alloca() for my tinycc fork on 32-bit x86, and that was back
> in 2008.
>
> I'm hoping to sit on tonight's https://meet.jit.si/golug at 6pm about creating
> a compiler with a recursive descent parser, and someday hope to read
> https://norasandler.com/2017/11/29/Write-a-Compiler.html and the corresponding
> https://nostarch.com/writing-c-compiler and https://github.com/nlsandler/nqcc
> but right now restarting my https://landley.net/code/qcc is not even on the
> back burner...
>
> >> All this is ignoring dynamic linking, in which case EACH library has those
> >> first four sections (plus a PLT and GOT which have to nest since the shared
> >> libraries are THEMSELVES dynamically linked, which is why you need to run
> >> ldd recursively when harvesting binaries, although what it does to them at
> >> runtime I try not to examine too closely after eating). There should still
> >> only be one stack and heap shared by each process though.
> >
> > (one stack _per thread_ in the process. and the main thread stack is
> > very different from thread stacks.)
>
> A thread is a process with brain damage inherited from solaris' limitations,
> but you're right. I just mentally gloss over threads as "process with training
> wheels and 5x the debugging effort".
>
> Even before the ~7 year period where I thought java was a good idea, I had to
> use threading VERY EXTENSIVELY on OS/2. The "workplace shell" desktop was a
> single process with many, many threads, so any desktop programming there meant
> creating a shared library the workplace shell process would dlopen() and launch
> threads for. I got very, very good at debugging thread issues, once upon a
> time.
> (And I've debugged a lot of OTHER people's threading issues as a consultant.
> The oil exploration company that bought three different programs and mushed
> them together into a single highly threaded process that leaked like a sieve
> and segfaulted randomly. The 2018 project that replaced WinCE with Linux when
> microsoft end-of-lifed wince, resulting in an 80 thread application process,
> half of which were C# code running in mono and the other half were linux native
> code sharing the same address space, and the PROBLEM was on the ~200 mhz
> deployment hardware they had a warehouse full of and wanted to keep selling,
> fork() caused a 75 millisecond latency spike in EVERY OTHER THREAD because the
> kernel took one look at that mess and locked the whole vma until fork() had
> finished copying everything, which meant a thread spawning a child process
> caused the token-ring-like bus to timeout and drop connection. Which meant I
> got to do a real world use of vfork() on a system with an MMU, because that
> only suspends the PARENT thread, not all the other threads in the process, and
> vfork()/exec() isn't that much harder to program around than fork()/exec().)
>
> My modern reaction to dealing with threads is...
>
> https://www.youtube.com/watch?v=hlVwbpm4eHI
>
> They're SOMETIMES the right tool for the job? Occasionally? Maybe?
>
> >> If you launch dozens of instances of the same program, the read only
> >> sections (text and rodata) are shared between all the instances. (This is
> >> why nommu systems needed to invent fdpic: in conventional ELF everything
> >> uses absolute addresses, which is fine when you've got an MMU because each
> >> process has its own virtual address range starting at zero. (Generally libc
> >> or something will mmap() about 64k of "cannot read, cannot write, cannot
> >> execute" memory there so any attempt to dereference a NULL pointer
> >> segfaults, but other than that...)
> >>
> >> But shared libraries need to move so they can fit around stuff. Back in the
> >> a.out days each shared library was also linked at an absolute address (just
> >> one well above zero, out of the way of most programs), meaning when putting
> >> together a system you needed a registry of what addresses were used by each
> >> library, and you'd have to supply an address range to each library you were
> >> building as part of the compiler options (or linker script or however that
> >> build did it). This sucked tremendously.
> >
> > (funnily enough, this gets reinvented as an optimization every couple
> > of decades. iirc macOS has "prelinking" again, but Android is
> > currently in the no-prelinking phase of the cycle.)
>
> The old line about how there are two hard problems in computer science: naming
> things, cache invalidation, and fencepost errors. This falls under "cache
> invalidation", which more generically is "object lifetime rules".
>
> The really FUN one is the horrible trick people did on various embedded systems
> for fast boot, or on OpenVZ as part of the live migration, where they'd
> basically core dump a process, load it into a debugger, and resume. Thus
> skipping all the setup!
> (Assuming NOTHING HAS CHANGED in the context the resumed process expects around
> it. Luckily X11 has "detach and restart" plumbing that lets it reopen a
> process's network pipe without killing the window or the process, because
> network connections hanging and needing retry isn't a new thing.)
>
> Sigh, I did a whole rant about what would be involved in kernel upgrades
> without reboots way back in 2002:
>
> https://lkml.iu.edu/hypermail/linux/kernel/0206.2/0610.html
> https://lkml.iu.edu/hypermail/linux/kernel/0206.2/0835.html
> https://lkml.iu.edu/hypermail/linux/kernel/0206.2/1244.html
>
> And I was just going "this is _hard_" but people tracked me down from that and
> had me help IMPLEMENT some of that stuff over the years. The hard part was that
> processes act in GROUPS: parent/child relationships and pipelines and so on,
> and the kernel had no way to group processes. Enter "container" support, and me
> helping the parallels/OpenVZ guys explain _why_ the kernel could benefit from
> it. (The number of times I've been hired as a programmer and wound up spending
> most of my energy as a combination tech writer and marketer...)
>
> Sigh, I gotta go get on an airplane now, so stopping here for the moment...
>
> Rob
> _______________________________________________
> Toybox mailing list
> Toybox@lists.landley.net
> http://lists.landley.net/listinfo.cgi/toybox-landley.net