On 1/2/24 16:58, Ray Gardner wrote: > On Mon, Jan 1, 2024 at 1:39 PM Rob Landley <r...@landley.net> wrote: >> ... [ a very long and detailed reply ] ... > > Rob, thank you for the "GIANT INFODUMP", and I mean that sincerely. It > took me a while to read it; it must have taken quite a while to write it.
It did, but you asked. And posting it to the list means I can refer back to it, and/or more people can learn it so they don't have to ask me. :) You know how I say I document compulsively? Combine stream of consciousness infodump with Pascal's Apology: https://www.npr.org/sections/13.7/2014/02/03/270680304/this-could-have-been-shorter And you get documentation. Editing it DOWN, figuring out a non-dupliciative sequence where I'm not assuming knowledge I haven't explained yet, and chopping it into bite-sized chunks, is the hard part. Blathering like this is easy. Turning into a FAQ entry or something is hard. > A lot of info on kernel-level memory management, I think I got about 90% > of it but I'll have to look up some stuff (PLT, GOT, ...). Procedure Linkage Table and Global Offset Table. The first tracks where dynamically linked functions live, the second tracks dynamically linked global variables live. Ok, take everything here with a grain of salt because I last had to know this in detail back around 2010 and I largely avoid dynamic linking when I can because is really messy. I am PROBABLY getting this wrong, but off the top of my head: [Note: Elliott started another thread while I was traveling with this half-finished, and he can correct most of the stuff I get wrong. I'm also pointing you at where the kernel code lives, and other references.] When you exec() a file, Linux checks the executable bit (if it's not executable it won't even try, and the suid and sgid bits get handled here too), and then does some simple type identification on it, which involves waving it at the "binary format loaders" to see if any claim it. (This is a bit like filesystem probe functions during mount, only for file data instead of block device data.) $ ls linux/fs/binfmt* linux/fs/binfmt_elf.c linux/fs/binfmt_flat.c linux/fs/binfmt_elf_fdpic.c linux/fs/binfmt_misc.c linux/fs/binfmt_elf_test.c linux/fs/binfmt_script.c (Sadly, these can all be kernel modules so you can DYNAMICALLY LOAD a BINARY FORMAT LOADER which is just wrong.) The main one that gets 90% of the use is binfmt_elf, the kernel's ELF executable loader. We'll come back to that. The "binfmt_script" one gets almost all of the rest of the use: it checks if the first two bytes of the file are #! and if so it re-runs the exec call with the /path/after/that as the new file argument, and inserting everything after the first space in that line as argv[1] with the remaining arguments (if any) bumped to argv[2] and friends. This is how shell scripts work, and the mechanism perl and python and so on inherited. It's also how you can use tinycc to run C as a scripting language with the first line being "#!/usr/bin/tinycc -run" which turns into "tinycc -run file.c" so it compiles, links, and executes it instead of writing it out to a file. And yes, it catches: $ echo \#\!$(readlink -f bang.sh) > bang.sh && chmod +x bang.sh && ./bang.sh bash: ./bang.sh: /home/landley/bang.sh: bad interpreter: Too many levels of symbolic links The elf_fdpic one is the nommu variant of elf, which REALLY SHOULD be a couple of if () statements in the same file but they did an ext2/ext3 thing and duplicated the file, but unlike deleting both of those and just having ext4 handle all three variants of the same format in modern systems, the linux-kernel guys never went back and cleaned that up because linux-kernel developmet is almost completely ossified and bureuacratically paralyzed these days. Oh well. You can ignore binfmt_flat as obsolete. It was the nommu fork of binfmt_aout which was the old executable format before everybody switched to ELF in 1996. There was a binfmt_aout.c which got removed in kernel commit 987f20a9dcce in 2022. I wrote more but am making it a FOOTNOTE. (See footnote.) People mostly stopped writing new ones once binfmt_misc was invented, because that sucker's programmable. It's basically a binfmt_script that can be programmed (via /proc) to recognize arbitrary file formats and run arbitrary commands to handle them: https://docs.kernel.org/admin-guide/binfmt-misc.html If you've ever run an arm binary on x86 and it magically called qemu application emulation for you, that's because some init script setup a binfmt_misc association to do that. I have no idea what binfmt_elf_test is, it was introduced recently (commit 9e1a3ce0a952 in 2022) and from the commit message the Linux Test Project people crapping unnecessary complexity into the mainline kernel for no obvious reason. It's the kernel equivalent of checking in debug printfs. Make an EFFORT to ignore that one, it's NOT REAL. Ok, so back to the ELF loader. We've more or less covered static linking earlier, where the loader parses the tables and does a bunch of mmap() and puts data in the right place and jumps to the program's _start symbol (actually the "entry point address" field in the initial 127 byte ELF header struct at the start of the file, but by default the linker will stick the address of the _start symbol in there. You can override it with "ld -e symbolname" if you really want to, but you're probably skipping various setup libc kind of expects if you do that. main() actually gets called from _start() and returns to it, and _start lives in crt1.o ala: readelf -a /usr/lib/*/crt1.o | less Anyway, tangent. Dynamically linked Elf binaries work a bit like binfmt_script or binfmt_misc above, in that instead of executing the binary directly like it does with static linking, for dynamic binaries binfmt_elf will find a special entry in the ELF headers that points at a DIFFERENT program, and runs that instead: $ readelf -a /bin/ls | grep Requesting [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2] This is called the "dynamic linker/loader" and has a man page: "man 8 ld.so". The one in glibc is INSANE. The one in musl-libc is... less insane. Did you know that the glibc's "ldd" program that lists the libraries an ELF file is linked against is ACTUALLY A SHELL SCRIPT, and all it really does is set the LD_TRACE_LOADED_OBJECTS environment variable? AND if you set that yourself you can't run any dynamic elf binaries which has been used in all SORTS of security exploits (the next command won't run, just spam some stuff to stdout and return success): $ LD_TRACE_LOADED_OBJECTS=1 /bin/ls linux-vdso.so.1 (0x00007ffd24b48000) libselinux.so.1 => /lib/x86_64-linux-gnu/libselinux.so.1 (0x00007fce46d98000) libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fce46bd8000) libpcre.so.3 => /lib/x86_64-linux-gnu/libpcre.so.3 (0x00007fce46b64000) libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fce46b5f000) /lib64/ld-linux-x86-64.so.2 (0x00007fce46fe2000) libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fce46b3e000) And yet despite that, /usr/bin/ldd is 193 lines long because GNU. Anyway, GOT and PLT are arrays assembled/populated by the dynamic linker when it's loading and resolving the dynamic symbols for a given program. It's the "I already loaded that, here's where it lives memory" table it adds stuff to as it grabs symbols. There is something called "lazy binding" which means it can defer loading symbols until they're accessed the same way the MMU can defer faulting in physical pages for a mapping, and I totally forget what that looks like in the GOT and PLT but you can set LD_BIND_NOW to force it to resolve everything up front. You can also point LD_PRELOAD at a (space or colon separated) list of libraries to load before loading any other library, which lets you override any function, which is how I used to get vim to STOP DOING GRATUITOUS SYNC CALLS EVERY 100 CHARACTERS I TYPE ala: $ cat /home/old/2012/thwim.c // Stub out all the "sync" variants, for projects that regularly pause // waiting for nonessential data to hit physical media. (On a loaded system, // this can easily be a 30 second wait.) // cc -fpic -shared thwim.c -o thwim.so // LD_PRELOAD=/usr/local/lib/thwim.so vim int fsync(int fd) { return 0; } int fdatasync(int fd) { return 0; } void sync(void) { return; } Oddly enough, I vaguely recall the PLT and GOT being new inventions (well, like 20 years ago now). Way back in the dark ages the linker (ld) would turn each reference to a given symbol into a linked list (where the pointer for the access wouldn't point to the symbol, but would point to the location of the NEXT access to it), with the ELF table entry for the unresolved dynamic symbol pointing to the first access (head of that particular linked list). Then the dynamic linker (ld.so) would create the .text mapping via mmap(MAP_PRIVATE, PROT_WRITE) and go through those linked lists at load time (lazy binding wasn't an option yet) to replace the address of each as yet unresolved jump instruction with where it had dynamically loaded that function or global variable, and then turn the mapping read-only when it was done (don't ask me how: neither mremap() nor madvise() currently offer an obvious way, possibly mmap(MAP_FIXED) over itself got special cased? This is back when the stack was still executable, and also setting up these mappings was black magic happening inside the dynamic linker back before I ever looked at its source...) The real problem with this approach is every time you modify a writeable MAP_PRIVATE page you dirty it, breaking the shared mapping and doing a copy-on-write to create your private copy. And since accesses to dynamic symbols were scattered all over the code, this dirtied a LOT of pages. I learned about this because embedded developers hated it. Even when a Linux desktop system only had 16 megs of ram nobody cared THAT much about an extra 128k of memory getting consumed by a process, but embedded systems cared about saving individual pages. This was one of the original reasons busybox was a single binary that could be statically linked, because then if the stars aligned you only needed THREE PAGES of memory to spawn a new instance of a command line "true". Dynamic linking was just way too expensive for embedded systems to use, because the binary may be smaller but the runtime memory usage was way bigger, due to breaking sharing on the .text pages. And this affects system performance by thrashing the CPU cache, or back at the time the memory bus. I wrote an explanation about this for The Motley Fool in a previous life (long story): https://www.fool.com/archive/portfolios/rulemaker/2000/02/23/inside-intel-again-cold-hard-cache.aspx That said, doing this _can_ still be an optimization, because with a PLT "struct walrus *potato;" it's actually (struct walrus *)got[POTATO_IDX] but you avoid an extra dereference by NOT bouncing off the PLT or GOT, and instead patching the source address to go directly to the destination. These days with instruction reordering and speculative execution down insanely deep pipelines what you'd actually be saving is L1 cache lines, and you probably have to benchmark it to see which is a win. To be honest I learned about this stuff by asking dumb questions on mailing lists over many years: https://www.uwsg.indiana.edu/hypermail/linux/kernel/0309.1/0716.html And by this point I expect Elliott has a backlog of facepalms from my attempts to explain... >> yes "ELF format" is like "ATM machine" > > where I use my PIN code? > > One bit I can contribute: BSS is an assembler directive dating back at > least to the 1960s and probably earlier (don't ask me how I know). It was > used to reserve uninitialized space; BSSZ was used to reserve space zeroed > out at load time. Don't know if it's in any current assemblers. > > I tried inserting a printf of sizeof(TT) and find that it does report only > the global size of my own toy. Because TT is #defined as the specific struct out of the union. You #define FOR_thingy and then #include "toys.h" and that (eventually) pulls in generated/flags.h which does: #ifdef FOR_acpi #define CLEANUP_acpi #ifndef TT #define TT this.acpi #endif ... #endif (Which is a bad example because it doesn't have a GLOBALS block so TT is defined to something that doesn't exist, but the command never uses it so that doesn't cause a problem...) Anyway, the first time TT is #defined to this.thingy, and "this" is the union at the end of generated/globals. In that file, each GLOBALS() block gets turned into a struct FILENAME_data { } block, and then the union at the end has an instance of that struct added to it, all next to each other. (This is generated by scripts/make.sh currently line 247-ish. Its severeal calls to sed, except macintosh sed is trash and instead of replacing it they install "gsed" alongside it, so $SED points to the usable sed. Sigh... > I should have tried that before I asked > about it, and looked at how TT is defined. (I was thinking it was the > entire "this" union but obviously it could not be, given how globals are > accessed in each toy. Braino...) It just means measuring that doesn't say how much space the union is actually using. The problem is generating these headers doesn't depend on the current .config: it goes through and seds out ALL the GLOBALS() blocks, and generates the corresponding struct definitions with an instance of each in the union, and that means the union size is the high water mark of every GLOBALS() block, which is currently "ip" out of pending even if it's not enabled. What I need to do is put USE_FILENAME() macros around each union instance. (The name of the file and the name of the first command in the file aren't exactly independent. I've tried to unstick them a bit, but different bits of plumbing look at different things and if the name of the first command in a file isn't the same as the name of the file, more than one thing already gets unhappy.) The cleanup when you do #define FOR_othercommand and #include "generated/flags.h" state transitions isn't perfect either: you'll notice above the #define TT has an #ifndef around it, it only does it the FIRST time. A file can only have one GLOBALS() block (which is why some of them, like the one in ps.c, has a union at the start where the different commands option strings can populate different arguments), and the struct created by that GLOBALS block is named after the file it's in. Which means the first #define FOR_filename is theoretically redundant, but the builtin __FILE__ macro has a path and extension on the filename, and the preprocessor isn't smart enough to do "basename -s .c" or similar like I can in shell script, and TRYING to get the preprocessor to do anything fancy is a bad idea. So I specify some things redundantly, by hand. Which have to match up with other things. First command needs to be named the same as the file it's in. Oh well, adding another of that doesn't make it WORSE... > I know you aren't too big on using "const", but you said (implied?) it > could put data into the rodata section. For example, would it be > beneficial to do this: > > static char const * const msg = "a message"; > static char const * const msgs[] = { "msg1", "msg2", 0 }; The first one, you don't need the pointer. The string constant already resolves to a pointer. The second one, the problem is it's still a named symbol. I think we went over that in another thread? (If not ask again, but I'm tried right now...) > Ray Footnote: binflat is to fdpic what a.out is to elf, and a.out is toast but flat still has a few users, mostly old hardware that hasn't implemented fdpic support yet. (Because it uses 3 extra registers, and each new target has to define a new ABI specifying which registers are used for which segment, and then tweak the compiler to output the right code, and for stuff like "coldfire" nobody seems to have bothered. (Coldfire was a nommu m68k variant with a small number of high-volume users, so it wasn't widely used but it shipped a LOT of units. Motorola sold its chip division to Freescale which was bought by NXP, thus https://en.wikipedia.org/wiki/NXP_ColdFire). The a.out format is what Bell Labs Unix used back in the 1970s. (If you didn't tell Dennis Ritchie's original C compiler what to call the output file it wrote to "a.out" as the default, and linkers still do that today.) And that format worked GREAT on a PDP-11 with static binaries, but trying to make dynamic linking work with it on 32 bit systems was a mess. Linux journal did several articles during the original a.out->ELF transition back in 1995: https://www.linuxjournal.com/article/1139 https://www.linuxjournal.com/article/1059 https://www.linuxjournal.com/article/1060 And the Linux Documentation project did a writeup: http://web.mit.edu/linux/redhat/redhat-3.0.3/i386/doc/HTML/ldp/ELF-HOWTO-1.html The libc5 (Linus's libc)->libc6 (Ulrich Drepper's "glibc 2.0" fork, http://freesoftwaremagazine.com/articles/history_of_glibc_and_linux_libc/) transition happened around the same time, although that was driven by adding thread support (because Java can't NOT use threads, because James Gosling refused to wrap poll and select so the only way you can do nonblocking I/O in Java is spawn a thread to block for you, which had BUCKETS of problems and the linux kernel guys spent a good decade trying to whack-a-mole them. Threads were invented because Solaris's fork() was very slow, as explained in the famous post where Sun engineer Bryan Cantrill convinced the entire Linux community that Sun was too dumb to live with one sentence: https://landley.net/history/mirror/linux/kissedagirl.html .) ELF was pretty much ubiquitous by Y2K, and Linux finally got around to officially deprecating the old one in 2019: https://www.linuxjournal.com/content/deprecating-aout-binaries But they didn't remove binflt because not every nommu system has implemented support for fdpic yet, because most embedded developers won't go near the wretched hive of scum and villainy that is the linux kernel mailing list, and half of them are still using the 2.6 kernel (or even 2.4, and in extreme cases 2.2 or 2.0) anyway. I've done a lot of work to try to make it POSSIBLE to use current stuff in that space, but I'm up against the fact that an allnoconfig 2.2 kernel was like 200k and an allnoconfig 6.0 kernel is well over a megabyte, so... If you want to know about the guts of ELF, the book "linkers and loaders" is still definitive (and BONE DRY. I had to read it long ago and it was SO BORING). I have a paper copy but the author made it available free online: https://www.iecc.com/linker/ Although you might want the book in a more modern format: https://github.com/last-genius/comp_arch_list/blob/master/books/Linkers%20and%20Loaders%20by%20John%20R.%20Levine.pdf A subset of the horrific standards documents are at: https://refspecs.linuxfoundation.org/ I had to read while trying to debug issues with a new dynamic linker implementation for a new architecture. It's also useful if you're maintaining your own tinycc fork (https://landley.net/hg/tinycc) or debugging the guts of qemu (https://lists.nongnu.org/archive/html/qemu-devel/2010-02/msg00973.html). _______________________________________________ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net