On 8/14/21 7:10 AM, Samanta Navarro wrote:
> Hi Rob,
>
> I hope that you have recovered from your sickness by now!
Alas, no. :( That's why I've been holding off on dealing with stuff like this: I wanted to give it more attention than I currently have, but eh, you work with the brain you've got, not the one you'd like. (Warning: Pascal's apology for writing a long letter because he didn't have the focus/spoons to write a short one: my primary failure mode is blathering/tabsplosion/tangents and I type fast.)

> On Mon, Aug 09, 2021 at 02:44:45AM -0500, Rob Landley wrote:
>> > The functions readall and writeall can return an error value by mistake
>> > if more than 2 GB of data are read or written.
>>
>> That was intentional. If your file sizes are that big we probably want
>> to mmap() stuff.
>
> The functions read and mmap have different use cases.

Yes, I know.

> You cannot mmap a pipe, a socket or any other form of byte streams.

Yes, I know.

>> Actually on 32 bit Linux architectures ssize_t is also long long because
>> "large file support" was introduced over 20 years ago:
>
> Did you mean off_t? The signed size_t type is 32 bit on 32 bit systems.
> But off_t depends on large file support compiled in. So it's sometimes
> 32 and sometimes 64 bit.

This is why I hate the magic macros: I always have to look up what they mean when. I use the real types wherever possible because then I DON'T have to track the weirdness. When dealing with C, I like to know what I'm doing.

>> So if we're changing the type it should change to long long
>
> I disagree here.

As do I: if the old one is sometimes int then the new one should be int, I.E. what it is now. I just need to audit the callers, and add error checking to some of them.

> First off, I would not recommend to use "long long"
> just because it's most of the times of the same size.

It's the 64 bit primitive integer type?

> The data types
> exist for a reason, the most important I think is the implied intention
> of their use cases.
Toybox relies on the LP64 memory model, as documented in
https://landley.net/toybox/design.html#portability

> Use size_t for memory operations.

No.

> Use off_t for file operations.

No. I am _aware_ of what the standards say, I just don't care about portability to Windows, and nothing else _isn't_ LP64 anymore. (Even Windows has added its own eniw implementation, that's wine spelled backwards with, according to Sam Vimes, the possible addition of eniru, to run linux binaries on windows.)

Linux is lp64, bsd is lp64, mac is lp64, android and ios are lp64, zseries is lp64, slowaris and aix and hp-ux and irix and OSF/1 all were lp64... There were a couple of historical oddballs that got it wrong, like the HAL port of Solaris that wasn't compatible with Sun's Solaris (and whose corpse was acquired by Fujitsu in 1993), or the truly INSANE 1990's Cray Unicos version that made even "short" 64 bits, which got replaced with an LP64 version over 20 years ago. But nothing since the whole Y2K frenzy that I am aware of.

You have indeed identified a real bug: read with a size range restricted to int can have callers asking for more data. My reply was that it never SHOULD have callers asking for more data (that's pilot error), and I need to figure out how to enforce that. My suggested FIX is to make sure the xread/xwrite callers are all ok with a 2 gigabyte granularity. What is the actual _problem_ with that fix?

(Elliott and I already had a tangent about readline(). Most of the other "process a buffer in a loop" code uses toybuf or libbuf, aka page size, which is cache friendly while avoiding the worst of the byte-at-a-time processing overhead.)

> Use
> long long if your prefered C standard is too old for int64_t or the
> API of library functions in use want long long.
Bionic:

libc/include/stdint.h:typedef long long __int64_t;
libc/include/stdint.h:typedef __int64_t int64_t;

Musl-libc:

include/alltypes.h.in:TYPEDEF _Int64 int64_t;
arch/i386/bits/alltypes.h.in:#define _Int64 long long
arch/aarch64/bits/alltypes.h.in:#define _Int64 long
arch/mips64/bits/alltypes.h.in:#define _Int64 long
arch/mips/bits/alltypes.h.in:#define _Int64 long long
arch/mipsn32/bits/alltypes.h.in:#define _Int64 long long
arch/powerpc64/bits/alltypes.h.in:#define _Int64 long
arch/s390x/bits/alltypes.h.in:#define _Int64 long
arch/powerpc/bits/alltypes.h.in:#define _Int64 long long
arch/x32/bits/alltypes.h.in:#define _Int64 long long
arch/x86_64/bits/alltypes.h.in:#define _Int64 long
arch/or1k/bits/alltypes.h.in:#define _Int64 long long
arch/sh/bits/alltypes.h.in:#define _Int64 long long
arch/xtensa/bits/alltypes.h.in:#define _Int64 long long
arch/microblaze/bits/alltypes.h.in:#define _Int64 long long
arch/arm/bits/alltypes.h.in:#define _Int64 long long

glibc is of course an ifdef salad that checks if sizeof(long) is 64 bits and uses that, else uses long long, because GNU being chock full of unnecessary code that does nothing (and often does nothing WRONG) is why busybox and uclibc and so on existed in the first place...

> Since read and write are used to operate on memory, size_t is the best
> choice.

Destination in memory and transaction size are two different things. Offset within the file is a third thing. Each could theoretically have a different type.

> Or ssize_t for included error handling. And this is exactly what
> the underlying C library functions do.

I first triaged the C spec in 1991 (yes, using Herbert Schildt's annotated version, it was all I could afford), because as a teenager I independently reinvented the concept of bytecode (not knowing about the half-dozen previous instances like Pascal p-code) and I was trying to come up with a C compiler that would produce bytecode running in a VM.
(This is why I first started reading the gcc source code, which was UNIMAGINABLY bad. I also looked at the "small C compiler" from Dr. Dobbs, but it wasn't load bearing. Years later I got involved in https://landley.net/hg/tinycc as vestigial momentum from that. And yes, bytecode function pointers and native function pointers would have to be two different types, so they would need annotating...)

Then I graduated and went to work at IBM porting OS/2 to the PowerPC, where a coworker introduced me to Java, and I pointed out that they'd missed "truncate", and Mark English of Sun replied that I'd just missed the 1.1 cutoff but he'd add it to Java 1.2 (and did). Meanwhile I sat down and started porting the "deflate" algorithm to Java (based on the info-zip code), and got the compression side working but not yet decompression when the next version came out with zlib as part of the Java standard library. So I'm used to doing work that gets abandoned or undone again, that's normal.

I've had a copy of https://landley.net/c99-draft.html on my own website for easy reference since the busybox days (so at least 15 years now). The header says it came from http://web.archive.org/web/20050207010641/http://dev.unicals.com/papers/c99-draft.html so that's probably about when I snapshotted it (Feb 2005, I tend to mirror locally when the original goes down).

I am entirely happy to have my opinions on this stuff changed by good arguments, but "reading the spec to me" is probably only going to provide new information if I _forgot_ something.

>> P.S. One of my most unsolvable todo items is what to do about readline() on
>> /dev/zero. If it's looking for \n it's just gonna allocate a bigger and
>> bigger buffer until it triggers the OOM killer. If a single line IS a
>> gigabyte long, what am I supposed to _do_ about it?
>
> I would say: Do what open does on a system without large file suport
> with large files: Return an error.

Elliott outranks you, and he basically called it pilot error.
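One way to enforce "pilot error" rather than feed the OOM killer would be a hard cap on line length: grow the buffer as usual, but bail with an error once a single line passes the limit. This is a sketch only, with a hypothetical helper name and an arbitrary caller-chosen cap, not toybox's actual readline logic:

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

// Read one \n-terminated line, growing the buffer in page-ish chunks,
// but return NULL (with errno = EMSGSIZE) instead of allocating without
// bound when a "line" exceeds cap bytes. Caller frees the result.
// Returns NULL at EOF with nothing read.
char *bounded_line(FILE *fp, size_t cap)
{
  size_t len = 0, size = 0;
  char *buf = 0, *tmp;
  int c;

  while ((c = fgetc(fp)) != EOF) {
    if (len+2 > size) {
      // Round the allocation up to the next 4k boundary.
      if (!(tmp = realloc(buf, size = (size|0xfff)+1))) {
        free(buf);
        return 0;
      }
      buf = tmp;
    }
    buf[len++] = c;
    if (c == '\n') break;
    if (len >= cap) {  // a gigabyte-long "line" is pilot error: give up
      free(buf);
      errno = EMSGSIZE;
      return 0;
    }
  }
  if (buf) buf[len] = 0;
  return buf;
}
```

Whether erroring out there is actually acceptable for every caller is exactly the "it depends on the application" problem discussed next.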
(And treating it as such is the status quo.) I couldn't figure out what to do about this when I WASN'T sick, so...

> And as it has been discussed in enh's thread: It depends on the
> application. Does it need random access after parsing? Can it have
> random access on the file? Is a streaming approach possible?

mmap() on a 32 bit processor won't help much because your VIRTUAL address space is capped between 1 and 4 gigs depending on which Linux memory model you compiled the kernel with. You can remap as you go, but your "one big line" physically CAN'T be longer than 4 gigabytes and still be seen as one block of data in memory. (And in theory there's both x32 and the arm equivalent, ala https://lkml.org/lkml/2018/5/16/207 so this isn't _entirely_ a historical point.)

The tool can't demand a use case, because I do:

  diff -u <(thingy) <(thingy2)

all the time. Even if diff input is USUALLY seekable/mappable, that isn't, and diff needs to work for that. The question is "what do I want/need to support" (or at least optimize for), and the answer sadly can't always be "everything". Having TWO codepaths to do the same thing in "fast" and "slow" cases is an invitation to bugs and something toybox tries to avoid. And yes, "tail" does that, and there's a comment apologizing for it, but the speed difference on tailing a 500 gigabyte file is big enough I couldn't get away with NOT doing it. (And yes, there were posts about it here in the archive and I probably blogged about it at the time, but it's one of them Kobiyashi Maru no win scenario things. You don't have to believe in them, reality is what persists when you STOP believing in it. Or possibly what happens while you're making other plans.)

Anyway, I dunno how to fix it and I'm out of brain.

> Sincerely,
> Samanta

Rob

P.S. The J.J. Abrams movies have done to ST2TWOK what the Hitchhiker's Guide movie did to the BBC miniseries, or what GPLv3 did to GPLv2.

P.P.S. Kobiyashi means "small forest" and Maru means "circle" in Japanese.
These days it's where you find dragon maids and/or a particularly photogenic cat.

_______________________________________________
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net