On 10/21/23 20:10, Ray Gardner wrote:
> I am working on a little something, and I found that output from my toybuf
> command redirected to a file is exceedingly slow. But if piped to cat, it
> runs about as fast as expected.
Linux pipe buffers consolidate transactions.

$ cat /proc/sys/fs/pipe-max-size
1048576

https://utcc.utoronto.ca/~cks/space/blog/unix/BufferedPipes

If you set O_DIRECT on the pipe (see man 2 pipe; you should also be able to
fcntl(fd, F_SETFL, O_DIRECT) an existing pipe) then it operates in "packet
mode" so the far end gets the same transaction sizes from read() that the
writer sent into the pipe. Which is important for the shell! Because
otherwise stuff like:

$ { echo one; echo two; } | { read i; echo shell got $i; echo head got; head -n 1; }
shell got one
head got
two

Can't do the handoff. (For a seekable file you can fseeko(ftello()) to undo
the FILE * internal buffer readahead, but for a nonseekable file you can't
put data BACK into the input, so you have to not get extra in the first
place.)

> So I messed around with the
> TOYFLAG_LINEBUF and with setvbuf() modes.
>
> I put a line at the top of my program so it does only this:
>
> for (i = 0; i < 1024 * 512; i++) xprintf("%s\n", "asdfasdf"); return;
>
> Then I tried putting just in front of it one of these:
>
> setvbuf(stdout, 0, _IOLBF, 0);
> setvbuf(stdout, 0, _IOFBF, 1024);
>
> Can anyone explain these odd timings?

With line buffering, encountering \n in the output stream flushes the
FILE * buffer, so you're making a write() syscall every 9 bytes of output
("asdfasdf" plus the newline). With an explicit buffer size of 1024 you're
making a write() syscall every 1024 bytes of output. (Meaning if you
printf("It worked\n") and then do something else for a while without
exiting or manually flushing, they won't get notified it worked.)

The 9 byte one is making about 114 times as many write() syscalls, each of
which has a kernel entry and (on modern systems that worry about heartbleep
and melthammer and so on) a page table swap. Cutting edge processors can
change the page tables WITHOUT flushing TLB...
which I think is one of the things that opened the darn sidechannel
information leaks in the first place, and they probably have some nonsense
explicitly disabling that. I watched them argue about it but didn't take
notes because it's chip-specific in ways you'd usually do a microcode
update for, but "not in this instance, captain"... As Scotty said about the
Excelsior, the more they overhaul the plumbing the easier it is to stop up
the drain.

Racehorses die if you give them water at the wrong temperature, Indy 500
cars are even finickier and have a nasty habit of dousing pit crews with
flaming methanol (which puts out light outside the human visible range when
it burns so you CAN'T SEE IT in daylight, so people have caught fire and
nobody around them could TELL for a while... As workplace safety issues go,
"methanol fires" is a good one to avoid.)

Sigh. Jon "maddog" Hall had a lovely anecdote about a program near the
start of his career running on a minicomputer saving output to reel-to-reel
tape, and how a transaction took many hours to run and ate a half-dozen
tapes, so somebody had to stay there and change tapes to let it complete.
And he looked at the code and found it was doing single byte writes of its
output, and each time the tape drive had to spin the tape up to speed
(skipping some blank tape to do this), write out the single byte surrounded
by "here's some data" and "that was the data" handshaking (bigger than the
byte), and then spin the tape drive back down to a stop (skipping more
blank tape). So even though memory was precious back then he added
something like a 64 byte output buffer so it wrote the data in blocks, and
the many-hour job suddenly took a few minutes and its output fit on one
tape.

I believe said anecdote was in one of his talks at LinuxWorld Expo 2001,
which I bought the audiocassettes of and should try to digitize someday
since I can't find them online...
I keep quoting "There are 2 hard problems in computer science: cache
invalidation, naming things, and off-by-1 errors", which is apparently from
Leon Bambrick. For posix systems, anyway. Over in the smp/threading/network
space Mathias Verraes offers:

"There are only two hard problems in distributed systems:
2. Exactly-once delivery
1. Guaranteed order of messages
2. Exactly-once delivery"

Anyway, the "cache invalidation" line is all this buffering nonsense. If
you don't buffer, it's slow. If you do buffer, you've opened a can of worms
and have to collate it. The buffer is never the right size, the buffer is
never flushed at the right time, information can get lost or duplicated
going into and coming out of the buffer...

I have this nasty habit of either wanting a black box that works and is NOT
MY PROBLEM, or to do it myself. Any black box I have to FIDDLE WITH is
black magic, and that's a skillset I prefer to admire from afar:

https://vimeo.com/74965870

> with TOYFLAG_LINEBUF and:
> setvbuf(stdout, 0, _IOLBF, 0);
> real 0m13.281s
> user 0m0.611s
> sys 0m2.486s
>
> with TOYFLAG_LINEBUF and:
> setvbuf(stdout, 0, _IOFBF, 1024);
> real 0m0.087s
> user 0m0.045s
> sys 0m0.004s

...

> Seems setvbuf(..., _IOLBF,...) alone is horribly slow, but with
> TOYFLAG_LINEBUF it runs about the same as no setvbuf() runs with or
> without TOYFLAG_LINEBUF. Using setvbuf(..., _IOFBF,...) without
> TOYFLAG_LINEBUF takes twice as long, but with TOYFLAG_LINEBUF it runs way
> faster (< 1 sec).
>
> These results were in Linux Mint. I got similar results with WSL Ubuntu,
> but more exaggerated (bad times were worse).
>
> Again, this is with output redirected to a file. Redirecting to /dev/null,
> or piping through cat or to a crc program that reads from stdin, they all
> run quite fast.
>
> What gives?

Sigh. Since I _have_ accumulated a lot of black magic knowledge over the
years (I just don't like to RELY on it, or inflict the need on other
people)...
You're using a variant of gnome terminal, aren't you? Gnome terminal (and
even quite distant derivatives that use its libraries) implements an
unbuffered blocking tty device. As in writes to gnome terminal block until
the recipient consumes them, which includes drawing and publishing the
relevant screen update. Meaning "make V=1 -j $(nproc)" running under gnome
terminal often runs significantly slower than the EXACT SAME BUILD piped
through tee, because that inserts a pipe buffer the unread data can
accumulate into. (The fact that tee blocks until all its outputs have
consumed the written data doesn't mean that the pipe INTO tee can't buffer
the data.)

That's a very small part of the reason I did
https://static.lwn.net/2002/0117/a/blueberry.php3 years ago, which was one
of those "don't ask questions, post errors" things that inspired the kernel
guys to redo the kernel build process so instead of spamming all the gcc
command lines to stdout it gave the modern summary, except they implemented
it in make instead of python, which is PROBABLY less bad but you'd need an
old priest and a young priest to get a reliable answer on that one. (Alas,
people fixing this these days tend to invoke threads, but... eh, better
than NOT fixing it.)

If you're wondering why so many of my build scripts pipe the output through
tee even when they don't necessarily have to... old habit.

Rob

_______________________________________________
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net