On 4/16/19 7:42 PM, enh wrote:
>> Line buffered is the behavior I've been assuming, but I see glibc defaults
>> to "unbuffered"... in which case what is FILE * _for_ again? A strrchr() on
>> each buffer with a flush to that point isn't that expensive, and most of
>> the stuff that cares about micromanaging buffers is already using
>> filehandles instead of FILE * anyway...
>
> i think the problem here is that you're looking at this through the
> wrong lens. i think your mental model is "interactive user", whereas
> i'm thinking of them *and* of scripts/builds.
I'm thinking of both, but I don't see how scripts would be significantly
penalized here?

> the traditional behavior of trusting the C library's buffering and
> having just one exception (that i've ever seen) in grep _is_ sane.
>
> i don't know where your misapprehensions about libc's behavior are
> coming from...

When I first met Denys Vlasenko in person (the guy I handed busybox
maintainership off to), it was because he attended a US conference talking
about how he'd fixed a glibc bug that caused statically linked binaries built
with --gc-sections not to flush stdout on exit.

https://elinux.org/images/2/2d/ELC2010-gc-sections_Denys_Vlasenko.pdf
https://bootlin.com/pub/video/2010/elc/elc2010-vlasenko-dead-code.ogv

(I hit it long before then, but he root caused it. It was of course an issue
where the glibc developers and the gcc developers were pointing at each other,
each telling the other to change, and as a result the issue didn't get fixed
for something like 5 years. Of course I think that was back when Ulrich
"static linking considered harmful" Drepper was still maintainer,
https://www.akkadia.org/drepper/no_static_linking.html. He was replaced by a
committee that broke up in 2012, https://lwn.net/Articles/488847/, but it's
still a gnu/dammit project and that's bad.)

The "tee" command broke so it only passed through a 4k block of data at a
time (rather than unbuffered output) _three_times_ over the years I used it.
As in broke, got fixed, and broke again, 3 times, all on Linux. And that's
not counting busybox breaking it and me fixing it there:

http://lists.busybox.net/pipermail/busybox/2004-May/045755.html

And yes, the fix was to replace fread(), which was doing 4k buffering, with
read(). Here's a quick grep of my blog of _just_ fighting with the "tee"
command over the years, not all of which was glibc of course:

https://landley.net/notes-2005.html#23-08-2005
https://landley.net/notes-2006.html#06-03-2007
https://landley.net/notes-2008.html#07-10-2008
https://landley.net/notes-2008.html#03-07-2008
https://landley.net/notes-2011.html#24-08-2011

The deepest dive to fix a tee issue slightly predates the current iteration
of my blog:

https://marc.info/?l=user-mode-linux-devel&m=111056135413631&w=2

And the most _recent_ time glibc broke this was 2017:

https://landley.net/notes-2017.html#08-04-2017

And yes, it was "stdout winds up fully buffered for some unknown reason, same
build against musl did not, fix it with gratuitous flushes"...

> they're not true of bionic, nor any version of glibc
> i've ever seen.

How about if I explicitly ask for _unbuffered_ in the common code (which
according to you should be a NOP) and then implement --line-buffered for
grep?

> POSIX explicitly says:
>
> When opened, the standard error stream is not fully buffered; the
> standard input and standard output streams are fully buffered if and
> only if the stream can be determined not to refer to an interactive
> device.

So they don't specify whether it's line buffered or unbuffered, just not
_fully_ buffered. And if stdout is fully buffered when you pipe it to tee you
get the 4k blocking problem again. So the bug kept showing up because the
standard said the bug should show up, great.

Except... the fix in the busybox case was on the _read_ side? It's fully
buffering INPUT by default too? That's INSANE. If 4k of data is waiting when
you read, you get/process 4k. If less is waiting, PROCESS WHAT YOU GOT...
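For reference, the shape of that fix (a minimal sketch in the spirit of the
busybox patch, not the actual code) is just swapping the stdio call for the
raw system call:

#include <unistd.h>

// fread(buf, 1, sizeof(buf), stdin) loops internally until it collects a
// full 4k or hits EOF, but read() returns as soon as _any_ data is
// available, so short writes from the far end of the pipe pass straight
// through instead of accumulating into 4k blocks.
int main(void)
{
  char buf[4096];
  ssize_t len;

  // Copy stdin to stdout (real tee also writes to each file in argv).
  while ((len = read(0, buf, sizeof(buf))) > 0)
    if (write(1, buf, len) != len) return 1;

  return len < 0;
}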
> that's the only behavior i've ever seen implemented, and that's the
> behavior you want. scripts/builds don't get pessimized performance,
> and interactive users don't have to wait for 1KiB/4KiB/8KiB.
>
> (of course, this assumes that the application isn't flushing way too
> often, which i claim is the real and only bug here.)

Which I just removed. I agree, it was flushing way too often, because I was
being more paranoid than necessary about error checking.

But "stdout is unbuffered" and "we flush(stdout) after each write" is the
_same_thing_, isn't it? If it's fully buffered, then "grep | tee" doesn't
show you lines as it finds them. (Not with all those extra flushes removed.)
If it's unbuffered, we might as well have left xprintf() flushing each write.

Fully buffered pipeline output means "watch" will never progressively display
anything unless it fakes a pty to disable that behavior. (Remember how I
rewrote watch so it restarts in a timely fashion rather than blocking
arbitrarily long waiting for output to finish and _then_ timing out? I'd have
to _undo_ that if I humor this tty weirdness; at 80x25=2000 characters, a
full screen is half this buffer...)

> strictly speaking, glibc has a helper program (stdbuf(1)) that lets
> you modify the buffering for the program it calls, but no other libc
> (that i know of) has anything similar, and i've never seen the glibc
> one used. as far as i know, the grep --line-buffered option, which is
> in both BSD and GNU, is the only place where a command-line utility
> explicitly offers fine control. and since Android's on-device grep has
> always been [Net]BSD and the build grep has been GNU grep, both
> device- and host-side expect this behavior, and control it where
> necessary with --line-buffered.

When you call write() it makes a system call, invokes the scheduler, copies
data between process contexts, often switches to another process that's been
waiting for the CPU longer, and triggers cache flushing all over the place
(especially with the spectre mitigations), so it can have a huge impact on
performance if your buffering granularity is wrong.

At LinuxWorld Expo 2001, Jon "Maddog" Hall told an anecdote in one of his
panels about an old mainframe batch job writing its output to reel-to-reel
tape a byte at a time, each time 1) spinning the tape up to speed, skipping
tape as it did, 2) writing a data packet with a header and footer around the
single byte of data, and 3) spinning the tape back down, skipping more tape.
This ate a dozen tapes and took hours with someone standing by to switch
tapes. He changed the code so it wrote out 64 bytes of data at a time, and
suddenly everything fit on a single tape and was done in 15 minutes, and best
of all it could run in the middle of the night with nobody standing by and
they could collect the tape in the morning.

I have encountered variants of this issue dozens of times over the years, in
many different contexts. It's why TCP/IP has Nagle's algorithm, which can be
sung to the dreidel song. It's one of the two hard problems in computer
science (naming things, cache invalidation, and fencepost errors).

What I'm trying to avoid is extra write() system calls and the resulting
context switches and cache flushing, which seem likely to penalize
performance far more than cache-local fiddling with an 8k buffer. The "line
buffered" mode boils down to strrchr('\n') on the data we just wrote, and if
we find one, flush() to that point on the way out. It should never cause
_more_ write() calls than unbuffered mode (except in the case where you
explicitly fflush() afterwards to output a partial line).
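In code, the idea is something like this (a hypothetical sketch, not toybox's
actual implementation: memrchr() stands in for strrchr() because the buffered
data isn't null terminated, and obuf/olen/out_line_buffered are names
invented for the example):

#define _GNU_SOURCE  // for memrchr() on glibc
#include <string.h>
#include <unistd.h>

static char obuf[8192];
static size_t olen;

// Append data to the output buffer, flushing through the last newline
// seen (or the whole buffer if it fills up without one). Error checking
// elided for brevity.
void out_line_buffered(int fd, char *data, size_t len)
{
  while (len) {
    size_t chunk = sizeof(obuf)-olen, flush;
    char *nl;

    if (chunk > len) chunk = len;
    memcpy(obuf+olen, data, chunk);
    olen += chunk;
    data += chunk;
    len -= chunk;

    nl = memrchr(obuf, '\n', olen);
    if (nl) flush = (nl+1)-obuf;
    else if (olen == sizeof(obuf)) flush = olen;
    else continue;

    write(fd, obuf, flush);
    memmove(obuf, obuf+flush, olen-flush);
    olen -= flush;
  }
}

One write() per burst of complete lines, and a partial line just waits in the
buffer until the next newline (or an explicit flush) arrives.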
So you still get the vast majority of the performance boost from _not_ having
multiple writes per line when grep is calling a bunch of little printf():
with an average line of 50 chars or so, that's still 1/50th as many system
calls, which should be the bulk of your performance gain.

> there are definitely places where traditional tools do something
> stupid, and no one actually wants the stupid, and i'm happy with those
> deviations. but this is a place where i think toybox is behaving
> differently to no advantage and some disadvantage, whereas stdio left
> to its own devices (without constant hammering of the flush button)
> does a good enough job for both distinct use cases.

Sigh. My next guess would be a --set-block-buffered option for grep (which
can call setvbuf(_IOFBF) to put it back easily enough; there's a quick
setvbuf() sketch at the end of this message), but that's not what the other
one has and I doubt you want that.

The thing is, even if grep were special-cased to fully buffer into pipelines
by default (and thus piping the output to tee would be interactively broken),
fully buffering is seriously _not_ a performance boost in a lot of cases. In
my linux source git checkout I just did (using the ubuntu 14.04 versions of
everything):

$ time bash -c "grep -ir max_range --line-buffered | head -n 2"
kernel/dma/debug.c:	unsigned int max_range = dma_get_max_seg_size(ref->dev);
kernel/dma/debug.c:	while (range <= max_range) {

real	0m3.289s
user	0m3.268s
sys	0m0.022s

$ time bash -c "grep -ir max_range | head -n 2"
kernel/dma/debug.c:	unsigned int max_range = dma_get_max_seg_size(ref->dev);
kernel/dma/debug.c:	while (range <= max_range) {

real	9m5.821s
user	4m8.223s
sys	0m7.350s

Yes, that grepped through the .git directory (with history back to 0.0.1),
but it's still 3 seconds vs 9 minutes on a 2.4GHz Core i7, because it chugs
through gigabytes more input confirming there aren't more hits before
producing _any_ output. (And that was the _second_ run of the big one, to let
the cache populate so you don't blame my spinning rust. The short one was
uncached.)

On an SMP system, the faster the other program can receive the data, the
faster it can start processing it. The kernel guys redid the pipe
infrastructure extensively a few years back to mitigate the cost of
ping-ponging between processes, and going from 1 byte transfers to ~60 byte
transfers should be the bulk of the improvement in terms of context
switching. (Heck, the old ATM network protocol used 53 byte cells carrying 48
bytes of payload, and I think the maddog anecdote was 64 byte packets? I have
the audio tape of all the panels somewhere but haven't listened to them in
years, and I don't have a cassette player handy.) So waiting for 4k buffers
can be a net _loss_ if you're getting less parallelism.

Have you benchmarked the performance penalty here, or are you just going
based on the gnu/grep man page's admonition? If you care strongly, I can do
an if (CFG_TOYBOX_ON_ANDROID) to disable it for you, but then interactive use
would suck and it's unlikely to be a measurable performance win in your
scripts. And I'd much rather do a --block-buffered command line option to
_enable_ it in cases where it turns out to be a thing...

(Also, I'm doing line buffered _input_ and line-at-a-time regex. If grep is a
serious performance bottleneck for you, other grep implementations do block
buffered grep input and work out line breaks after they find matches. That's
probably more of a win than changing output buffering now that we're not
flushing multiple times per line. And yes, xprintf() plus xputc('\n') was
flushing at least twice per line before, my bad. Fixed now.)
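As promised above, the setvbuf() knob a hypothetical --block-buffered option
would turn (set_output_buffering and its flag are invented names for this
sketch; note setvbuf() must be called before any other I/O on the stream):

#include <stdio.h>

// Select stdout buffering before producing any output.
void set_output_buffering(int line_buffered)
{
  if (line_buffered) setvbuf(stdout, NULL, _IOLBF, 0);  // flush at each '\n'
  else setvbuf(stdout, NULL, _IOFBF, 4096);  // wait for a full 4k block
  // _IONBF would send every stdio call straight to write().
}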
Rob

_______________________________________________
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net