(sorry. busy.)

On Tue, Apr 16, 2019 at 9:50 PM Rob Landley <r...@landley.net> wrote:
>
> On 4/16/19 7:42 PM, enh wrote:
> >> Line buffered is the behavior I've been assuming, but I see glibc defaults to "unbuffered"... in which case what is FILE * _for_ again? A strrchr() on each buffer with a flush to that point isn't that expensive, and most of the stuff that cares about micromanaging buffers is already using filehandles instead of FILE * anyway...
> >
> > i think the problem here is that you're looking at this through the wrong lens. i think your mental model is "interactive user", whereas i'm thinking of them *and* of scripts/builds.
>
> I'm thinking of both, but didn't see how scripts should be significantly penalized here?

because you've taken away the I/O buffering and replaced it with *extra* work to look for \n?

> > the traditional behavior of trusting the C library's buffering and having just one exception (that i've ever seen) in grep _is_ sane.
> >
> > i don't know where your misapprehensions about libc's behavior are coming from...
>
> When I first met Denys Vlasenko in person (the guy I handed busybox maintainership off to), it's because he attended a US conference talking about how he'd fixed a glibc bug that caused statically linked binaries with --gc-sections to not flush stdout on exit.
>
> https://elinux.org/images/2/2d/ELC2010-gc-sections_Denys_Vlasenko.pdf
> https://bootlin.com/pub/video/2010/elc/elc2010-vlasenko-dead-code.ogv
>
> (I hit it long before then, but he root-caused it. It was of course an issue where the glibc developers and the gcc developers were pointing at each other and each telling the other to change, and as a result the issue didn't get fixed for something like 5 years. Of course I think that was back when Ulrich "static linking considered harmful" Drepper was still maintainer, https://www.akkadia.org/drepper/no_static_linking.html. He was replaced by a committee that broke up in 2012 https://lwn.net/Articles/488847/ but it's still a gnu/dammit project and that's bad.)
>
> The "tee" command broke so it only passed through a 4k block of data at a time (rather than unbuffered output) _three_times_ over the years I used it. As in broke, got fixed, and broke again, 3 times, all on Linux. And that's not counting busybox breaking it and me fixing it there:
>
> http://lists.busybox.net/pipermail/busybox/2004-May/045755.html
>
> And yes, the fix was to replace fread(), which was 4k buffering, with read().
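(for anyone following along, the shape of that fix is roughly this --- a sketch, not the actual busybox patch, and passthrough() is a made-up name. read() on a pipe hands back whatever bytes are currently available, where an fread() asking for a full 4k block sits on partial data until the block fills or EOF:)

  #include <unistd.h>

  // Forward data as it arrives: each read() returns as soon as any
  // bytes are available, so partial lines flow straight through
  // instead of waiting for a full 4k block to accumulate.
  static void passthrough(int in, int out)
  {
      char buf[4096];
      ssize_t n, off, w;

      while ((n = read(in, buf, sizeof(buf))) > 0) {
          for (off = 0; off < n; off += w) {
              w = write(out, buf + off, n - off);
              if (w < 0) return;
          }
      }
  }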
i'm not claiming you never want line buffering. i'm claiming it's not a good default.

> Here's a quick grep of my blog of _just_ fighting with the "tee" command over the years, not all of which was glibc of course:
>
> https://landley.net/notes-2005.html#23-08-2005
> https://landley.net/notes-2006.html#06-03-2007
> https://landley.net/notes-2008.html#07-10-2008
> https://landley.net/notes-2008.html#03-07-2008
> https://landley.net/notes-2011.html#24-08-2011
>
> The deepest dive to fix a tee issue slightly predates the current iteration of my blog:
>
> https://marc.info/?l=user-mode-linux-devel&m=111056135413631&w=2
>
> And the most _recent_ time glibc broke this was 2017:
>
> https://landley.net/notes-2017.html#08-04-2017
>
> And yes, it was "stdout winds up fully buffered for some unknown reason, same build against musl did not, fix it with gratuitous flushes"...
>
> > they're not true of bionic, nor any version of glibc i've ever seen.
>
> How about if I explicitly ask for _unbuffered_ in the common code (which according to you should be a NOP) and then implement --line-buffered for grep?

yeah, that should be the minimum for me to move forward, but i think we'll be back here later to fix this properly.

> > POSIX explicitly says:
> >
> >     When opened, the standard error stream is not fully buffered; the standard input and standard output streams are fully buffered if and only if the stream can be determined not to refer to an interactive device.
>
> So they don't specify whether it's line buffered or unbuffered, just not _fully_ buffered. And if stdout is fully buffered when you pipe it to tee you get the 4k blocking problem again. So the bug kept showing up because the standard said the bug should show up, great.
>
> Except... the fix in the busybox case was on the _read_ side? It's fully buffering INPUT by default too? That's INSANE. If 4k of data is waiting when you read, you get/process 4k. If less is waiting, PROCESS WHAT YOU GOT...

well, that's an unrelated issue. the order of the two size_t arguments to fread is really important --- it's going to do exactly what you tell it, and they're two pretty different behaviors.
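(concretely --- a sketch of the two call shapes. fread_both_ways() is just an illustrative name, but the return-value semantics in the comments are standard C: fread() counts *complete items*, so the argument order decides what a short read can tell you:)

  #include <stdio.h>

  void fread_both_ways(FILE *in)
  {
      char buf[4096];

      // 4096 items of 1 byte each: the return value is an exact byte
      // count, so a short final read still tells you how much you got.
      size_t n = fread(buf, 1, sizeof(buf), in);

      // 1 item of 4096 bytes: a short final block is zero *complete*
      // items, so this returns 0 and the length of the partial data
      // can't be recovered from the return value at all.
      size_t m = fread(buf, sizeof(buf), 1, in);

      printf("%zu bytes vs %zu whole 4k blocks\n", n, m);
  }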
> > that's the only behavior i've ever seen implemented, and that's the behavior you want. scripts/builds don't get pessimized performance, and interactive users don't have to wait for 1KiB/4KiB/8KiB.
> >
> > (of course, this assumes that the application isn't flushing way too often, which i claim is the real and only bug here.)
>
> Which I just removed. I agree, it was flushing way too often, because I was being more paranoid than necessary about error checking.
>
> But "stdout is unbuffered" and "we fflush(stdout) after write" is the _same_thing_, isn't it?

no, because unbuffered is really expensive. one of my first patches was fixing this bug in dmesg, if you remember :-)

> If it's fully buffered, then "grep | tee" doesn't show you lines as it finds them. (Not with all those extra flushes removed.) If it's unbuffered, we might as well have left xprintf() flushing each write.

that's WAI, though. and exactly why --line-buffered exists. for that one special case that you normally don't need.

> Fully buffered pipeline output means "watch" will never progressively display anything unless it fakes a pty to disable that behavior. (Remember how I rewrote watch so it restarts in a timely fashion rather than blocking arbitrarily long waiting for output to finish and _then_ timing out? I'd have to _undo_ that if I humor this tty weirdness; 80x25=2000, so a full screen is half this buffer...)

> > strictly speaking, glibc-based systems have a helper program (stdbuf(1), from coreutils) that lets you modify the buffering for the program it calls, but no other libc (that i know of) has anything similar, and i've never seen it used. as far as i know, the grep --line-buffered option, which is in both BSD and GNU, is the only place where a command-line utility explicitly offers fine control. and since Android's on-device grep has always been [Net]BSD and the build grep has been GNU grep, both device- and host-side expect this behavior, and control it where necessary with --line-buffered.
>
> When you call write() it makes a system call, invokes the scheduler, copies data between process contexts, often invokes another process that's been waiting for CPU longer, there's cache flushing all over the place (especially with the Spectre mitigations), and it can have a huge impact on performance if your buffering granularity is wrong.

yes, which is why i'm asking you to not default to line buffering, which effectively makes the buffer very small *and* adds extra overhead to decide when to flush.
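(that extra work looks roughly like this --- a minimal sketch of the strrchr()-and-flush scheme from the top of the thread, with made-up names, no overflow handling, and memrchr() instead of strrchr() because the data isn't NUL-terminated; memrchr() is a GNU extension:)

  #define _GNU_SOURCE   // for memrchr()
  #include <string.h>
  #include <unistd.h>

  static char obuf[8192];
  static size_t olen;

  // The per-write cost of line buffering: append, scan the buffered
  // data for the last '\n', flush through it, and keep any trailing
  // partial line. (Sketch only: assumes the data always fits.)
  static void line_buffered_write(const char *data, size_t n)
  {
      char *nl;

      memcpy(obuf + olen, data, n);
      olen += n;
      if ((nl = memrchr(obuf, '\n', olen))) {
          size_t through = nl + 1 - obuf;

          write(1, obuf, through);
          memmove(obuf, obuf + through, olen - through);
          olen -= through;
      }
  }

(so every write pays a scan, and the effective flush granularity is however long your lines happen to be.)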
> At LinuxWorld Expo 2001 Jon "Maddog" Hall told an anecdote in one of his panels about an old mainframe batch job writing its output to reel-to-reel tape a byte at a time, each time 1) spinning the tape up to speed, skipping tape as it did, 2) writing a data packet with a header and footer around the single byte of data, and 3) spinning the tape back down, skipping more tape. This ate a dozen tapes and took hours with someone standing by to switch tapes. He changed the code so it wrote out 64 bytes of data at a time, and suddenly everything fit on a single tape and was done in 15 minutes, and best of all it could run in the middle of the night with nobody standing by and they could collect the tape in the morning.
>
> I have encountered variants of this issue dozens of times over the years, in many different contexts. It's why TCP/IP has Nagle's algorithm, which can be sung to the dreidel song. It's one of the two hard problems in computer science (naming things, cache invalidation, and fencepost errors).

(this is why i'm so confused by this thread and re-sent the patch because i thought you'd come over to my side --- you keep seeming to argue *my* side.)

> What I'm trying to avoid is extra write() system calls and the resulting context switches and cache flushing, which seem likely to penalize performance far more than cache-local fiddling with an 8k buffer.
>
> The "line buffered" mode boils down to strrchr('\n') on the data we just wrote and, if we find one, flush() to that point on the way out. It should never cause _more_ write() calls than unbuffered mode (except in the case where you explicitly fflush() afterwards to output a partial line).

i'm not asking for unbuffered. no-one is asking for unbuffered. only stderr is unbuffered. here's copy & paste direct from bionic:

  MAKE_STD_STREAM(__SRD, STDIN_FILENO),
  MAKE_STD_STREAM(__SWR, STDOUT_FILENO),
  MAKE_STD_STREAM(__SWR|__SNBF, STDERR_FILENO),

(__SNBF is the "this is unbuffered" flag.)

stdout is buffered. that's what the fstat you see in strace is for: if it succeeds, you get an st_blksize buffer. if not, BUFSIZ.
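(as a sketch --- not bionic's actual code, just the shape of that decision; pick_buffer_size() is a made-up name:)

  #include <stdio.h>
  #include <sys/stat.h>

  // Roughly how stdio sizes a stream's buffer on first use: prefer
  // the file's st_blksize, fall back to BUFSIZ if fstat() fails.
  static size_t pick_buffer_size(int fd)
  {
      struct stat st;

      if (!fstat(fd, &st) && st.st_blksize > 0) return st.st_blksize;
      return BUFSIZ;
  }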
> So you still get the vast majority of the performance boost from _not_ having multiple writes per line when grep is calling a bunch of little printf(), but an average output of 50 chars or so is still 1/50th as many system calls and should be the bulk of your performance gain.
>
> > there are definitely places where traditional tools do something stupid, and no one actually wants the stupid, and i'm happy with those deviations. but this is a place where i think toybox is behaving differently to no advantage and some disadvantage, whereas stdio left to its own devices (without constant hammering of the flush button) does a good enough job for both distinct use cases.
>
> Sigh. My next guess would be a --set-block-buffered option for grep (which can call setvbuf(_IOFBF) to put it back easily enough), but that's not what the other one has and I doubt you want that.
>
> The thing is, even if grep were special-cased to fully buffer into pipelines by default (and thus piping the output to tee would be interactively broken), fully buffering is seriously _not_ a performance boost in a lot of cases. In my linux source git checkout I just did (using the Ubuntu 14.04 versions of everything):
>
> $ time bash -c "grep -ir max_range --line-buffered | head -n 2"
> kernel/dma/debug.c:    unsigned int max_range = dma_get_max_seg_size(ref->dev);
> kernel/dma/debug.c:    while (range <= max_range) {
>
> real    0m3.289s
> user    0m3.268s
> sys     0m0.022s
>
> $ time bash -c "grep -ir max_range | head -n 2"
> kernel/dma/debug.c:    unsigned int max_range = dma_get_max_seg_size(ref->dev);
> kernel/dma/debug.c:    while (range <= max_range) {
>
> real    9m5.821s
> user    4m8.223s
> sys     0m7.350s
>
> Yes, that grepped through the .git directory (with history back to 0.0.1), but it's still 3 seconds vs 9 minutes on a 2.4GHz Core i7. Because it chugs through gigabytes more input confirming there aren't more hits before producing _any_ output. (And that was the _second_ run of the big one, to let the cache populate so you don't blame my spinning rust. The short one was uncached.)
>
> On an SMP system the faster the other program can receive the data the faster it can start processing it. The kernel guys redid the pipe infrastructure extensively a few years back to mitigate the cost of ping-ponging between processes, and going from 1 byte block transfers to ~60 byte block transfers should be the bulk of the improvement in terms of context switching. (Heck, the old ATM network protocol used tiny 53-byte cells, and I think the maddog anecdote was 64 byte packets? I have the audio tapes of all the panels somewhere but haven't listened to them in years, and I don't have a cassette player handy.)
>
> So waiting for 4k buffers can be a net _loss_ if you're getting less parallelism. Have you benchmarked the performance penalty here, or are you just going based on the gnu/grep man page's admonition?

for interactive use, i agree that this is fine (and possibly better in some places than the previous excessive flushing). for non-interactive use, no, i don't currently have a specific case where this is problematic. i'm just worried about being needlessly wasteful, and worried about my ability to measure any "death by a thousand cuts" lossage.

> If you care strongly, I can do an if (CFG_TOYBOX_ON_ANDROID) to disable it for you, but then interactive use would suck and it's unlikely to be a measurable performance win in your scripts. And I'd much rather do a --block-buffered command line option to _enable_ it in cases where it turns out to be a thing...
>
> (Also, I'm doing line buffered _input_ and line-at-a-time regex. If grep is a serious performance bottleneck for you, other grep implementations do block-buffered grep input and work out line breaks after they find matches. That's probably more of a win than changing output buffering now that we're not flushing multiple times per line. And yes, xprintf() plus xputc('\n') was flushing at least twice per line before, my bad. Fixed now.)
>
> Rob
_______________________________________________
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net