On Fri, Jun 22, 2018 at 4:06 PM Rob Landley <r...@landley.net> wrote:
>
> On 06/22/2018 03:24 PM, enh wrote:
> > ‘fmt’ prefers breaking lines at the end of a sentence, and tries to
> > avoid line breaks after the first word of a sentence or before the last
> > word of a sentence. A “sentence break” is defined as either the end of
> > a paragraph or a word ending in any of ‘.?!’, followed by two spaces or
> > end of line, ignoring any intervening parentheses or quotes. Like TeX,
> > ‘fmt’ reads entire “paragraphs” before choosing line breaks; the
> > algorithm is a variant of that given by Donald E. Knuth and Michael F.
> > Plass in “Breaking Paragraphs Into Lines”, ‘Software—Practice &
> > Experience’ 11, 11 (November 1981), 1119–1184.
>
> So the change of indentation is being interpreted as a paragraph break and
> causing it to behave differently. For a definition of differently that seems
> more or less random here, but ok.
>
> *shrug* I could implement some sort of "last word ended with ispunct() and the
> next word is short and would otherwise be the last word on the line"
> detection, but... not well defined and doesn't seem worth it?
>
> The two spaces after period thing went away in the 90's because html squashed
> all whitespace into a single space so you'd have to &nbsp; if you wanted an
> extra space after a period, and the tiny minority that bothered circa 1993 got
> lost in the noise. After a few years of everybody seeing text with one space
> after periods, anything else looked silly. At this point it's been stone dead
> for well over a decade.
>
> And when I posted about it on twitter recently somebody pointed out that one
> space after period was a macintosh peculiarity (as mentioned in the book "The
> Mac is not a Typewriter"), and since Tim Berners-Lee implemented the first web
> browser on a NeXT box he might have picked it up from there:
>
> https://twitter.com/steveax/status/1007482609838931969

i think it might have been an American thing. i first learned this was
a thing from reading Knuth. i don't remember ever having double-spaced.
who could afford that on a 40-column display? but then i can't be
trusted to use capital letters most of the time. the original fortran
source for adventure doesn't double-space :-)

> But then again treating space, runs of space, and newline all the same
> (resulting in a single space with line breaks as appropriate) is also really
> simple programming, so maybe it was just that. :)
>
> >> If you remove the space after the newline they match, but testing fmt
> >> without indentation is missing like half the logic? I made the existing
> >> tests pass, but I want to add tests to actually test what the new one is
> >> doing, like measuring and preserving tab/space mixes in indents. But fmt
> >> turns into weird corner case city. I ran the README and main.c through it
> >> when developing it, but that's not a stable test I can put in the test
> >> suite...
> >
> > yeah, i hit this too, and most of my testing was done manually with
> > toybox's README. (sorry, i think the gap between me starting on fmt
> > and actually sending it in was large enough that i'd forgotten these
> > details.)
>
> *shrug* I'm comfortable enough to promote it, just trying to figure out what
> the test cases should be. I wasn't previously a regular user of fmt and dunno
> what success looks like here. :)

as long as i can !!fmt when git commit drops me into vi...

> I should add a test to make sure tabs in front get retained as such though.
> The code should be doing it, I just need a test... Um, the other one _is_
> doing that, right?
>
> $ echo -e '\thello\n\tworld' | fmt | hexdump -C
> 00000000 09 68 65 6c 6c 6f 20 77 6f 72 6c 64 0a |.hello world.|
> $ echo -e '\thello\n\tworld' | ./fmt | hexdump -C
> 00000000 09 68 65 6c 6c 6f 20 77 6f 72 6c 64 0a |.hello world.|
>
> Yup, consistency!
>
> $ echo -e '\thello\n world' | ./fmt | hexdump -C
> 00000000 09 68 65 6c 6c 6f 20 77 6f 72 6c 64 0a |.hello world.|
> $ echo -e '\thello\n world' | fmt | hexdump -C
> 00000000 09 68 65 6c 6c 6f 20 77 6f 72 6c 64 0a |.hello world.|
>
> Bwahahaha!
>
> Ok, now I'm curious:
>
> $ echo -e '\thello\n world and then more' | fmt -w 20 | hexdump -C
> 00000000 09 68 65 6c 6c 6f 0a 09 77 6f 72 6c 64 20 61 6e |.hello..world an|
> 00000010 64 0a 09 74 68 65 6e 20 6d 6f 72 65 0a |d..then more.|
> $ echo -e '\thello\n world and then more' | ./fmt -w 20 | hexdump -C
> 00000000 09 68 65 6c 6c 6f 20 77 6f 72 6c 64 0a 20 20 20 |.hello world.   |
> 00000010 20 20 20 20 20 61 6e 64 20 74 68 65 6e 0a 20 20 |     and then.  |
> 00000020 20 20 20 20 20 20 6d 6f 72 65 0a |      more.|
>
> Yeah, they're being less lazy than I am. (I indent with whatever the current
> line I'm splitting was used to indent with, provided the whitespace width
> count is consistent so it's the same paragraph. They're recording the string
> the paragraph _started_ with. I don't think I care enough to fix it, it
> should _look_ consistent and the inconsistency was in the input...)
>
> So what happens when... Nope, that's _not_ what they're doing:
>
> $ echo -e ' hello\n\t\tworld and then more' | ./fmt -w 20 | hexdump -C
> 00000000 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 |                |
> 00000010 68 65 6c 6c 6f 0a 09 09 77 6f 72 6c 64 0a 09 09 |hello...world...|
> 00000020 61 6e 64 0a 09 09 74 68 65 6e 0a 09 09 6d 6f 72 |and...then...mor|
> 00000030 65 0a |e.|
>
> ???
>
> $ echo -e ' hello and then we wrap because\n\t\tworld and then more' | ./fmt -w 25 | hexdump -C
> 00000000 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 |                |
> 00000010 68 65 6c 6c 6f 0a 20 20 20 20 20 20 20 20 20 20 |hello.          |
> 00000020 20 20 20 20 20 20 61 6e 64 20 74 68 65 6e 0a 20 |      and then. |
> 00000030 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 77 |               w|
> 00000040 65 20 77 72 61 70 0a 20 20 20 20 20 20 20 20 20 |e wrap.         |
> 00000050 20 20 20 20 20 20 20 62 65 63 61 75 73 65 0a 09 |       because..|
> 00000060 09 77 6f 72 6c 64 0a 09 09 61 6e 64 20 74 68 65 |.world...and the|
> 00000070 6e 0a 09 09 6d 6f 72 65 0a |n...more.|
>
> Nope, not going down this rathole. In the absence of a specification, I think
> I'll stick with what I've got.

the trivial algorithm has been good enough for me since 1992.

> Planning to cut a release this weekend...
>
> Rob
_______________________________________________
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net
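
For reference, the "trivial algorithm" discussed in this thread can be sketched as a greedy fill. This is only an illustration, not the toybox or GNU code: the names `fmt` and `indent_width` are hypothetical, tabs are assumed to expand to 8-column stops, a change in indent width is assumed to start a new paragraph, and every output line reuses the indent string of the paragraph's first line. It ignores sentence breaks entirely.

```python
def indent_width(prefix, tabstop=8):
    """Display width of leading whitespace, expanding tabs to tab stops."""
    width = 0
    for ch in prefix:
        width = (width // tabstop + 1) * tabstop if ch == "\t" else width + 1
    return width


def fmt(text, width=75):
    """Greedy fill: pack words onto lines no wider than `width` columns."""
    out = []
    indent, words = "", []

    def flush():
        # Emit the pending paragraph, reusing the indent of its first line.
        nonlocal words
        if not words:
            return
        start = indent_width(indent)
        line, col = indent + words[0], start + len(words[0])
        for word in words[1:]:
            if col + 1 + len(word) > width:
                out.append(line)
                line, col = indent + word, start + len(word)
            else:
                line, col = line + " " + word, col + 1 + len(word)
        out.append(line)
        words = []

    for raw in text.splitlines():
        body = raw.lstrip(" \t")
        prefix = raw[: len(raw) - len(body)]
        if not body:
            # Blank line: end the paragraph and keep the blank line.
            flush()
            out.append("")
            continue
        if words and indent_width(prefix) != indent_width(indent):
            flush()  # indent width changed: treat it as a new paragraph
        if not words:
            indent = prefix  # remember the first line's indent string
        words.extend(body.split())
    flush()
    return "".join(line + "\n" for line in out)
```

With these assumptions, `fmt("\thello\n\tworld\n")` joins the two lines behind a single leading tab, and a line indented with eight spaces after a tab-indented line is treated as the same paragraph because the widths match. Whether to reuse the first line's indent string or the current line's is exactly the kind of unspecified corner case the thread runs into.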