I should clarify a couple of points here since I was feeling grumpy when I wrote the following, and that made me forget things.
At 2023-04-27T09:45:40-0500, G. Branden Robinson wrote:
> We're re-covering some familiar ground here.
>
> I have a few points I'd like to make.
>
> 1.  "Semantic newlines" is a terrible term.

I should have said "_Warn on_ semantic newlines" is a terrible
instruction/summary. They are what we _don't_ want to warn about upon
encountering them.

If man-pages(7) or other people continue to call the practice of
breaking *roff input lines after sentence-ending punctuation "semantic
newlines", I have no complaint. It could also be called "Kernighan
breaking", in honor of an early popularizer of the practice.

> 2.  Bjarni's comment '"groff" is not the right tool for such things,
>     but "grep" is.' is thoroughly wrong-headed and Ingo was right to
>     reject it with great force. Here a few reasons why. I don't
>     think any of B through D are relevant to mandoc(1) since it
>     doesn't support the features in question (as far as I know).
>
>     A.  The formatter decides where sentence boundaries are based on
>         its input.
>
>     B.  Use of the `cflags' request can change the characters that
>         have sentence-ending semantics. grep(1) cannot know this.
>
>     C.  Sentence-ending characters are subject to character
>         translation (the `tr` request). grep(1) cannot know this.
>
>     D.  The user/document could define a special character that is a
>         sentence-ending character (with `char` and `cflags`). grep(1)
>         cannot know this.

E.  Because '.', '?', and '!' are valid characters in *roff
    identifiers, grep(1) can be fooled by special character, register,
    or string interpolations in the input if their identifiers use
    those characters.

    Example:

        I can't believe \*(I. ate the whole thing.

    It is only valid to detect the end of a sentence here if the
    (recursive) _expansion_ of the `I.` string ends with a
    sentence-ending punctuation character.

    Further, since string interpolations can result in further string
    interpolations, a finite-state automaton will not suffice to
    analyze this input. You need a stack machine.

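
To make point E concrete, here is a minimal sketch in Python. It is not
how any real tool works. The string definitions in STRINGS are invented
for illustration (a document would set them up with `ds` requests), only
the two-character \*(xx interpolation form is handled, and the NAIVE
pattern is just a stand-in for whatever a grep-based check might use.

```python
import re

# Hypothetical string definitions, invented for this example.  In a
# real document they would come from earlier `ds` requests, which a
# line-oriented grep never sees.
STRINGS = {
    "I.": r"\*(Me",  # this string interpolates another string
    "Me": "I",
}

# Matches only the two-character interpolation form \*(xx.
INTERP = re.compile(r"\\\*\((..)")

# Stand-in for a grep-style check: sentence-ending punctuation
# followed by spaces and more text on the same line.
NAIVE = re.compile(r"[.?!] +\S")


def expand(text, strings, depth=0, limit=100):
    """Recursively expand \\*(xx interpolations found in text."""
    if depth > limit:
        raise RecursionError("string interpolation nested too deeply")

    def repl(match):
        # Undefined strings expand to nothing in this sketch.
        value = strings.get(match.group(1), "")
        return expand(value, strings, depth + 1, limit)

    return INTERP.sub(repl, text)


line = r"I can't believe \*(I. ate the whole thing."

print(bool(NAIVE.search(line)))      # True: the '.' in the identifier fools it
expanded = expand(line, STRINGS)
print(expanded)                      # I can't believe I ate the whole thing.
print(bool(NAIVE.search(expanded)))  # False: no mid-line sentence ending
```

The pattern fires on the raw input line and stays quiet on the expanded
text, and the expansion needed both the recursion and the definitions.
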
(IIRC, a stack machine recognizes "context-free" languages; the
"recursively enumerable" ones need a full Turing machine.) This is
categorically not what regular expressions can cope with, formally.[1]
My vague understanding of modern regex implementations is that they are
not finite state automata; the drive for extra features has caused them
to add limited support for non-regular languages. (If memory and
comprehension serve, "backreferences" in matches, like
"grep 'foo\(bar\)baz\1qux'", were the camel's nose admitting unbounded
memory usage to the regex interpreters of the land. Perl added many
more.[2])

But even knowing that modern regex engines aren't (more precisely:
don't construct) strict finite state machines doesn't save you; they
still understand only their own grammar, not *roff's, so they have no
way of knowing how a *roff string will ultimately expand.

And, to put a bow on that observation, by the time a grep(1) is looking
at the line above, it has already discarded all of the input that set
up the string definitions it would need to know.

So that's yet another reason why, if mid-input line sentence endings
are to be warned about, they must be detected in the formatter, or an
interpreter for so much of the formatter's grammar that one might as
well write a formatter.

I think this is one reason all of the deroff(1) projects in the world
have died. Eventually they will all fail given a sufficiently complex
input.

I don't have a theorem/proof to back this up, but my hunch is that
since *roff is a Turing-complete language, deciding what a *roff
formatter will output with "all of the formatting stripped away" is
equivalent to solving the halting problem.

It occurs to me that the right way to attack the problem of extracting
the text from a *roff document is to scrape it out of the
device-independent output format. Only a handful of commands in that
language produce text glyphs, and they are easy to parse.
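
A minimal sketch of that scraping idea, in Python, under stated
assumptions. The command letters (t, u, c, C, N, w, n) are taken from
my reading of groff_out(5); the function name extract_text and its
decode_index parameter are hypothetical. Anything that is not a
text-producing command is simply skipped. Passing chr as decode_index
assumes the device's glyph indices are ASCII codes, which is plausible
for -Tascii and nothing more.

```python
def extract_text(lines, decode_index=None):
    """Collect text from troff device-independent output lines.

    decode_index, if given, maps a glyph index (the `N` command) to a
    string; without it, indexed glyphs are left in symbolic form.
    """
    out = []
    for raw in lines:
        line = raw.strip()
        if not line or line[0] in "x#":
            continue                    # device control or comment
        if line[0] == "w":              # word space, maybe followed by a motion
            out.append(" ")
            line = line[1:]
            if not line:
                continue
        cmd, arg = line[0], line[1:]
        if cmd == "t":                  # a word of ordinary glyphs
            words = arg.split()
            out.append(words[0] if words else "")
        elif cmd == "u":                # track-kerned word: "u<n> <word>"
            parts = arg.split(None, 1)
            out.append(parts[1] if len(parts) > 1 else "")
        elif cmd == "c":                # one ordinary glyph
            out.append(arg.lstrip()[:1])
        elif cmd == "C":                # special character, kept by name
            out.append("\\[" + arg.strip() + "]")
        elif cmd == "N":                # glyph addressed by index
            n = int(arg)
            out.append(decode_index(n) if decode_index else f"\\N'{n}'")
        elif cmd == "n":                # end of an output line
            out.append("\n")
        # motions, font, size, color, and drawing commands carry no text
    return "".join(out)


SAMPLE = """x T ascii
x res 240 24 40
x init
p1
x font 1 R
f1
s10
V40
H0
md
DFd
N72
H24
N69
h24
N76
h24
N76
h24
N79
h24
N44
wh48
N87
h24
N79
h24
N82
h24
N76
h24
N68
h24
n40 0
x trailer
V2640
x stop""".splitlines()

print(extract_text(SAMPLE, decode_index=chr), end="")  # HELLO, WORLD
```

Without decode_index the same input comes back as a run of \N'72'
style escapes, so the text is still concealed.
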

This _still_ isn't a 100% solution; access to the current font's glyphs
by their index values can still conceal text.[3][4] But it strikes me
as a far more reliable approach to several nines of efficacy in this
task than any other I've seen.

But as far as I know no one has ever done this. I admit that I'm
baffled why not.

Regards,
Branden

[1] I get the impression that Jeffrey Friedl quit updating his O'Reilly
    book on regular expressions because he kept getting punked on the
    Internet by (pseudo?)academics over the distinction between
    "regexes" (Unixy stuff that supports backreferences and all kinds
    of other un-Kleene extensions) and regular expressions "proper".

    While the distinction is useful--especially if you're a programmer
    and have decided to bite off the task of writing a regex matcher
    for yourself--the choice of terminology is poor because it's not
    distinct _enough_. It's extremely predictable that anyone not
    trained in automata theory is going to infer that "regex" is an
    abbreviation for "regular expression".

    What Unix people should do is simply be frank that software
    practitioners apply the term "regular expression" more broadly than
    computation theorists do. It's like how that neighbor of yours who
    is convinced of the healing power of crystals is concerned about
    the "chemicals" in our food...

[2] And now I know why the camel was chosen as Perl's O'Reilly mascot.

[3] Demonstration:

    $ printf '\\N@72@\\N@69@\\N@76@\\N@76@\\N@79@\\N@44@ \\N@87@\\N@79@\\N@82@\\N@76@\\N@68\n' | troff -Tascii
    x T ascii
    x res 240 24 40
    x init
    p1
    x font 1 R
    f1
    s10
    V40
    H0
    md
    DFd
    N72
    H24
    N69
    h24
    N76
    h24
    N76
    h24
    N79
    h24
    N44
    wh48
    N87
    h24
    N79
    h24
    N82
    h24
    N76
    h24
    N68
    h24
    n40 0
    x trailer
    V2640
    x stop

[4] And if you know how the font is encoded, you are still not
    defeated. Historically, device-independent troffs do not report
    this information, but it would be straightforward to extend groff
    to do so.