At 2023-10-14T20:51:27-0600, Antonio Russo wrote:
> I discovered a new pet peeve today: if you search for a command in a
> manual page, say -e in man 1 zgrep, it's a crapshoot whether just
> searching for '-e' will find the command or not. The reason is that
> "-" may have been accidentally encoded as ‐ instead of -.
You can blame me for this.

https://git.savannah.gnu.org/cgit/groff.git/tree/NEWS?h=1.23.0#n206

...me, and man page authors who don't think about whether they intend a
hyphen or a minus sign when they strike the '-' key...

Quick background: in the context of Unix usage as documented by
nroff/troff, the dash used at the shell prompt, in text editors, and in
programming language source code is a "minus sign". troff has had an em
dash special character as well since the mid-1970s; groff adds an en
dash, and furthermore supports user definition of characters providing
access to any other sort of dash that comes down the Unicode pike. (Not
that doing so is a good idea in a man page; see below regarding a
"restricted dialect" of man(7).)

> Now, depending on your email client and settings, the above will
> appear to be the ravings of an unhinged lunatic who wrote the same
> thing twice, or an unhinged lunatic who slammed their fists onto the
> keyboard.

This issue does indeed have a history of provoking unhinged lunacy.

Before we proceed, you might wish to be aware of
<https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1041731> and its
proposed remedy.

> The reason is that man(1) converts bare dashes (0x2D) to hyphens
> (U+2010). These are not the same symbol: searching for one does not
> find the other without some kind of normalization, pasting commands
> with one vs. the other does different things. New users who do not
> understand this will be discouraged trying to read manual pages.
> Chances are, they will fill forums with mundane questions that could
> and should have been addressed by a simple search of a manual page.

I run into this problem, too, since I dogfood my own changes. When
irritated by this, I try the search again, replacing '-' with '.', which
has yet to fail me (and produces false positives surprisingly rarely).

For example, I've recently been playing with the mg(1) editor, and
observed extremely poor discipline in this area.
So I forked it on GitHub and have been preparing a bunch of revisions.
I wrote a sed script to fix its numerous hyphen/dash problems.[1]

> I recently fixed a ton of these in another upstream package with this
> vim "one-liner":
>
> :%s/--\([a-z]\+\)\(-[a-z]\+\)*/\=substitute(submatch(0), '-', '\\-', 'g')/g

My Vimscript is not very sophisticated, but it looks like you're
replacing only hyphens that appear in long option names here. That's
good, as you're unlikely to clobber any hyphens that should _not_ become
minus signs.

Such discernment is important. Many people who want to "solve" this
issue forget (or ignore) that not every '-' is a minus sign. Some are
actual hyphens, as in "long-term effects" and "word-aligned struct
members".

Trying to infer a distinction from white space adjacency also won't
work. Consider the phrases "word- or byte-sized caching" and
"object-based vs. -oriented programming". While sophistication with
compound hyphenated affixes is seldom seen in man pages, we most often
find it where a man page author has taken considerable care with their
technical writing. Such pages are less likely than most to require
revision with blunt instruments like regular expression-based global
search and replace operations.

> However, this requires manual review

Surprisingly often, the composition of high-quality technical
documentation requires the engagement of a human brain.

> and does not fix the '-e' example from zgrep.

Mapping all hyphens and minus signs to a single character, as people
whose blood pressure spikes over this issue tend to promote as a first
resort, is an ineluctably information-discarding operation. In my
opinion, man page source documents are not the correct place to discard
that information. (I acknowledge that you didn't propose such a crude
remedy; I write to anticipate the inevitable follow-ups from people who
will.)
Doing so at rendering time is much more defensible, and happens anyway
on devices that do not distinguish these characters in the first place.

> There are also a whole host of this kind of problem, e.g., dashes in
> URLs that get naively pasted into man pages (another live example I
> just addressed).

Yes, people commonly type URLs and email addresses into man page sources
as they would into an MUA or browser navigation bar. Since U+2010 is
difficult to encode in such things, the man(7) package could help by
performing an automatic character translation in this area. However,
(1) no one's actually asked for this and (2) it would address only a
tiny part of the problem.

The means of "help" I have in mind is employment of the groff man(7)
extension macros `UR`/`UE` and `MT`/`ME`, which remain under-used even
after 14 years in release. I might like to think that offering such a
provision would encourage their adoption, but I can't honestly adopt
that position. I don't see another good way to perform the
transformation, because these are "semantic" macros imputing enough
meaning to the material they bracket that we know we can safely do so.

> I come here with several questions:
>
> - Am I off-base thinking this is a problem?

No--it's a problem, but I might not locate it in the same place(s) you
do.

> - Should we really be using troff to typeset anything in this year
>   2023?

I'm conflicted out on this question.[2]

Keep in mind that the distinction between hyphens and minus signs is
actually _important_ to people doing _typesetting_, as opposed to
reading man pages on terminals, perhaps in haste and under deadline
stress in a workplace.

>   (In particular, if we can make the source text more human-readable,
>   we might be able to leverage LLMs on this wealth of information in
>   the future and automate support. Are LLMs "fluent" in troff? I
>   have not investigated at all.)
I am not an expert in LLMs, but man(7) is a macro package for the
roff(7) language, and roff(7) is Turing-complete. Thus, in principle,
to know even what is being rendered as text, one is faced with a
challenging decidability problem. (This is why "deroff" and "unroff"
tools confess their limitations, and seem always to fall out of use.
Also "groff -a" is very nearly what you want anyway.)

On the bright side, mandoc(1) maintainer Ingo Schwarze and I have put
considerable effort into defining and promoting a restricted dialect of
man(7) that is much more amenable to automated processing of all kinds.
That we do so for different reasons (he maintains a bespoke *roff
interpreter and wants to implement as few features as possible; he also
strongly advocates use of the mdoc(7) macro package over man(7); I on
the other hand want the language of man page composition to be as small
as possible to ease its acquisition and mastery while getting a few
nines of the task done) fortunately doesn't frustrate our cooperation.

The groff_man_style(7) man page in the version of groff to which you
recently upgraded is the fruit of much effort in this area.

> - Are there any alternatives that actually produce nice looking man
>   pages?

Many tools produce acceptable looking man pages when _rendered_
(depending on your standard of good typography). The production of
man(7) source that is idiomatic enough to be maintained in that form, or
even comprehended well enough to drive debugging and development of the
conversion tool, is another question.

Perl's pod2man/podlators is probably the best of breed here, but it
still does not match the cleanliness of a document drafted by a human
author with a good command of the macro language.

At the other end of the spectrum is docbook-to-man, which seems not only
to be reviled by every practitioner of roff/man who encounters it, but
also to poison everyone who attempts to maintain it.
>   (My experience with pandoc is that the source is still awkward, I
>   literally just found another example of this bug in my own man
>   page, and it looks pretty ugly in man. But maybe I just didn't find
>   good examples/documentation.)

pandoc has recently seen some improvements in its man(7) generation.
I've worked fruitfully with its upstream in this area. Feel free to Cc
me with respect to any further revisions you'd like to pursue there.

> - Should we try to come up with some lintian rules to flag this
>   behavior? (This one: /--\([a-z]\+\)\(-[a-z]\+\)*/ finds long
>   GNU-style commands, I'd have to think for at least a little bit
>   about finding short ones. This would ultimately be fragile. For
>   example, the above doesn't find partially broken tokens; i.e., only
>   one unescaped dash.)

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1051357

> <li> Automated tooling around this, more generally, seems fragile.
> HTML might have been a nice compromise, but writing that appears to
> be out of vogue these days, <sarcasm intensity="medium">despite
> being a pretty OK thing to read and write by hand</sarcasm>.</li>
> But seriously, I would love to be writing HTML instead of troff for
> manual pages.

If you want man pages to look the way they traditionally have since Unix
Version 7 (1979), this is a bigger challenge to achieve with HTML than
you might suppose.

If you want only a rough approximation thereof, my guess is that there
are many straightforward and valid approaches one could take. The
challenge would then be in persuading others to adopt your one obviously
optimal solution.

https://xkcd.com/927/

Those who want to know why this exasperating issue arose in the first
place, I refer to section "History" of groff_char(7). The arrival of
the Unicode character set in terminal emulators echoed the delivery of
the Graphic Systems C/A/T phototypesetter to the Bell Labs Computing
Science Research Center in about 1972; the problems that came along were
similar.
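On the lintian question quoted above, here is a minimal sketch of what
such a check might look like, written in Python for illustration only.
The pattern is the Python spelling of the regular expression in the
quoted text; the function name, the constant name, and the lookbehind
that skips an already-escaped leading dash are assumptions of this
sketch, not anything lintian or bug #1051357 is known to implement.

```python
import re

# Hypothetical check: long options typed with bare '-' in man(7) source.
# The lookbehind (an assumption of this sketch) skips matches preceded by
# a backslash or another dash, so "\--foo" and "---" runs are ignored.
BARE_LONG_OPTION = re.compile(r"(?<![\\-])--[a-z]+(?:-[a-z]+)*")


def find_bare_long_options(source: str) -> list[tuple[int, str]]:
    """Return (line number, matched text) for each bare long option."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        for match in BARE_LONG_OPTION.finditer(line):
            hits.append((number, match.group(0)))
    return hits
```

As the quoted text warns, this stays fragile: it says nothing about
short options like '-e', it does not report partially escaped tokens,
and it cannot tell an option from prose that merely looks like one.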
Regards,
Branden

[1] Here's part of one commit message. I haven't pushed any commits to
    my mg fork yet. Long story short, that man page has a lot of
    problems even apart from this one, both from a technical writing
    perspective and from that of mdoc(7) competency, which I find
    noteworthy in light of the stridency of *BSD community partisanship
    on the question of man(7) vs. mdoc(7). But, having met Charles
    Hannum (a NetBSD founder) in person at the Atlanta Linux Showcase
    nearly 25 years ago, I can't say I wasn't prepared.

    Also, I did not bother to tune this sed script for efficiency,
    cleverness, or to show off my command of the language.[3] I did not
    undertake it for its own sake. I built it up by whacking at errors
    until none remained. I share this to illustrate the impotence of a
    crude approach to solving this problem. For example, the character
    sequence "read-only" is sometimes used in prose as an adjective and
    sometimes as an Emacs command literal. The former should keep a
    hyphen; the latter should get a dash. Deciding which one demands a
    higher climb up the Chomsky language hierarchy than a text editor
    generally offers. The solution exists between the keyboard and
    chair, but I guess that's where the resentment of solving it at all
    arises too.

--begin snip--
I produced the change with the following sed script. This process
exposed many failures to use the mdoc `Ic` macro when it was warranted;
had it been employed with discipline, this script would be shorter.
/^\.Nd/b
/^\.Bl/b
/^\.Bd/b
\# skip exceptions
/opened read-only/b
/window-specific/b
/buffer-specific/b
/working-directory/b
/non-incremental/b
/are read-only/b
/are self-explanatory/b
/extended-ascii/b
/two-line/b
/Set case-fold/b
/mini-buffer/b
/Toggle the read-only/b
/global read-only/b
/terminal-specific/b
/8-bit/b
/Multi-byte/b
s/-/\\-/g
\# put these back
s/Control\\/Control/g
s/Meta\\/Meta/g
s/an auto\\-execute/an auto-execute/
s/Toggle auto\\-fill/Toggle auto-fill/
s/mail\\-mode/mail-mode/
s/non\\-whitespace/non-whitespace/
s/Self\\-insert/Self-insert/
s/KNF\\-compliant/KNF-compliant/
s/keyboard\\-invoked/keyboard-invoked/
--end snip--

[2] https://www.gnu.org/software/groff/manual/groff-man-pages.pdf

[3] On a positive and cool note, the following remains unsurpassed, to
    my knowledge, as the coolest, cleverest thing ever done in sed(1).

    https://sed.sourceforge.io/local/scripts/dc.sed.html

    (Now that I've said that, someone can tell me that they've
    implemented an RV32E core in sed...)
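For readers who do not read sed fluently, the script in footnote [1]
works in three phases: leave certain lines alone, escape every remaining
'-', then undo the escapes that were wrong. A rough Python rendering of
that structure follows. The function name is hypothetical, the exception
and repair lists are abbreviated to a few entries from the script, and
the repairs here replace every occurrence on a line, whereas most of the
sed `s` commands above replace only the first.

```python
# Lines starting with these mdoc macros are passed through untouched.
SKIP_MACROS = (".Nd", ".Bl", ".Bd")

# Abbreviated subset of the prose phrases whose hyphens must survive.
SKIP_PHRASES = ("opened read-only", "window-specific", "8-bit")

# Abbreviated subset of the "put these back" repairs: (wrong, right).
RESTORE = (
    ("Control\\", "Control"),
    ("Meta\\", "Meta"),
    ("mail\\-mode", "mail-mode"),
)


def escape_hyphens(line: str) -> str:
    """Escape '-' as '\\-' unless the line is exempt, then repair."""
    # Phase 1: exempt lines are returned unchanged (sed's bare "b").
    if line.startswith(SKIP_MACROS) or any(p in line for p in SKIP_PHRASES):
        return line
    # Phase 2: the blunt instrument, equivalent to s/-/\\-/g.
    line = line.replace("-", "\\-")
    # Phase 3: undo the escapes that should have stayed hyphens.
    for wrong, right in RESTORE:
        line = line.replace(wrong, right)
    return line
```

The hand-maintained exception lists are the point: nothing in the code
can decide whether "read-only" is an adjective or a command name, so
each such case has to be enumerated by a person who read the page.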