At 2023-10-15T10:01:20-0700, Russ Allbery wrote:
> I think my position at this point as pod2man maintainer (not yet
> implemented in podlators) is that every occurrence of - in POD source
> will be translated into \-, rather than using the current heuristics,
> and people who meant to use ‐ should type it directly in the POD
> source.  pod2man now supports Unicode fairly well and will pass that
> along to *roff, which presumably will do the right thing with it after
> character set translation.

It will, as long as something (like preconv(1)) translates the UTF-8
into something GNU troff can understand.

One of the most painful decisions James Clark made was to follow AT&T's
example and use "char" as the fundamental character type, instead of
throwing his elbows with an "int" (or better yet, an int-sized C++ type,
since C++ had real type checking in 1989, while K&R C was still in vogue
and scoffed at such gratuities).[1]

I took a stab at changing this about 3 years ago but it was too big a
bite.  I didn't know enough yet about how the formatter worked.  If I
have n months to set aside I suspect I can get it done on a second
attempt.

Anyway, to illustrate.  (UTF-8 follows.)

$ for n in $(seq 8); do printf 'abc\\[u2010]defgh '; done | nroff | cat -s
abc‐defgh abc‐defgh abc‐defgh abc‐defgh abc‐defgh abc‐defgh abc‐
defgh abc‐defgh

> Currently, pod2man uses an extensive set of heuristics, but I think
> this is a lost cause.  I cannot think of any heuristic that will
> understand that the - in apt-get should be U+002D (so that one can
> search for the command as it is typed), but the - in apt-like should
> be aptlike, since this is an English hyphenated expression talking
> about programs that are similar to apt.  This is simply not
> information that POD has available to it unless the user writing the
> document uses Unicode hyphens.

Yes.  This is the same point I was trying to make with my mg(1) man page
example.

> I believe the primary formatting degradation will be for very long
> hyphenated phrases like super-long-adjectival-phrase-intended-as-a-
> joke, because *roff will now not break on those hyphens that have been
> turned into \-.  People will have to rewrite them using proper Unicode
> hyphens to get proper formatting.

Even that can be overcome.  You can tell groff that a line can be broken
after a minus sign.  But I'm going to stone-facedly require people to
RTFM for that.

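To make the proposed rule concrete, here is a minimal sketch of the
blanket translation Russ describes: every ASCII hyphen-minus becomes \-,
and a U+2010 typed directly passes through untouched.  The
escape_hyphens function is a hypothetical name used only for
illustration; this is plain sed, not podlators code, and it ignores
everything else pod2man has to do.

```shell
# Hypothetical helper, not part of podlators: escape every U+002D in the
# argument as the roff \- escape, leaving all other characters alone.
escape_hyphens() {
    printf '%s\n' "$1" | sed 's/-/\\-/g'
}

# A command name: the hyphen must stay searchable as typed.
escape_hyphens 'apt-get'

# An English compound written with U+2010: contains no U+002D at all,
# so the substitution has nothing to match.
escape_hyphens 'apt‐like'
```

The first call prints apt\-get; the second prints its input unchanged,
which is the whole point of asking authors to type the Unicode hyphen
themselves.
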
The character remapping in the PROBLEMS file is the prescribed band-aid
for those who can't or don't care to fix bad typography in man pages,
and I'd prefer not to see additional cargo cult techniques piled on top
of it.

https://git.savannah.gnu.org/cgit/groff.git/tree/PROBLEMS?h=1.23.0#n82

Regards,
Branden

[1] Just like the omission of bounds checks on array types.  What a
    brilliant efficiency that was.  Jean Ichbiah saw Dennis Ritchie
    coming a mile away in the 1970s, and Ada 83 did the right thing--in
    countless respects.  Compiler authors squealed like pigs in hot oil
    at the idea of doing any amount of static analysis of input--this is
    back when compilers would not _automatically_ pass anything in
    registers at all (_everything_ hit the stack) and common
    subexpression elimination was regarded as a state-of-the-art
    optimization--and spent over a decade slandering Ada's name in every
    forum available to them.

    Nowadays, static analysis is cool and compiler engineers make big,
    big bucks developing its techniques professionally.  And I'll bet
    you those who have even heard of Ada still turn their noses up at
    it.

    Stick around, and I'll share the secret legacy of the hated IA-64...