Update of bug #63074 (project groff):

                  Severity: 3 - Normal => 1 - Wish Item
                Item Group: Incorrect behaviour => Feature change
                    Status: None => Postponed
                   Summary: warning messages when using special
characters in TITLE or AUTHOR => [troff] need a way to embed non-Basic
Latin glyphs in device control commands
_______________________________________________________

Follow-up Comment #9:

Thanks for re-opening this, Dave.  Peter rightly observed that this is
not a problem with mom(7).  It is an absent feature in the formatter
itself.

I am going to quote in slightly edited form my earlier research to the
_groff_ list on this issue.

https://lists.gnu.org/archive/html/groff/2022-09/msg00077.html

...this is our old friend "can't output node in transparent
throughput".  ...I recently disabled these diagnostics by default in
groff Git.  Try regenerating the document with
GROFF_ENABLE_TRANSPARENCY_WARNINGS=1 (actually, you can set the
variable to anything) in your environment.

The problem, I think, is that PDF bookmark generation, like the
`pdfinfo` macro defined in _pdf.tmac_ to include document author and
title metadata, and maybe some other advanced features, relies upon
use of device control escape sequences.  That means '\X' stuff.  In
device-independent output ("grout", as I term it), these become "x X"
commands, and the arguments to the escape sequence are, you'd think,
passed through as-is.  The trouble comes with the assumption people
make about what "as-is" means.

The problem is this: what if we want to represent a non-ASCII
character in the device control escape sequence?  groff's
device-independent output is, up to a point, strictly ISO Basic Latin,
a property we inherited from AT&T troff.  Except in device control
escape sequences.

We have the same problem with the requests that write to the standard
error stream, like `tm`.  I'm not sure that problem is worth solving;
groff's own diagnostic messages are not i18n-ed.  Even if it is worth
solving, teaching device control commands how to interpret more kinds
of "node" seems like a higher priority.

We don't have any infrastructure for handling any character encoding
but the default for input.
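To make the constraint concrete, here is a minimal sketch (in Python,
and not groff's actual implementation) of the 7-bit-cleanliness
property being described: any "x X" argument carrying a character
outside the 94 graphical ASCII code points, spaces, and newlines
breaks it.

```python
# Sketch of the "grout is 7-bit clean" property discussed above.
# is_grout_clean() is a hypothetical checker, not a groff facility.

def is_grout_clean(s: str) -> bool:
    """Return True if every character is one of the 94 graphical
    ASCII code points, a space, or a newline -- the only things GNU
    troff is observed to write to its output."""
    return all(c == "\n" or 0x20 <= ord(c) <= 0x7E for c in s)

print(is_grout_clean("x X ps:exec [/Title (Basic Latin) /OUT pdfmark"))
# -> True
print(is_grout_clean("x X ps:exec [/Author (Cic\u00e9ron) /DOCINFO pdfmark"))
# -> False: the "é" has no place in 7-bit device-independent output
```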
That's ISO Latin-1 for most platforms, but IBM code page 1047 for
OS/390 Unix (I think--no one who runs groff on such a machine has ever
spoken with me of their experiences).  And in practice GNU troff
doesn't, as far as I can tell, ever write anything but the 94
graphical code points in ASCII, spaces, and newlines to its output.

I imagine a lot of people's first instinct to fix this is to say,
"just give groff full Unicode support and enable input and output of
UTF-8"!  That's a huge ask--bug #40720.

A shorter pole might be to establish a protocol for communication of
Unicode code points within device control commands.  Portability isn't
much of an issue here: as far as I know, there has been no effort to
achieve interoperation of device control escape sequences among
troffs.

That convention even _could_ be UTF-8, but my initial instinct is
_not_ to go that way.  I like the 7-bit cleanliness of GNU troff
output, and when I've mused about solving The Big Unicode Problem I
have given strong consideration to preserving it, or enabling
tricked-out UTF-8 "grout" only via an option for the kids who really
like to watch their chrome rims spin.  I realize that Heirloom and
neatroff can both boast of this, but how many people _really_ look at
device-independent troff output?  A few curious people, and the poor
saps who are stuck developing and debugging the implementations, like
me.  For the latter community, a modest and well-behaved format saves
a lot of time.

Concretely, when I run the following command:

GROFF_ENABLE_TRANSPARENCY_WARNINGS=1 ./test-groff -Z -mom -Tpdf -pet \
  -Kutf8 ../contrib/mom/examples/mon_premier_doc.mom

I get the following diagnostics, familiar to all who have built groff
1.22.4 from source.

troff:../contrib/mom/examples/mon_premier_doc.mom:28: error: can't
transparently output node at top level

(The foregoing is document metadata going wrong, tripping over the "é"
in "Cicéron".)
troff:../contrib/mom/examples/mon_premier_doc.mom:13: error: can't
translate character code 233 to special character ''e' in transparent
throughput
troff:../contrib/mom/examples/mon_premier_doc.mom:30: error: can't
translate character code 233 to special character ''e' in transparent
throughput
troff:../contrib/mom/examples/mon_premier_doc.mom:108: error: can't
translate character code 233 to special character ''e' in transparent
throughput
troff:../contrib/mom/examples/mon_premier_doc.mom:136: error: can't
translate character code 232 to special character '`e' in transparent
throughput

(These are section headings of the document being made into PDF
bookmarks.  The headings that happen to be in plain basic Latin have
no such trouble.)

More tellingly, if I page the foregoing output with "less -R", I see
non-ASCII code points screaming out their rage in reverse video.

x X ps:exec [/Author (Cic<E9>ron,) /DOCINFO pdfmark
x X ps:exec [/Dest /pdf:bm4 /Title (1. Les diff<E9>rentes versions)
/Level 2 /OUT pdfmark
x X ps:exec [/Dest /evolution /Title (2. Les <E9>volutions du Lorem)
/Level 2 /OUT pdfmark
x X ps:exec [/Dest /pdf:bm8 /Title (Table des mati<E8>res) /Level 1
/OUT pdfmark

It therefore appears to me that the pdfmark extension to PostScript,
or PostScript itself, happily processes Latin-1...but that means that
it accepts _only_ Latin-1, which forecloses the use of Cyrillic code
points.

I'm a little concerned that we're blindly _feeding_ the device control
commands characters with the eighth bit set.  It's obviously a useful
expedient for documents like mon_premier_doc.mom.  I am curious to
know why, instead of getting no text for headings and titles in the
Cyrillic PDF outline, you didn't get horrendous mojibake garbage--but
plainly Latin-1 garbage at that.

Anyway, some type of mode switching or alternative notation within the
PostScript command stream is required for us to be able to encode
Cyrillic code points.
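For what it's worth, on the PDF side this does appear to be a solved
problem: the PDF text-string type accepts either PDFDocEncoding (a
Latin-1-like single-byte encoding) or UTF-16BE prefixed with a
byte-order mark, which is how titles outside Latin-1 are usually
carried.  Here is a hedged sketch of what emitting a Cyrillic pdfmark
title that way might look like; pdf_text_string() and ps_escape() are
hypothetical helpers, not anything groff or gropdf provides.

```python
# Sketch: encode a PDF text string as UTF-16BE with a BOM, then render
# it as a 7-bit-clean PostScript literal string using octal escapes.

def pdf_text_string(s: str) -> bytes:
    """Encode s as a PDF text string: UTF-16BE with a leading BOM."""
    return b"\xfe\xff" + s.encode("utf-16-be")

def ps_escape(data: bytes) -> str:
    """Render bytes as a PostScript literal string, escaping the
    delimiters and writing non-printable bytes as octal escapes."""
    out = []
    for b in data:
        if b in (0x28, 0x29, 0x5C):        # ( ) \ must be escaped
            out.append("\\" + chr(b))
        elif 0x20 <= b <= 0x7E:
            out.append(chr(b))
        else:
            out.append("\\%03o" % b)
    return "(" + "".join(out) + ")"

# A Cyrillic heading ("Оглавление", i.e. "table of contents") survives:
title = ps_escape(pdf_text_string("\u041e\u0433\u043b\u0430\u0432\u043b\u0435\u043d\u0438\u0435"))
print("[/Title %s /Level 1 /OUT pdfmark" % title)
```

Note that the emitted string is itself pure ASCII, so a convention
like this would not even disturb grout's 7-bit cleanliness.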
And once we've figured out what that is, maybe we can teach GNU troff
something about it.  The answer might be to do just whatever works for
PostScript and PDF, since I assume this problem has been solved
already, but it also might mean having our own escaping protocol,
which the output drivers then interpret.

I know of three places it would make sense to support the output of
UTF-8, and until I encounter a problem I see no reason not to employ
the same solution for all three.

1. We need to be able to put multibyte UTF-8 sequences into PDFs.
That means encoding them in _grout_ as "x X" device commands.  That in
turn means being able to encode them in `\X` escape sequences and
`.device` requests.  `\Y` and `.devicem` may have to wait for the
resolution of bug #40720, since they will require groff to be able to
store UTF-8 code points internally.  Or maybe not, since we already
have \[uXXXX].

2. The `tm`, `tm1`, `tmc`, `ab`, and `rd` requests all write to
standard error.

3. The `cf`, `lf`, `nx`, `open`, `opena`, `psbb`, and `trf` requests
all expect to be able to express (to standard error, in the case of
`lf`) or, importantly, _open_ files by name from the filesystem.
Right now groff doesn't have a story for being able to open
UTF-8-encoded file names that use continuation bytes.

I can think of two approaches to take.

A. Re-use Unicode named glyph notation \[uXXXX] in these contexts.
The advantages are that we don't need to track any kind of shift state
while processing them (see below), the notation will be familiar to
experienced groff users, is already explained in groff_char(7) and our
Texinfo manual, and its purpose is deducible by novices who _haven't_
read the documentation.

B. We could employ a couple of C0 control characters that groff
doesn't already use internally, like STX and ETX (Control+B and
Control+C), to shift in and out of a "verbatim" mode where any bytes
encountered between the shift characters are given as-is to the next
layer of the interface.
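Approach (A) is easy to sketch.  The round trip below (illustrative
Python, not anything groff does today) shows the stateless property
claimed for it: each non-ASCII character becomes a self-delimiting
\[uXXXX] token, and decoding needs no shift state at all.

```python
# Sketch of approach (A): carry non-ASCII code points through a 7-bit
# channel as groff-style \[uXXXX] special character notation.

import re

def encode_uXXXX(s: str) -> str:
    """Replace each non-ASCII character with \\[uXXXX] notation."""
    return "".join(
        c if ord(c) < 0x80 else "\\[u%04X]" % ord(c)
        for c in s
    )

def decode_uXXXX(s: str) -> str:
    """Turn \\[uXXXX] tokens back into the characters they name."""
    return re.sub(r"\\\[u([0-9A-Fa-f]{4,6})\]",
                  lambda m: chr(int(m.group(1), 16)), s)

arg = encode_uXXXX("Cic\u00e9ron")
print(arg)                  # -> Cic\[u00E9]ron  (7-bit clean)
print(decode_uXXXX(arg))    # -> Cicéron
```

An output driver could apply the decoding step to "x X" arguments and
re-encode the result however its device requires.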
(So, they'd appear in grout [case 1], or would end up in arguments to
C library calls: fprintf(stderr, ...) [case 2, and `lf` in case 3],
and fopen() or similar [the rest of case 3].)

(A) has disadvantages.  One is that it's kind of an abuse of the
special character/named glyph notation; the whole point of these is
that they _don't_ become formatted glyphs.  They are merely a way to
encode integers.  Another problem is that there's no obvious way to
adapt this to any encoding _but_ UTF-8.  A wisenheimer can say that
someone can always re-encode the output if they need to, but that
remedy is not available for the file-opening case.  I take a dark view
of advising groff users to write an open()-intercepting wrapper in C
to be used with LD_PRELOAD.

(B) is more flexible--more accommodating of other character
encodings--and seems more like the old-school Unix way to handle the
problem, especially back in the days of sundry character encodings
warring for supremacy, but, at least as I've sketched it, has the
problem that you can't encode the ETX (Control+C) character in the
"verbatim region".  Eventually, that limitation will bite someone.

I'm leaning toward (A) because the importance of all encodings other
than UTF-8 is dwindling.  But there is plenty of time to wrangle over
this and for brilliant new ideas to be pitched, since I see no
prospect of this work being undertaken for the groff 1.23 release.

_______________________________________________________
Reply to this item at:

  <https://savannah.gnu.org/bugs/?63074>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
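To make the (B) objection concrete, here is a sketch of the STX/ETX
framing as described above (again illustrative Python, with
hypothetical helpers), including the failure mode: a payload that
itself contains ETX is silently truncated, because the scheme as
sketched has no way to escape the terminator.

```python
# Sketch of approach (B): bracket verbatim bytes between STX
# (Control+B) and ETX (Control+C) so a downstream layer passes them
# through untouched.

STX, ETX = "\x02", "\x03"

def frame_verbatim(payload: str) -> str:
    """Wrap payload in a verbatim region."""
    return STX + payload + ETX

def extract_verbatim(s: str) -> str:
    """Recover the bytes between the first STX and the next ETX."""
    start = s.index(STX) + 1
    end = s.index(ETX, start)
    return s[start:end]

print(extract_verbatim(frame_verbatim("Cic\u00e9ron")))
# -> Cicéron

# The limitation: an ETX inside the payload ends the region early.
print(extract_verbatim(frame_verbatim("abc" + ETX + "def")))
# -> abc   ("def" is lost)
```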