Update of bug #63074 (project groff):

                  Severity: 3 - Normal => 1 - Wish Item
                Item Group: Incorrect behaviour => Feature change
                    Status: None => Postponed
                   Summary: warning messages when using special
characters in TITLE or AUTHOR => [troff] need a way to embed non-Basic
Latin glyphs in device control commands
_______________________________________________________

Follow-up Comment #9:

Thanks for re-opening this, Dave.  Peter rightly observed that this is
not a problem with mom(7).  It is an absent feature in the formatter
itself.

I am going to quote in slightly edited form my earlier research to the
_groff_ list on this issue.

https://lists.gnu.org/archive/html/groff/2022-09/msg00077.html

...this is our old friend "can't output node in transparent
throughput".  ...I recently disabled these diagnostics by default in
groff Git.  Try regenerating the document with
GROFF_ENABLE_TRANSPARENCY_WARNINGS=1 (actually, you can set the
variable to anything) in your environment.

The problem, I think, is that PDF bookmark generation, like the
`pdfinfo` macro defined in _pdf.tmac_ to include document author and
title metadata, and maybe some other advanced features, relies upon
use of device control escape sequences.  That means '\X' stuff.  In
device-independent output ("grout", as I term it), these become "x X"
commands, and the arguments to the escape sequence are, you'd think,
passed through as-is.  The trouble comes with the assumption people
make about what "as-is" means.

The problem is this: what if we want to represent a non-ASCII
character in the device control escape sequence?  groff's
device-independent output is, up to a point, strictly ISO Basic Latin,
a property we inherited from AT&T troff.  Except in device control
escape sequences.

We have the same problem with the requests that write to the standard
error stream, like `tm`.  I'm not sure that problem is worth solving;
groff's own diagnostic messages are not i18n-ed.  Even if it is worth
solving, teaching device control commands how to interpret more kinds
of "node" seems like a higher priority.

We don't have any infrastructure for handling any character encoding
but the default for input.
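To make the constraint concrete, here is a minimal sketch (in Python,
and not groff's actual implementation) of the 7-bit-cleanliness
property being described: any "x X" argument carrying a character
outside the 94 graphical ASCII code points, spaces, and newlines
breaks it.

```python
# Sketch of the "grout is 7-bit clean" property discussed above.
# is_grout_clean() is a hypothetical checker, not a groff facility.

def is_grout_clean(s: str) -> bool:
    """Return True if every character is one of the 94 graphical
    ASCII code points, a space, or a newline -- the only things GNU
    troff is observed to write to its output."""
    return all(c == "\n" or 0x20 <= ord(c) <= 0x7E for c in s)

print(is_grout_clean("x X ps:exec [/Title (Basic Latin) /OUT pdfmark"))
# -> True
print(is_grout_clean("x X ps:exec [/Author (Cic\u00e9ron) /DOCINFO pdfmark"))
# -> False: the "é" has no place in 7-bit device-independent output
```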
That's ISO Latin-1 for most platforms, but IBM code page 1047 for
OS/390 Unix (I think--no one who runs groff on such a machine has ever
spoken with me of their experiences).  And in practice GNU troff
doesn't, as far as I can tell, ever write anything but the 94
graphical code points in ASCII, spaces, and newlines to its output.

I imagine a lot of people's first instinct to fix this is to say,
"just give groff full Unicode support and enable input and output of
UTF-8"!  That's a huge ask--bug #40720.

A shorter pole might be to establish a protocol for communication of
Unicode code points within device control commands.  Portability isn't
much of an issue here: as far as I know, there has been no effort to
achieve interoperation of device control escape sequences among
troffs.

That convention even _could_ be UTF-8, but my initial instinct is
_not_ to go that way.  I like the 7-bit cleanliness of GNU troff
output, and when I've mused about solving The Big Unicode Problem I
have given strong consideration to preserving it, or enabling
tricked-out UTF-8 "grout" only via an option for the kids who really
like to watch their chrome rims spin.  I realize that Heirloom and
neatroff can both boast of this, but how many people _really_ look at
device-independent troff output?  A few curious people, and the poor
saps who are stuck developing and debugging the implementations, like
me.  For the latter community, a modest and well-behaved format saves
a lot of time.

Concretely, when I run the following command:

GROFF_ENABLE_TRANSPARENCY_WARNINGS=1 ./test-groff -Z -mom -Tpdf -pet \
  -Kutf8 ../contrib/mom/examples/mon_premier_doc.mom

I get the following diagnostics, familiar to all who have built groff
1.22.4 from source.

troff:../contrib/mom/examples/mon_premier_doc.mom:28: error: can't
transparently output node at top level

(The foregoing is document metadata going wrong, tripping over the "é"
in "Cicéron".)
troff:../contrib/mom/examples/mon_premier_doc.mom:13: error: can't
translate character code 233 to special character ''e' in transparent
throughput
troff:../contrib/mom/examples/mon_premier_doc.mom:30: error: can't
translate character code 233 to special character ''e' in transparent
throughput
troff:../contrib/mom/examples/mon_premier_doc.mom:108: error: can't
translate character code 233 to special character ''e' in transparent
throughput
troff:../contrib/mom/examples/mon_premier_doc.mom:136: error: can't
translate character code 232 to special character '`e' in transparent
throughput

(These are section headings of the document being made into PDF
bookmarks.  The headings that happen to be in plain basic Latin have
no such trouble.)

More tellingly, if I page the foregoing output with "less -R", I see
non-ASCII code points screaming out their rage in reverse video.

x X ps:exec [/Author (Cic<E9>ron,) /DOCINFO pdfmark
x X ps:exec [/Dest /pdf:bm4 /Title (1. Les diff<E9>rentes versions)
/Level 2 /OUT pdfmark
x X ps:exec [/Dest /evolution /Title (2. Les <E9>volutions du Lorem)
/Level 2 /OUT pdfmark
x X ps:exec [/Dest /pdf:bm8 /Title (Table des mati<E8>res) /Level 1
/OUT pdfmark

It therefore appears to me that the pdfmark extension to PostScript,
or PostScript itself, happily processes Latin-1...but that means that
it accepts _only_ Latin-1, which forecloses the use of Cyrillic code
points.

I'm a little concerned that we're blindly _feeding_ the device control
commands characters with the eighth bit set.  It's obviously a useful
expedient for documents like mon_premier_doc.mom.  I am curious to
know why, instead of getting no text for headings and titles in the
Cyrillic PDF outline, you didn't get horrendous mojibake garbage--but
plainly Latin-1 garbage at that.

Anyway, some type of mode switching or alternative notation within the
PostScript command stream is required for us to be able to encode
Cyrillic code points.
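For what it's worth, on the PDF side this does appear to be a solved
problem: the PDF text-string type accepts either PDFDocEncoding (a
Latin-1-like single-byte encoding) or UTF-16BE prefixed with a
byte-order mark, which is how titles outside Latin-1 are usually
carried.  Here is a hedged sketch of what emitting a Cyrillic pdfmark
title that way might look like; pdf_text_string() and ps_escape() are
hypothetical helpers, not anything groff or gropdf provides.

```python
# Sketch: encode a PDF text string as UTF-16BE with a BOM, then render
# it as a 7-bit-clean PostScript literal string using octal escapes.

def pdf_text_string(s: str) -> bytes:
    """Encode s as a PDF text string: UTF-16BE with a leading BOM."""
    return b"\xfe\xff" + s.encode("utf-16-be")

def ps_escape(data: bytes) -> str:
    """Render bytes as a PostScript literal string, escaping the
    delimiters and writing non-printable bytes as octal escapes."""
    out = []
    for b in data:
        if b in (0x28, 0x29, 0x5C):        # ( ) \ must be escaped
            out.append("\\" + chr(b))
        elif 0x20 <= b <= 0x7E:
            out.append(chr(b))
        else:
            out.append("\\%03o" % b)
    return "(" + "".join(out) + ")"

# A Cyrillic heading ("Оглавление", i.e. "table of contents") survives:
title = ps_escape(pdf_text_string("\u041e\u0433\u043b\u0430\u0432\u043b\u0435\u043d\u0438\u0435"))
print("[/Title %s /Level 1 /OUT pdfmark" % title)
```

Note that the emitted string is itself pure ASCII, so a convention
like this would not even disturb grout's 7-bit cleanliness.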
And once we've figured out what that is, maybe we can teach GNU troff
something about it.  The answer might be to do just whatever works for
PostScript and PDF, since I assume this problem has been solved
already, but it also might mean having our own escaping protocol,
which the output drivers then interpret.

I know of three places it would make sense to support the output of
UTF-8, and until I encounter a problem I see no reason not to employ
the same solution for all three.

1. We need to be able to put multibyte UTF-8 sequences into PDFs.
That means encoding them in _grout_ as "x X" device commands.  That in
turn means being able to encode them in `\X` escape sequences and
`.device` requests.  `\Y` and `.devicem` may have to wait for the
resolution of bug #40720, since they will require groff to be able to
store UTF-8 code points internally.  Or maybe not, since we already
have \[uXXXX].

2. The `tm`, `tm1`, `tmc`, `ab`, and `rd` requests all write to
standard error.

3. The `cf`, `lf`, `nx`, `open`, `opena`, `psbb`, and `trf` requests
all expect to be able to express (to standard error, in the case of
`lf`) or, importantly, _open_ files by name from the filesystem.
Right now groff doesn't have a story for being able to open
UTF-8-encoded file names that use continuation bytes.

I can think of two approaches to take.

A. Re-use Unicode named glyph notation \[uXXXX] in these contexts.
The advantages are that we don't need to track any kind of shift state
while processing them (see below), the notation will be familiar to
experienced groff users, is already explained in groff_char(7) and our
Texinfo manual, and its purpose is deducible by novices who _haven't_
read the documentation.

B. We could employ a couple of C0 control characters that groff
doesn't already use internally, like STX and ETX (Control+B and
Control+C), to shift in and out of a "verbatim" mode where any bytes
encountered between the shift characters are given as-is to the next
layer of the interface.
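Approach (A) is easy to sketch.  The round trip below (illustrative
Python, not anything groff does today) shows the stateless property
claimed for it: each non-ASCII character becomes a self-delimiting
\[uXXXX] token, and decoding needs no shift state at all.

```python
# Sketch of approach (A): carry non-ASCII code points through a 7-bit
# channel as groff-style \[uXXXX] special character notation.

import re

def encode_uXXXX(s: str) -> str:
    """Replace each non-ASCII character with \\[uXXXX] notation."""
    return "".join(
        c if ord(c) < 0x80 else "\\[u%04X]" % ord(c)
        for c in s
    )

def decode_uXXXX(s: str) -> str:
    """Turn \\[uXXXX] tokens back into the characters they name."""
    return re.sub(r"\\\[u([0-9A-Fa-f]{4,6})\]",
                  lambda m: chr(int(m.group(1), 16)), s)

arg = encode_uXXXX("Cic\u00e9ron")
print(arg)                  # -> Cic\[u00E9]ron  (7-bit clean)
print(decode_uXXXX(arg))    # -> Cicéron
```

An output driver could apply the decoding step to "x X" arguments and
re-encode the result however its device requires.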
(So, they'd appear in grout [case 1], or would end up in arguments to
C library calls: fprintf(stderr, ...) [case 2, and `lf` in case 3],
and fopen() or similar [the rest of case 3].)

(A) has disadvantages.  One is that it's kind of an abuse of the
special character/named glyph notation; the whole point of these is
that they _don't_ become formatted glyphs.  They are merely a way to
encode integers.  Another problem is that there's no obvious way to
adapt this to any encoding _but_ UTF-8.  A wisenheimer can say that
someone can always re-encode the output if they need to, but that
remedy is not available for the file-opening case.  I take a dark view
of advising groff users to write an open()-intercepting wrapper in C
to be used with LD_PRELOAD.

(B) is more flexible--more accommodating of other character
encodings--and seems more like the old-school Unix way to handle the
problem, especially back in the days of sundry character encodings
warring for supremacy, but, at least as I've sketched it, has the
problem that you can't encode the ETX (Control+C) character in the
"verbatim region".  Eventually, that limitation will bite someone.

I'm leaning toward (A) because the importance of all encodings other
than UTF-8 is dwindling.  But there is plenty of time to wrangle over
this and for brilliant new ideas to be pitched, since I see no
prospect of this work being undertaken for the groff 1.23 release.

_______________________________________________________
Reply to this item at:

  <https://savannah.gnu.org/bugs/?63074>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
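To make the (B) objection concrete, here is a sketch of the STX/ETX
framing as described above (again illustrative Python, with
hypothetical helpers), including the failure mode: a payload that
itself contains ETX is silently truncated, because the scheme as
sketched has no way to escape the terminator.

```python
# Sketch of approach (B): bracket verbatim bytes between STX
# (Control+B) and ETX (Control+C) so a downstream layer passes them
# through untouched.

STX, ETX = "\x02", "\x03"

def frame_verbatim(payload: str) -> str:
    """Wrap payload in a verbatim region."""
    return STX + payload + ETX

def extract_verbatim(s: str) -> str:
    """Recover the bytes between the first STX and the next ETX."""
    start = s.index(STX) + 1
    end = s.index(ETX, start)
    return s[start:end]

print(extract_verbatim(frame_verbatim("Cic\u00e9ron")))
# -> Cicéron

# The limitation: an ETX inside the payload ends the region early.
print(extract_verbatim(frame_verbatim("abc" + ETX + "def")))
# -> abc   ("def" is lost)
```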