Follow-up Comment #15, bug #40720 (group groff):

[comment #14 comment #14:]
> My comment #12 meant, perhaps too ambiguously, to refer to my remark
> in comment #5, "the above paragraphs seem to, effectively, entail
> integrating preconv into the part of groff that reads input," and
> Ingo's comment #6 response, "that is exactly what mandoc(1) has been
> doing for more than ten years."
> 
> Whether this conceptual plan has survived contact with the enemy, I
> don't know the code well enough to say.  Comment #9 implies perhaps
> not, but its code dump is over my head.  Ingo knows the code a lot
> better than I do, so I leave it to him to debate on the particulars.

Okay, let me try to address these comments specifically.

[comment #5 comment #5:]
> [comment #4 comment #4:]
>> how groff currently handles wide characters - support wide
>> characters both on the input and output side while keeping the
>> code simple by mostly using plain char[] strings internally - is
>> actually one good way for keeping wide character support
>> simple in some circumstances.
> 
> A good point.  Internally, groff can already encode Unicode input
> specified in \[uXXXX] format.

Yes, true.

> So to handle UTF-8 input natively, while reading input groff could
> convert any UTF-8 characters with the 8th bit set into whatever the
> current internal storage encoding is.

Well, the current internal storage encoding is `unsigned char`.  Your
proposal means, perhaps among other complications:

1.  Dealing with variable-length character sequences at every point
    internally where we want to iterate over "ordinary characters".

2.  Having to think damn hard about what the `length` and `chop`
    requests mean when an "ordinary" character is not only
    distinguishable from special and indexed characters, but may itself
    occupy from one to six bytes (see the sketch below).

3.  The NFD blues.  See below.

Those sound to me like acts of masochism.  I'm happy to award
_mandoc_(1) masculinity points for bulling through those difficulties,
if in fact that's what it's done.
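
To make complication 2 concrete, here is a throwaway sketch--nothing to
do with the actual groff sources, and the function names are
invented--of how a byte count and a character count diverge once UTF-8
sits in plain `char` storage, and of how a naïve `chop` tears a
sequence apart.

// Hypothetical illustration only; not groff code.
#include <cstddef>
#include <iostream>
#include <string>

// What a byte-oriented `length` would report.
static std::size_t byte_length(const std::string &s)
{
  return s.size();
}

// What users presumably expect `length` to mean: one count per UTF-8
// sequence, i.e., skip continuation bytes.
static std::size_t utf8_length(const std::string &s)
{
  std::size_t n = 0;
  for (unsigned char c : s)
    if ((c & 0xC0) != 0x80)
      n++;
  return n;
}

int main()
{
  std::string cafe = "caf\xC3\xA9";        // "café"; é occupies two bytes
  std::cout << byte_length(cafe) << '\n';  // 5
  std::cout << utf8_length(cafe) << '\n';  // 4
  // A byte-oriented `chop` would strip only the 0xA9, leaving a lone
  // lead byte 0xC3--an invalid sequence--at the end of the string.
  cafe.pop_back();
  return 0;
}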

I'd prefer to just have a 32-bit internal data type for ordinary
characters, do UTF-8 sequence validation at most once per input
character, and never have to reëncode such a code point back to UTF-8,
because "grout" is an ISO 646/"ASCII" text file.[1]

> This would seem to localize the changes needed, rather than requiring
> altering data types throughout the code base.

I don't think _mandoc_(1) attempts to support as many fancy features as
we do.

Its _roff_(7) says:


     char glyph [string]
             Define or redefine the ASCII character or character escape
             sequence glyph to be rendered as string, which can be
             empty.  Only partially supported in mandoc(1); may interact
             incorrectly with tr.
...
     chop stringname
             Remove the last character from a macro, string, or
             diversion.  Currently unsupported.
...
     length register string
             Count the number of input characters in a string.
             Currently unsupported.


In sum, both approaches promise to be significant work, but migrating
to a 32-bit internal encoding also gives us practically unlimited room
for extending the variety of tokens available (because we can employ
the Private Use Area or even code points outside the valid Unicode
range), which in turn will make it easier to do reversible
"asciifications" (see bug #67744; if we can distinguish `\0`, `\|`, and
`\^` from `\h` as well as from each other, we give more power to
advanced users of GNU _troff_...though they're going to want the damn
string iterator I've been talking about for years, too).  We could also
address something for which our Texinfo manual currently compares us
unfavorably to TeX: interning of macros.  With the improved `pm` macro
dumper under my belt, I now perceive the failure to specially tokenize
the escape and control characters as the only barrier to this ability.
(Maybe Werner knows of others I haven't thought of.)
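
To sketch the token-headroom point--the names and values below are
invented for illustration, not a proposal for specific assignments--a
32-bit internal character type lets escapes like `\0`, `\|`, and `\^`
each claim a code point above the Unicode ceiling, so a reversible
asciification pass can map them straight back to what the author typed.

// Illustration only; names and values are made up.
#include <cstdint>
#include <map>
#include <string>

typedef std::uint32_t internal_char;

// Unicode stops at U+10FFFF; everything above it is ours to assign.
const internal_char TOKEN_BASE          = 0x110000;
const internal_char TOKEN_DIGIT_SPACE   = TOKEN_BASE + 0;  // \0
const internal_char TOKEN_SIXTH_SPACE   = TOKEN_BASE + 1;  // \|
const internal_char TOKEN_TWELFTH_SPACE = TOKEN_BASE + 2;  // \^

// Because each escape survives as a distinct token rather than a
// pre-resolved horizontal motion, it can be written back out exactly
// as the document author typed it.
static const std::map<internal_char, std::string> asciification = {
  { TOKEN_DIGIT_SPACE,   "\\0" },
  { TOKEN_SIXTH_SPACE,   "\\|" },
  { TOKEN_TWELFTH_SPACE, "\\^" },
};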

Also, getting ourselves the hell out of the C0 and C1 control parts of
the encoding spaces (except for those code points for which we already
document explicit support) would be a virtuous thing to do.

>> the existing preconv(1) approach and its simplicity and modularity
>> has striking similarities to what is discussed here, and likely is a
>> good approach,
> 
> I wrote about a drawback of preconv itself in bug #58796 (comment 3).

I'm unable to locate a statement of a drawback in that comment.
However, I will observe that while GNU _troff_ **should** encode an
input 'é' as the composite special character '\[e aa]' today, it does
not, because its precomposed form shows up in the Latin-1 supplement,
and GNU _troff_ thinks Latin-1 is special and shouldn't have to follow
the same rules as other encodings.  This is our old nemesis, where we
claim we want Unicode Normalization Form D but then make an exception
to that rule for every single applicable Latin-1 supplement code point.

_groff_char_(7):

     Unicode code points can be composed as well; when they are, GNU
     troff requires NFD (Normalization Form D), where all Unicode glyphs
     are maximally decomposed.  (Exception: precomposed characters in
     the Latin‐1 supplement described above are also accepted.  Do not
     count on this exception remaining in a future GNU troff that
     accepts UTF‐8 input directly.)  Thus, GNU troff accepts
     “caf\['e]”, “caf\[e aa]”, and “caf\[u0065_0301]”, as ways to
     input “café”.  (Due to its legacy 8‐bit encoding compatibility,
     at present it also accepts “caf\[u00E9]” on ISO Latin‐1 systems.)



$ printf 'é\n.pline\n' | ~/groff-HEAD/bin/groff -kz 2>&1 | jq
[
  {
    "type": "output line start node",
    "diversion level": 0,
    "is_special_node": false
  },
  {
    "type": "glyph node",
    "diversion level": 0,
    "is_special_node": false,
    "special character": "'e"
  },
  {
    "type": "word space node",
    "diversion level": 0,
    "is_special_node": false,
    "hunits": 2500,
    "undiscardable": false,
    "is hyphenless breakpoint": false,
    "terminal_color": "default",
    "width_list": [
      {
        "width": 2500,
        "sentence_width": 2500
      }
    ],
    "unformat": false
  }
]


Grody.

> But the above paragraphs seem to, effectively, entail integrating
> preconv into the part of groff that reads input.

I would not put it that way, but if I'm understanding you, yes.  NFD
decomposition is another thing we can incorporate into the "one and
done" per-character input processing I spoke of above.  In the future,
when we read a valid UTF-8 sequence, we should NFD decompose it if
necessary and then store it as 1..n 32-bit Unicode code points (however
many are necessary to represent the base character and the stacked
diacritics).
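
A toy sketch of that read path--the decomposition table here is a
single hard-coded entry, where a real implementation would consult the
Unicode decomposition data and handle canonical reordering--just to
show the shape of "decode once, decompose, store 1..n code points":

// Toy sketch; real NFD needs the Unicode tables, not a one-entry map.
#include <cstdint>
#include <map>
#include <vector>

typedef std::uint32_t code_point;

static const std::map<code_point, std::vector<code_point> > nfd_map = {
  { 0x00E9, { 0x0065, 0x0301 } },  // é -> e + COMBINING ACUTE ACCENT
};

// Append the NFD form of cp to the internal character stream.
static void store_decomposed(code_point cp, std::vector<code_point> &out)
{
  auto it = nfd_map.find(cp);
  if (it == nfd_map.end())
    out.push_back(cp);             // no decomposition known; store as-is
  else
    out.insert(out.end(), it->second.begin(), it->second.end());
}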

I don't even want to think about the difficulties of defining what a
string/macro length, or even one code point unit (for `chop`ping) is,
when "just representing characters as UTF-8 internally".  Given the
problems I've adumbrated above, it sounds to me like Hell on wheels.

My principle is: get to a place that is well-defined and then never,
ever leave.

> This seems more robust than keeping it as a standalone utility, which
> brings up problems like bug #59442.

I don't expect to get rid of _preconv_ for quite a while.  For one
thing, we'll need to interoperate with our own historical releases for a
period of some years.  For another, if I actually *fix* #58796 (or
someone else does--hope springs eternal), it'll be handy for
interoperation with at least some other _troff_s, particularly DWB 3.3
and System V, which haven't changed much since the 1990s.

Another advantage to _preconv_ that would seem to pay dividends into the
indefinite future would be its _opportunistic_ (at build time) support
for _uchardet_ and _iconv_.  If Ingo builds _groff_ without these
optional dependencies, then, yeah, _preconv_'s capability set looks
pretty lean.  But support for oodles and oodles of alternative character
encodings?  Heuristic guessing thereof from document content?  Those are
things I'm damned happy to leave as other people's problems while still
offering our users an easy way to get at their solutions (`groff -k`).

> And preconv is unique among groff's preprocessors in that its output
> is almost never of interest to humans.  Looking at the groff code
> emitted by tbl, pic, et al., can be instructive.  Looking at
> preconv-ed UTF-8 text is rarely preferable to looking at the original
> UTF-8.

Fully agreed; that doesn't mean it won't be of some use in special
cases, enumerated (partially?) above.  In that regard it can join
_groff_ tools like _afmtodit_ and _indxbib_ in a remote, lightly
trafficked corner of our department store.

Regards,
Branden

[1] Deri has expressed something approaching horror at my ambition to
    keep _grout_ ASCII.  I haven't wavered from that conviction, but
    something that _would_ be handy would be to suffix the 'C' commands
    that direct the typesetting of special characters, when a special
    character's identifier _in grout_ is of the form 'uXXXX', with a
    _grout_ comment consisting of the code point transformed to UTF-8.


$ printf '₥\n' | ~/groff-HEAD/bin/groff -kZ -T utf8 | grep C
Cu20A5


So, notionally:

$ printf '₥\n' | ~/groff-FUTURE/bin/groff -kZ -T utf8 | grep C
Cu20A5 # ₥


    That's the only circumstance under which I feel totally copasetic
    about GNU troff _producing_ UTF-8.  It demands nothing of output
    drivers, since they should be ignoring everything after a comment
    character until the next newline.
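
    The UTF-8 that would follow the comment character is cheap to
    produce; a sketch (a hypothetical helper, not an existing grout
    routine):

// Hypothetical helper; not an existing grout routine.
#include <cstdint>
#include <string>

// Encode a Unicode code point as UTF-8, e.g. for appending to a 'C'
// command as a grout comment: "Cu20A5 # " + to_utf8(0x20A5).
static std::string to_utf8(std::uint32_t cp)
{
  std::string s;
  if (cp < 0x80)
    s += char(cp);
  else if (cp < 0x800) {
    s += char(0xC0 | (cp >> 6));
    s += char(0x80 | (cp & 0x3F));
  }
  else if (cp < 0x10000) {
    s += char(0xE0 | (cp >> 12));
    s += char(0x80 | ((cp >> 6) & 0x3F));
    s += char(0x80 | (cp & 0x3F));
  }
  else {
    s += char(0xF0 | (cp >> 18));
    s += char(0x80 | ((cp >> 12) & 0x3F));
    s += char(0x80 | ((cp >> 6) & 0x3F));
    s += char(0x80 | (cp & 0x3F));
  }
  return s;
}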

    GNU _troff_ itself struggles with this; you _should_ be able to put
    C1 controls at least inside *roff comments without the formatter
    getting up in your business, but that's not presently the case.


$ ~/groff-1.23.0/bin/groff -w input -m an -z
./build/src/devices/grohtml/grohtml.1
troff:./build/src/devices/grohtml/grohtml.1:210: warning: invalid input
character code 154
$ sed -n '210p' ./build/src/devices/grohtml/grohtml.1 
.\" XXX: Exception: ˚.  Why?


    I can make resolving this a goal for 1.24 if people think it's
    important.  (It shouldn't be hard.)   Since we won't need
    "transparent" passage of C0 controls that are invalid as input for
    compatibility with UTF-8 in the future, I propose to let GNU _troff_
    keep complaining about them, even in comments.


    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?40720>
