Follow-up Comment #6, bug #40720 (group groff):

[comment #5:]
> Internally, groff can already encode Unicode input specified in \[uXXXX]
> format.  So to handle UTF-8 input natively, while reading input groff could
> convert any UTF-8 characters with the 8th bit set into whatever the current
> internal storage encoding is.  This would seem to localize the changes
> needed, rather than requiring altering data types throughout the code base.

Incidentally, this is exactly what mandoc(1) has been doing since 2014, i.e.
for more than ten years now.  In mandoc(1), the internal format is "char *",
optionally containing \[uXXXX] escape sequences, just like in groff.  The only
difference is that, IIRC, groff interprets these strings as ISO-LATIN-1
internally, whereas mandoc(1) definitely interprets them as US-ASCII and
converts even ISO-LATIN-1 to \[u00XX] on input.
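The conversion described above can be sketched roughly as follows.  This is
my own illustration, not mandoc(1)'s actual code: the function name is made
up, and the only behavior it models is "bytes below 0x80 pass through,
everything else becomes a \[uXXXX] escape":

```python
def to_roff_escapes(data: bytes) -> str:
    """Hypothetical sketch of UTF-8 -> US-ASCII-plus-\\[uXXXX] conversion.

    Plain US-ASCII characters pass through unchanged; every other code
    point is rewritten as a groff-style \\[uXXXX] escape sequence.
    """
    out = []
    for ch in data.decode("utf-8"):
        if ord(ch) < 0x80:
            out.append(ch)                      # US-ASCII: keep as-is
        else:
            out.append("\\[u%04X]" % ord(ch))   # e.g. U+00E9 -> \[u00E9]
    return "".join(out)
```

Note that under this scheme even an ISO-LATIN-1 character like U+00E9
(e-acute) comes out as \[u00E9], which matches the mandoc(1) behavior
described above.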

> But the above paragraphs seem to, effectively, entail integrating preconv
> into the part of groff that reads input.

Again, that is exactly what mandoc(1) has been doing for more than ten years.

> And preconv is unique among groff's preprocessors in that its output is
> almost never of interest to humans.

Yes.  I think the main reason for inventing preconv(1) in the first place may
have been that ROFF was already a system relying heavily on pipelines, so a
stand-alone filter program may have seemed natural back in the day.  Making
all of groff monolithic in the same way that mandoc(1) is would probably not
be a good idea, but I'm not sure this particular task is best served by a
filter program.



    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?40720>
