Follow-up Comment #5, bug #40720 (group groff):

[comment #4 comment #4:]
> how groff currently handles wide characters - support wide
> characters both on the input and output side while keeping the
> code simple by mostly using plain char[] strings internally - is
> actually one good way for keeping wide character support
> simple in some circumstances.
A good point. Internally, groff can already encode Unicode input specified in
\[uXXXX] format. So to handle UTF-8 input natively, groff could, while reading
input, convert any UTF-8 sequence (that is, any run of bytes with the 8th bit
set) into whatever the current internal storage encoding is. This would seem
to localize the changes needed, rather than requiring changes to data types
throughout the code base.
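To make the idea concrete, here is a rough sketch of that conversion step in Python. It is only an illustration, not groff's code (groff is C++). The function name utf8_to_groff_escapes is made up for this example, and I am assuming the target form is the \[uXXXX] escape with at least four uppercase hex digits. ASCII bytes pass through untouched; every other code point becomes an escape. Invalid UTF-8 simply raises an error here, and composite glyph names are not handled.

```python
def utf8_to_groff_escapes(data: bytes) -> str:
    """Hypothetical helper: rewrite non-ASCII UTF-8 input as \\[uXXXX] escapes.

    Assumption: escapes use at least four uppercase hex digits, as in the
    \\[uXXXX] form mentioned above. Invalid UTF-8 raises UnicodeDecodeError.
    """
    out = []
    for ch in data.decode("utf-8"):
        cp = ord(ch)
        if cp < 0x80:
            # Plain ASCII: no 8th bit set, so pass it through unchanged.
            out.append(ch)
        else:
            # Anything else came from a multi-byte UTF-8 sequence.
            out.append("\\[u%04X]" % cp)
    return "".join(out)


if __name__ == "__main__":
    print(utf8_to_groff_escapes("café".encode("utf-8")))
```

The point of the sketch is that the conversion happens in one place, at the input boundary, so everything downstream keeps seeing plain 7-bit text.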
> the existing preconv(1) approach and its simplicity and
> modularity has striking similarities to what is discussed here,
> and likely is a good approach,
I wrote about a drawback of preconv itself in bug #58796 (comment 3). But the
above paragraphs seem, in effect, to entail integrating preconv into the
part of groff that reads input. This seems more robust than keeping it as a
standalone utility, which leads to problems like bug #59442.
And preconv is unique among groff's preprocessors in that its output is almost
never of interest to humans. Looking at the groff code emitted by tbl, pic,
et al., can be instructive. Looking at preconv-ed UTF-8 text is rarely
preferable to looking at the original UTF-8.
_______________________________________________________
Reply to this item at:
<https://savannah.gnu.org/bugs/?40720>
_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
