Re: Groff UTF-8 support? - Groff documentation section 5.1.9 Input Encodings

ropers Thu, 07 Mar 2024 17:15:32 -0800

On 07/03/2024, Dave Kemper wrote:
> Hi Ian, thanks for your attention to the groff manual!

Thank you very much, Dave, for your helpful and informative replies. :-)

> On 3/7/24, ropers <rop...@gmail.com> wrote:
>> "latin1" sounds awfully ISO-8859-1ish, and (I fear) not very much like
>> the Latin-1 Supplement Unicode block
>
> Correct.  Since there are two different things that include "Latin-1"
> in their name, perhaps this wording could be more explicit.  On the
> other hand, the context is input encodings, and a Unicode block is not
> itself an input encoding.

It might be preferable to demine rather than rely on contextual hints
as to the presence of UXO:

$ diff -u groff.texi.orig groff.texi
--- groff.texi.orig     2024-03-05 18:20:59.940460376 +0000
+++ groff.texi  2024-03-08 00:21:12.782360544 +0000
@@ -5509,9 +5509,10 @@
 @cindex ISO @w{8859-1} (@w{Latin-1}), input encoding
 @cindex input encoding, @w{Latin-1} (ISO @w{8859-1})
 @pindex latin1.tmac
-ISO @w{Latin-1}, an encoding for Western European languages, is the
-default input encoding on non-@acronym{EBCDIC} platforms; the file
-@file{latin1.tmac} is loaded at startup.
+ISO 8859-1, aka @w{Latin-1}, an extended ASCII encoding chiefly for
+Western European languages, is still @code{groff}'s default input encoding on
+non-@acronym{EBCDIC} platforms; the file @file{latin1.tmac} is loaded
+at startup.
 @end table

 @noindent
@@ -5533,9 +5534,9 @@
 @cindex ISO @w{8859-2} (@w{Latin-2}), input encoding
 @cindex input encoding, @w{Latin-2} (ISO @w{8859-2})
 @pindex latin2.tmac
-To use ISO @w{Latin-2}, an encoding for Central and Eastern European
-languages, invoke @w{@samp{.mso latin2.tmac}} at the beginning of your
-document or supply @samp{-mlatin2} as a command-line argument to
+To use ISO 8859-2, aka @w{Latin-2}, an encoding for Central and Eastern
+European languages, invoke @w{@samp{.mso latin2.tmac}} at the beginning of
+your document or supply @samp{-mlatin2} as a command-line argument to
 @code{groff}.

 @item latin5
@@ -5544,8 +5545,8 @@
 @cindex ISO @w{8859-9} (@w{Latin-5}), input encoding
 @cindex input encoding, @w{Latin-5} (ISO @w{8859-9})
 @pindex latin5.tmac
-To use ISO @w{Latin-5}, an encoding for the Turkish language, invoke
-@w{@samp{.mso latin5.tmac}} at the beginning of your document or
+To use ISO 8859-9, aka @w{Latin-5}, an encoding for the Turkish language,
+invoke @w{@samp{.mso latin5.tmac}} at the beginning of your document or
 supply @samp{-mlatin5} as a command-line argument to @code{groff}.

 @item latin9
@@ -5554,9 +5555,9 @@
 @cindex ISO @w{8859-15} (@w{Latin-9}), input encoding
 @cindex input encoding, @w{Latin-9} (ISO @w{8859-15})
 @pindex latin9.tmac
-ISO @w{Latin-9} succeeds @w{Latin-1}; it includes a Euro sign and better
-glyph coverage for French.  To use this encoding, invoke @w{@samp{.mso
-latin9.tmac}} at the beginning of your document or supply
+ISO 8859-15, aka @w{Latin-9}, succeeds @w{Latin-1}; it includes a Euro sign
+and better glyph coverage for French.  To use this encoding, invoke
+@w{@samp{.mso latin9.tmac}} at the beginning of your document or supply
 @samp{-mlatin9} as a command-line argument to @code{groff}.
 @end table
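
For anyone wanting to double-check the Latin-N versus ISO 8859 part
numbers in the patch above, here is a small Python sketch.  It relies
only on Python's own codec aliases (latin1, latin2, latin5, latin9),
which I am assuming line up with the names of groff's macro files; it
says nothing about how groff itself maps bytes to glyphs.  The names
ALIASES and same_codec are made up for illustration.

```python
# Pairs of "Latin-N" alias and the ISO 8859 part it is supposed to name.
ALIASES = {
    "latin1": "iso8859_1",
    "latin2": "iso8859_2",
    "latin5": "iso8859_9",   # Latin-5 is part 9, not part 5
    "latin9": "iso8859_15",  # Latin-9 is part 15, not part 9
}


def same_codec(alias, iso_name):
    """Return True if both codec names decode all 256 byte values alike."""
    all_bytes = bytes(range(256))
    return (all_bytes.decode(alias, errors="replace")
            == all_bytes.decode(iso_name, errors="replace"))


for alias, iso_name in ALIASES.items():
    print(alias, iso_name, same_codec(alias, iso_name))

# Byte 0xA4 is one place where Latin-1 and Latin-9 differ:
# currency sign in the former, euro sign in the latter.
for name in ("latin1", "latin9"):
    ch = b"\xa4".decode(name)
    print(name, f"0xA4 -> U+{ord(ch):04X}")
```

Swapping in "iso8859_5" for latin5 makes same_codec return False,
since part 5 is the Cyrillic one.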

Attention!
I have not actually previewed this!
Truth be told, info(1) is Greek to me.  I've tried
$ info groff.texi #, which made it say "Cannot find node 'Top'." at
the bottom (pun intended?), and then I couldn't figure out how to
actually view the groff info manual.  Not that I've tried much, but
still.
IMNSHO it is incredibly ironic, and--if one could hurt a program's
feelings--almost insulting for groff's manual to be maintained in info
format.  Not exactly dogfooding, no?  At the peril of slighting the
local champion, my opinions on info(1) reduce to <xkcd.com/912>, and I
suspect
$ info mcas
is a synonym for
$ kill -9 346 #,
and in light of his prescience, I remain unconvinced *Primer* wasn't
based on the exploits of one Randall Munroe + colleague.

>> which makes me wonder if Current Year's
>> groff/troff itself (absent pre-piped converters) can at all handle
>> multi-byte character sets in general, or UTF-8 in particular.
>
> It cannot.  This is a longstanding wishlist item: "improving Unicode
> support" was put into the Groff Mission Statement when it was drafted
> 10 years ago.  Ten years before that, groff's then-maintainer posted
> to this list: "Volunteers are highly welcome to extend groff from 8bit
> to 32bit input characters"
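
To make concrete what "8bit input" means for UTF-8 text, here is a
rough Python illustration.  It is not groff code and makes no claim
about groff's internals; it merely assumes a reader that treats every
input byte as one Latin-1 character.  The function name
as_eight_bit_reader is hypothetical.

```python
def as_eight_bit_reader(text):
    """Encode text as UTF-8, then reinterpret each byte as Latin-1."""
    return text.encode("utf-8").decode("latin1")


# Two characters: e-acute (U+00E9) and Cyrillic Zhe (U+0416).
original = "\u00e9\u0416"
misread = as_eight_bit_reader(original)

print(len(original), "characters in,", len(misread), "characters seen")
print([f"U+{ord(c):04X}" for c in misread])
```

Each two-byte UTF-8 sequence shows up as two unrelated Latin-1
characters, which is why a converter has to sit in front of a
byte-oriented reader.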

Based on my admittedly not quite unlimited insight into Unicode
issues, if taken literally, a call "to extend groff from 8bit to
32bit input characters" strikes me as an already outmoded if not
stillborn strategy.  It might be much better to go all-in on
variable-width encoding, read: UTF-8, just like everybody else.
Whatever limited *strictly internal* use there may still be for UTF-32
in some buffers, structs or variables, anything not UTF-8 is probably
best kept to a minimum.
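
A quick Python sketch of the size trade-off I have in mind, using
only the standard str.encode method.  The sample characters and the
helper name widths are my own picks for illustration, nothing more.

```python
# Sample code points: ASCII, Latin-1 range, Cyrillic, katakana, emoji.
SAMPLES = ["A", "\u00e9", "\u0416", "\u30ab", "\U0001F600"]


def widths(ch):
    """Return (UTF-8 byte count, UTF-32 byte count) for one character."""
    # "utf-32-be" is used so that no byte order mark is prepended.
    return len(ch.encode("utf-8")), len(ch.encode("utf-32-be"))


for ch in SAMPLES:
    u8, u32 = widths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {u8} byte(s), UTF-32 {u32} bytes")
```

UTF-32 always spends four bytes per code point, whereas UTF-8 spends
one to four and leaves plain ASCII input byte-for-byte unchanged.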

But perhaps I'm barking at shadows here.  Nothing in this
<https://lists.gnu.org/r/groff/2004-05/msg00074.html> is smoking-gun
evidence that would compel a jury of me, myself and I to conclude
Werner et al. WEREN'T aware of that already, or if not then, then
certainly now.

> (http://lists.gnu.org/r/groff/2004-05/msg00026.html).
>
> But this is a monumental task, and one groff developer has written of
> some of its difficulties
> (http://savannah.gnu.org/bugs/?40720#comment4).

I was a few paragraphs into that before I realised the author of the
above comment is Ingo Schwarze, an OpenBSD dev I've previously talked
to, and whose judgement on this I trust A LOT.

> In short, it's not for lack of desire that groff lacks this feature.
>
> With any luck, you'll follow the Branden Track, where you start off by
> poking a little at groff's documentation and are soon hacking away at
> the code base.  You might be the volunteer Werner asked for 20 years
> ago ;-)

Not to be a negative Nancy, but just to be straight with you and set
expectations:  Probably not.  Even if I, at long last, might yet prove
competent enough to make a significant contribution in code to the
open source community, I am less likely to make that to a GNU GPL
project -- I'm more of a BSD (ISC/OpenBSD) fan.  Of course, to my
understanding it's not BSD licenses that are incompatible with GPL
ones, so any contribution could still reach you regardless of
philosophical differences if not legalistic bikeshedding.

I really only dove into the groff manual thanks to an observed
(kernel.org) ascii(7) man page bug I only have a partial fix for,
which is why I'm still reading, all of which I'll possibly talk about
at a later date.

>> Also, this sounds a lot like Current Year's groff(1) even WITH
>> pipe-connected UTF-8 converters/drivers (which may be what's referred
>> to at the bottom of that section) couldn't actually support anything
>> like, say, Cyrillic or katakana or whatever,
>
> Groff added Cyrillic support last year
> (http://savannah.gnu.org/bugs/?63076).  It includes some CJK support
> but expanding this is an ongoing project
> (http://savannah.gnu.org/bugs/?62830).  If you have expertise in this
> realm and can address some of the outstanding questions in that
> ticket, please chime in.

I'm not totally ignorant of UTF-8 in particular, but depending on your
expectations, I'm possibly also not so hugely competent for the former
to be a massively modest understatement.

I will say that if anyone following along at home is struggling to get
their head around UTF-8, this post by Graham Douglas might be an
excellent starting point:
<http://www.readytext.co.uk/?p=1284>
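
For the impatient, the bit-shuffling behind UTF-8 fits in a few lines
of Python.  This is my own minimal sketch of the standard encoding
scheme, not code taken from that post or from groff, and utf8_encode
is a made-up name; in real code one would simply call str.encode.

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8 bytes (illustrative only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare the hand-rolled encoder with Python's built-in one.
for cp in (0x41, 0xE9, 0x416, 0x20AC, 0x30AB, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(),
          utf8_encode(cp) == chr(cp).encode("utf-8"))
```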

Thanks and regards,
Ian

(Ian Ropers)
