Re: neatroff for Russian. (Was: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian)

G. Branden Robinson Sun, 30 Apr 2023 06:36:15 -0700

At 2023-04-29T21:38:53-0500, Dave Kemper wrote:
> On 4/29/23, Oliver Corff <oliver.co...@email.de> wrote:
> > Would it be a feasible option to use UTF-8 throughout the inner
> > workings of a future groff,


I'm going to phrase this more confrontationally than it needs to be just
to make a point about software design:

It's none of your business what data type groff uses for characters in
its _inner workings_.

Of course I mean that purely from the software-architectural
perspective.  There is no reason for anyone except groff's developers to
care what primitive data type groff uses for this purpose as long as it
behaves correctly and is performant.  The whole point of encapsulation
is to keep other software modules from having to worry about this sort
of thing.

In another sense, it's totally your business and you can look at the
implementation at any time--it's Free Software.  But other software,
including the parts of groff that are not the formatter (GNU troff),
should keep its dirty nose out, and expect to be excluded through
language-imposed visibility restrictions (or the impermeable wall of the
Unix process structure).

We absolutely want good UTF-8 support at the _edges_ of the system.  We
want to change GNU troff to cheerfully and correctly interpret UTF-8
input.  And we want output drivers that target devices using UTF-8 as a
character encoding to reliably produce it.

But that's all.

> This is the topic of http://savannah.gnu.org/bugs/?40720
[...]
> But in my opinion, the discussion is somewhat academic given the scope
> of the task and the number of current groff developers familiar with
> core parts of the code.

My idea for the initial scope is <cough> small.  I'm not convinced that
the groff string class is sealed as tightly as it should be.  So when I
take a second crack at changing its internal data type (my first was 2
years or so ago), I need to review it carefully.

From what I've seen the main point of interface we're concerned with is
its `contents` member function, which does in fact return a pointer to a
narrow character.

Possibly that needs to be renamed `as_c_string`, and existing uses of
`contents` audited to determine whether they really do need a C string
there or would work just as well dealing with something else.
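
A minimal sketch of what such a sealed interface could look like.  This
is not groff's actual string class: the name `sealed_string`, the
`unit` storage type, and the choice to return a std::string from
`as_c_string` are all assumptions made for illustration.  The point is
only that callers never see the internal element type.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for groff's string class.  All names here are
// illustrative; none of this is groff's real interface.
class sealed_string {
public:
  // Input edge: accept a narrow C string, one element per byte.
  explicit sealed_string(const char *s)
  {
    assert(s != 0);
    for (; *s != '\0'; s++)
      units_.push_back(static_cast<unsigned char>(*s));
  }

  std::size_t length() const { return units_.size(); }

  // Output edge: the only place a narrow representation is exposed.
  // Elements that do not fit in one byte are shown as '?'.
  std::string as_c_string() const
  {
    std::string out;
    for (std::size_t i = 0; i < units_.size(); i++)
      out += (units_[i] <= 0xFF) ? static_cast<char>(units_[i]) : '?';
    return out;
  }

private:
  typedef unsigned long unit;  // assumed storage type, at least 32 bits
  std::vector<unit> units_;    // never handed out to callers
};
```

A caller that truly needs a `const char *`, such as a diagnostic
function, would use `s.as_c_string().c_str()`; every other caller would
be a candidate for the audit described above.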

Our diagnostic message functions (`fatal`, `error`, `warning`, `debug`
and friends) _do_ expect C strings.  I don't see that changing, since
their next stop is the standard error stream.

As part of this I also need to look over the ISO C++98 string class and
see how much sense it makes just to make groff's string class a
basic_string<char32_t>.[1]

A rough sketch of my plan is this:

1.  Ensure that the groff string class is well-encapsulated.
2.  Change the internal type, and change the constructors and output
    functions only, so that they perform the transformation to and from
    this new type.
3.  Verify that nothing broke.  (If I did 1 and 2 correctly, nothing
    will.)
4.  Remap the code points we're squatting on.  Haven't decided yet
    whether to map them to illegal Unicode code points or to the Unicode
    Private Use Area.  With a char32_t we have all the room in the
    world.
5.  Drop code page 1047 support, per recent discussions with Mike Fulton
    of IBM on this list.
6.  Start not merely accepting, but _assuming_ UTF-8 input, because we
    won't misinterpret C1 controls anymore.
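
Steps 2 and 6 above amount to decoding UTF-8 on the way in and encoding
it on the way out.  The following is a rough sketch of that pair of
transformations, not code from groff: the names `code_point`,
`decode_utf8`, and `encode_utf8` are hypothetical, and substituting
U+FFFD for malformed input is an assumed policy.

```cpp
#include <cstddef>
#include <string>
#include <vector>

typedef unsigned long code_point;  // hypothetical; at least 32 bits wide

const code_point replacement_char = 0xFFFDUL;  // assumed error policy

// Decode UTF-8 bytes into code points.  Malformed, overlong, surrogate,
// and out-of-range sequences each yield U+FFFD.
std::vector<code_point> decode_utf8(const std::string &in)
{
  std::vector<code_point> out;
  std::string::size_type i = 0;
  while (i < in.size()) {
    unsigned char b = static_cast<unsigned char>(in[i++]);
    code_point cp, lowest;
    int extra;
    if (b < 0x80) { cp = b; extra = 0; lowest = 0; }
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; lowest = 0x80; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; lowest = 0x800; }
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; lowest = 0x10000UL; }
    else { out.push_back(replacement_char); continue; }  // stray byte
    bool ok = true;
    for (int k = 0; k < extra; k++) {
      if (i >= in.size()) { ok = false; break; }         // truncated
      unsigned char c = static_cast<unsigned char>(in[i]);
      if ((c & 0xC0) != 0x80) { ok = false; break; }     // not a continuation
      cp = (cp << 6) | (c & 0x3F);
      i++;
    }
    if (!ok || cp < lowest || cp > 0x10FFFFUL
        || (cp >= 0xD800UL && cp <= 0xDFFFUL))
      cp = replacement_char;
    out.push_back(cp);
  }
  return out;
}

// Encode code points as UTF-8.  Values that are not Unicode scalar
// values are written as U+FFFD.
std::string encode_utf8(const std::vector<code_point> &in)
{
  std::string out;
  for (std::size_t n = 0; n < in.size(); n++) {
    code_point cp = in[n];
    if (cp > 0x10FFFFUL || (cp >= 0xD800UL && cp <= 0xDFFFUL))
      cp = replacement_char;
    if (cp < 0x80) {
      out += static_cast<char>(cp);
    } else if (cp < 0x800) {
      out += static_cast<char>(0xC0 | (cp >> 6));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000UL) {
      out += static_cast<char>(0xE0 | (cp >> 12));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
      out += static_cast<char>(0xF0 | (cp >> 18));
      out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
  }
  return out;
}
```

Under this scheme a C1 control such as U+0085 arrives as the two bytes
0xC2 0x85, while a lone 0x85 byte is simply malformed input, so the
decoder never has to guess whether a high byte is a control character.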

If that doesn't sound like enough work--at some point in the above, each
and every preprocessor has to be checked to ensure it isn't screwing up
the input before it gets to the formatter.

I don't see getting rid of preconv(1) in the near term.  It will remain
useful, particularly if I add the couple of small features I had in mind
for it.  It may continue to play a role in getting input into the
correct Unicode Normalization Form (D).  It might make sense to leave
that business out of the formatter proper.

Regards,
Branden

[1] std::u32string is C++11, and thus not available according to the
    portability horizon we have.  But we can make our own
    basic_string<char32_t> with C++98 facilities and gnulib's 'inttypes'
    module.  Hooray, templates!  ;-)
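
    One way this could look, as a sketch only: a hand-written traits
    class plugged into std::basic_string.  The names `u32_traits` and
    `u32_string` are hypothetical, and <stdint.h> is assumed to supply
    uint32_t (which a gnulib module can arrange on older systems).

```cpp
#include <cstddef>
#include <cstring>   // std::memcpy, std::memmove
#include <cwchar>    // std::mbstate_t
#include <ios>       // std::streampos, std::streamoff
#include <string>
#include <stdint.h>  // uint32_t; assumed available

// Hypothetical character traits for a 32-bit code unit, written with
// C++98 facilities only.
struct u32_traits {
  typedef uint32_t char_type;
  typedef uint32_t int_type;
  typedef std::streamoff off_type;
  typedef std::streampos pos_type;
  typedef std::mbstate_t state_type;

  static void assign(char_type &dst, const char_type &src) { dst = src; }
  static bool eq(char_type a, char_type b) { return a == b; }
  static bool lt(char_type a, char_type b) { return a < b; }
  static int compare(const char_type *a, const char_type *b, std::size_t n)
  {
    for (std::size_t i = 0; i < n; i++) {
      if (a[i] < b[i]) return -1;
      if (b[i] < a[i]) return 1;
    }
    return 0;
  }
  static std::size_t length(const char_type *s)
  {
    std::size_t n = 0;
    while (s[n] != 0) n++;
    return n;
  }
  static const char_type *find(const char_type *s, std::size_t n,
                               const char_type &c)
  {
    for (std::size_t i = 0; i < n; i++)
      if (s[i] == c) return s + i;
    return 0;
  }
  static char_type *move(char_type *dst, const char_type *src, std::size_t n)
  {
    if (n != 0) std::memmove(dst, src, n * sizeof(char_type));
    return dst;
  }
  static char_type *copy(char_type *dst, const char_type *src, std::size_t n)
  {
    if (n != 0) std::memcpy(dst, src, n * sizeof(char_type));
    return dst;
  }
  static char_type *assign(char_type *s, std::size_t n, char_type c)
  {
    for (std::size_t i = 0; i < n; i++) s[i] = c;
    return s;
  }
  // 0xFFFFFFFF is not a Unicode code point, so it can serve as EOF.
  static int_type eof() { return 0xFFFFFFFFu; }
  static int_type not_eof(int_type c) { return c == eof() ? 0 : c; }
  static char_type to_char_type(int_type c) { return c; }
  static int_type to_int_type(char_type c) { return c; }
  static bool eq_int_type(int_type a, int_type b) { return a == b; }
};

typedef std::basic_string<uint32_t, u32_traits> u32_string;
```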
