Re: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian

G. Branden Robinson Tue, 25 Apr 2023 09:51:44 -0700

Hi Oliver,

At 2023-04-25T16:25:49+0200, Oliver Corff wrote:
> In the meantime, I had a look at that Russian hyphenation file, and to
> my relief, the structure of the groff hyphenation pattern files is
> that of TeX hyphenation pattern files, which I have worked on before.


Yup.  They were born that way.

> But... the hyphenation file hyphen.ru in the aforementioned source is
> not usable in the current set-up because the Russian syllable
> fragments are encoded in KOI-8, an 8 bit encoding based on a GOST
> Standard of the USSR.
> 
> So, it does not match the internal code representation of Unicode code
> points.

No, it doesn't.  But some of the other hyphenation pattern files don't,
either; if you look you will see that they're encoded variously in ISO
646, ISO 8859-1, ISO 8859-2, and ISO 8859-15.

This is because groff's hyphenation pattern file parser doesn't
understand UTF-8.

That would be a nice thing to have.

hyphen.ru does a very sneaky thing that I did not think was possible
before Nikita Ivanov dropped it on our doorstep and I took a closer look
at the KOI8-R encoding.

You might know that code points in the "C1 Controls" block of Unicode
(U+0080..U+009F) are invalid input characters to groff.  groff uses them
for internal, bespoke purposes.[1]  This is a barrier to making groff
support UTF-8 input directly, as noted in our documentation.[2][3]

But an interesting property of KOI8-R is that none of the glyphs it
heaps up in the C1 region are alphabetic.

Therefore they don't require hyphenation.

Therefore the Russian hyphenation patterns, using KOI8-R, can masquerade
effectively as an ISO 8859 encoding.
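The two claims above can be checked mechanically. The following is a minimal sketch, not part of groff; it assumes Python's built-in "koi8_r" codec is a faithful KOI8-R table, and the function names are made up for illustration. It confirms that KOI8-R puts no letters in 0x80..0x9F and that every Russian letter encodes to a byte outside that range.

```python
# Bytes 0x80..0x9F are the values groff reserves for internal use.
C1_RANGE = range(0x80, 0xA0)

RUSSIAN_LOWER = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
RUSSIAN_ALPHABET = RUSSIAN_LOWER + RUSSIAN_LOWER.upper()


def koi8r_c1_letters():
    """Return the alphabetic characters KOI8-R assigns to 0x80..0x9F."""
    decoded = bytes(C1_RANGE).decode("koi8_r")
    return [ch for ch in decoded if ch.isalpha()]


def russian_bytes_in_c1():
    """Return KOI8-R bytes of Russian letters that fall in 0x80..0x9F."""
    encoded = RUSSIAN_ALPHABET.encode("koi8_r")
    return [b for b in encoded if b in C1_RANGE]


if __name__ == "__main__":
    print("letters in the C1 region:", koi8r_c1_letters())
    print("Russian letters encoded into C1:", russian_bytes_in_c1())
```

Both lists should come back empty, which is what lets the patterns avoid groff's reserved byte values.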

This is the same deal that lets us support ISO 8859-{2,15} in our
hyphenation patterns.  GNU troff doesn't actually care what these code
points "are", it only needs to know their values to make hyphenation
decisions.  The intelligibility of the hyphenation patterns to a human
reader is determined by the character encoding, but within the range
U+00A0..U+00FF (actually more than that: U+0021..U+007F as well), groff
has no dog in the semantic interpretation fight.
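To illustrate the point that only the byte value matters, here is a small sketch (again plain Python, nothing from groff; the helper name is hypothetical) that shows one byte under several encodings. The byte is identical in each case; only the human reading of it changes.

```python
def interpretations(byte_value, encodings=("koi8_r", "iso8859_1", "iso8859_2")):
    """Decode a single byte value under each of the given encodings."""
    raw = bytes([byte_value])
    return {enc: raw.decode(enc) for enc in encodings}


if __name__ == "__main__":
    # 0xC1 is Cyrillic "а" in KOI8-R but "Á" in ISO 8859-1 and 8859-2.
    for encoding, char in interpretations(0xC1).items():
        print(f"0xC1 as {encoding}: {char}")
```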

> Since groff internally seems to work with Unicode code positions, the
> question is: in which format should the hyphenation patterns be
> presented to groff? As-is, that is as utf8 text, or in \[u04xx] form?
> That does not seem to work either, according to my last experiment.

For now, neither; the KOI8-R cheat seems to work fine, as far as I can
tell or understand.  Admittedly, I'm not a Russian speaker.  But I
believe the contributor is.
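If you have patterns in UTF-8 and want to try the same cheat, a transcoding step along these lines might do. This is a sketch under stated assumptions: the function name is hypothetical, the sample pattern is invented rather than taken from hyphen.ru, and it treats the pattern text as an opaque string without parsing the file's structure.

```python
def utf8_patterns_to_koi8r(utf8_bytes):
    """Transcode UTF-8 pattern text to KOI8-R bytes.

    Raises UnicodeEncodeError if a character has no KOI8-R encoding, and
    ValueError if any resulting byte lands in 0x80..0x9F, which groff
    rejects as input.
    """
    text = utf8_bytes.decode("utf-8")
    koi8 = text.encode("koi8_r")
    reserved = sorted({b for b in koi8 if 0x80 <= b <= 0x9F})
    if reserved:
        raise ValueError(f"bytes in groff's reserved range: {reserved}")
    return koi8


if __name__ == "__main__":
    sample = ".аб1в".encode("utf-8")  # invented pattern, for illustration only
    print(utf8_patterns_to_koi8r(sample))
```

The range check is there because a stray box-drawing or mathematical character in the source would otherwise transcode silently into a byte groff refuses to read.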

Eventually, we will need a way for our hyphenation pattern file reader
function[6] to interpret UTF-8 input.  The cleanest approach would be
for it to share whatever facility GNU troff's regular input stream
reader uses to support UTF-8.  But that has to be written first.

Regards,
Branden

[1] https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.h

[2] https://www.dropbox.com/sh/17ftu3z31couf07/AAC_9kq0ZA-Ra2ZhmZFWlLuva?dl=0

    Open groff.YYYY-MM-DD.pdf, where the date changes from time to time;
    see pages 73 and 84 (as of this writing).

[3] It can be done; it's just harder than migrating from ASCII to UTF-8.
    My idea is to relocate all these bespoke groff symbols to the
    Unicode Private Use Area.  But for that we need to change groff's
    string class[4] to build upon either (1) wide characters or (2)
    multibyte characters.  My preference is to go straight to char32_t.

[4] groff, having been first written in about 1989, does not use the
    Standard C++ library string class.  This has proven unproblematic;
    it's implemented well and I'm not aware of any defect _ever_ being
    exposed in it.  (This illustrates that James Clark was a better C++
    programmer than most.)  But if I change it, someone's going to ask
    me why I don't just migrate to Standard C++ library facilities for
    it and I need a good answer.  I'm working on that.  When defending
    my engineering decisions, I prefer to be equipped with stone tablets
    strong enough to smash over the head of my interlocutor.  I'm not quite
    there yet with groff strings: The Next Generation.

    While I'm pontificating I'll opine that I'm not a huge fan of C++ as
    a language, but I have found with groff that, given discipline, and
    by maintaining a clear view of its roots in C (_also_ not my
    favorite language--but one alienating, enemy-making rant at a time),
    and not picking up every f***ing new feature that gets shoved into
    the language as soon as (or before) it's standardized, it _can_ be
    managed.  But I also think that the C++ templating facility was, in
    implementation, one of the worst features ever developed for any
    programming language.

    I've decided to try to keep groff's C++ codebase ISO C++98
    compatible for the foreseeable future, even though there are _some_
    aspects of later C++ standards that I like quite a bit.  (Simple
    things, like proper damn data types and constants for null
    pointers.)  Clark wrote groff before name spaces, templates, and
    exceptions were added to the language, so you don't see them in its
    sources--it's pretty much "Annotated Reference Manual" C++, but
    if you look carefully you _will_ find some use of vec<>, added by
    later contributors.  And I have seen the pre-template,
    preprocessor-based implementation of "ITABLES" and "PTABLES", and
    no, I don't think it's prettier than templates.  The interesting
    thing is, 30+ years after adding these generic programming
    facilities, nothing in groff _ever_ specialized them beyond the
    base types they were initially used with.  I find that suggestive.

    If you want to see generics done right, look at Ada.[5]  <mic drop>

[5] Yes, the background of C++ templates' authorship is a tragedy.

[6] https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/env.cpp#n3790
