Hi Oliver,

At 2023-04-25T16:25:49+0200, Oliver Corff wrote:
> In the meantime, I had a look at that Russian hyphenation file, and to
> my relief, the structure of the groff hyphenation pattern files is
> that of TeX hyphenation pattern files, which I have worked on before.

Yup.  They were born that way.

> But... the hyphenation file hyphen.ru in the aforementioned source is
> not usable in the current set-up because the Russian syllable
> fragments are encoded in KOI-8, an 8 bit encoding based on a GOST
> Standard of the USSR.
>
> So, it does not match the internal code representation of Unicode code
> points.

No, it doesn't.  But some of the other hyphenation pattern files don't,
either; if you look you will see that they're encoded variously in
ISO 646, ISO 8859-1, ISO 8859-2, and ISO 8859-15.  This is because
groff's hyphenation pattern file parser doesn't understand UTF-8.  That
would be a nice thing to have.

hyphen.ru does a very sneaky thing that I did not think was possible
before Nikita Ivanov dropped it on our doorstep and I took a closer look
at the KOI8-R encoding.

You might know that code points in the "C1 Controls" block of Unicode
(U+0080..U+009F) are invalid input characters to groff.  groff uses them
for internal, bespoke purposes.[1]  This is a barrier to making groff
support UTF-8 input directly, as noted in our documentation.[2][3]

But an interesting property of KOI8-R is that none of the glyphs it
heaps up in the C1 region are alphabetic.  Therefore they don't require
hyphenation.  Therefore the Russian hyphenation patterns, using KOI8-R,
can masquerade effectively as an ISO 8859 encoding.  This is the same
deal that lets us support ISO 8859-{2,15} in our hyphenation patterns.

GNU troff doesn't actually care what these code points "are", it only
needs to know their values to make hyphenation decisions.  The
intelligibility of the hyphenation patterns to a human reader is
determined by the character encoding, but within the range
U+00A0..U+00FF (actually more than that: U+0021..U+007F as well), groff
has no dog in the semantic interpretation fight.

> Since groff internally seems to work with Unicode code positions, the
> question is: in which format should the hyphenation patterns be
> presented to groff?
> As-is, that is as utf8 text, or in \[u04xx] form?
> That does not seem to work either, according to my last experiment.

For now, neither; the KOI8-R cheat seems to work fine, as far as I can
tell or understand.  Admittedly, I'm not a Russian speaker.  But I
believe the contributor is.

Eventually, we will need a way for our hyphenation pattern file reader
function[6] to interpret UTF-8 input.  The cleanest thing to do would be
to have it use the same facility as regular GNU troff input stream
reading support for UTF-8.  But that has to be written first.

Regards,
Branden

[1] https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.h

[2] https://www.dropbox.com/sh/17ftu3z31couf07/AAC_9kq0ZA-Ra2ZhmZFWlLuva?dl=0

    Open groff.YYYY-MM-DD.pdf, where the date changes from time to time;
    see pages 73 and 84 (as of this writing).

[3] It can be done; it's just harder than migrating from ASCII to UTF-8.
    My idea is to relocate all these bespoke groff symbols to the
    Unicode Private Use Area.  But for that we need to change groff's
    string class[4] to build upon either (1) wide characters or (2)
    multibyte characters.  My preference is to go straight to char32_t.

[4] groff, having been first written in about 1989, does not use the
    Standard C++ library string class.  This has proven unproblematic;
    it's implemented well and I'm not aware of any defect _ever_ being
    exposed in it.  (This illustrates that James Clark was a better C++
    programmer than most.)  But if I change it, someone's going to ask
    me why I don't just migrate to Standard C++ library facilities for
    it and I need a good answer.  I'm working on that.  When defending
    my engineering decisions, I prefer to be equipped with stone tablets
    strong enough to smash over the head of my interlocutor.  I'm not
    quite there yet with groff strings: The Next Generation.

    While I'm pontificating I'll opine that I'm not a huge fan of C++ as
    a language, but I have found with groff that, given discipline, and
    by maintaining a clear view of its roots in C (_also_ not my
    favorite language--but one alienating, enemy-making rant at a time),
    and not picking up every f***ing new feature that gets shoved into
    the language as soon as (or before) it's standardized, it _can_ be
    managed.  But I also think that the C++ templating facility was, in
    implementation, one of the worst features ever developed for any
    programming language.  I've decided to try to keep groff's C++
    codebase ISO C++98 compatible for the foreseeable future, even
    though there are _some_ aspects of later C++ standards that I like
    quite a bit.  (Simple things, like proper damn data types and
    constants for null pointers.)

    Clark wrote groff before name spaces, templates, and exceptions were
    added to the language, so you don't see them in its sources--it's
    pretty much in "Annotated Reference Manual C++", but if you look
    carefully you _will_ find some use of vec<>, added by later
    contributors.  And I have seen the pre-template, preprocessor-based
    implementation of "ITABLES" and "PTABLES", and no, I don't think
    it's prettier than templates.  The interesting thing is, 30+ years
    after adding these generic programming facilities, nothing in groff
    _ever_ specialized them beyond the base types they were initially
    used with.  I find that suggestive.

    If you want to see generics done right, look at Ada.[5]

    <mic drop>

[5] Yes, the background of C++ templates' authorship is a tragedy.

[6] https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/env.cpp#n3790
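
For anyone who wants to check the KOI8-R property described above (no
alphabetic glyphs in the 0x80..0x9F range), here is a minimal Python
sketch.  It is not groff code and says nothing about how groff's pattern
reader works; it assumes only Python's standard "koi8_r" codec, and the
function names are made up for illustration.

```python
# Sketch: find which KOI8-R byte values decode to alphabetic characters.
# Uses only Python's built-in "koi8_r" codec; nothing here is groff code.

def koi8r_alphabetic_bytes():
    """Return the byte values in 0x80..0xFF that KOI8-R maps to letters."""
    letters = []
    for value in range(0x80, 0x100):
        char = bytes([value]).decode("koi8_r")
        if char.isalpha():
            letters.append(value)
    return letters


def c1_region_has_letters():
    """True if any KOI8-R byte in 0x80..0x9F decodes to a letter."""
    return any(0x80 <= value <= 0x9F for value in koi8r_alphabetic_bytes())


if __name__ == "__main__":
    print("letters in 0x80..0x9F:", c1_region_has_letters())
    print("lowest alphabetic byte: 0x%02X" % min(koi8r_alphabetic_bytes()))
```

The same codec can turn a pattern typed as UTF-8 text into the byte
values a KOI8-R pattern file would carry, e.g. "при".encode("koi8_r").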