At 2023-04-29T21:38:53-0500, Dave Kemper wrote:
> On 4/29/23, Oliver Corff <oliver.co...@email.de> wrote:
> > Would it be a feasible option to use UTF-8 throughout the inner
> > workings of a future groff,

I'm going to phrase this more confrontationally than it needs to be just
to make a point about software design:

It's none of your business what data type groff uses for characters in
its _inner workings_.

Of course I mean that purely from the software-architectural
perspective.  There is no reason for anyone except groff's developers to
care what primitive data type groff uses for this purpose as long as it
behaves correctly and is performant.  The whole point of encapsulation
is to keep other software modules from having to worry about this sort
of thing.

In another sense, it's totally your business and you can look at the
implementation at any time--it's Free Software.  But other software,
including parts of groff that are not GNU troff, the formatter, should
keep its dirty nose out, and expect to be excluded through
language-imposed visibility restrictions (or the impermeable wall of the
Unix process structure).

We absolutely want good UTF-8 support at the _edges_ of the system.  We
want to change GNU troff to cheerfully and correctly interpret UTF-8
input.  And we want output drivers that target devices using UTF-8 as a
character encoding to reliably produce it.

But that's all.

> This is the topic of http://savannah.gnu.org/bugs/?40720
[...]
> But in my opinion, the discussion is somewhat academic given the scope
> of the task and the number of current groff developers familiar with
> core parts of the code.

My idea for the initial scope is <cough> small.

I'm not convinced that the groff string class is sealed as tightly as it
should be.  So when I take a second crack at changing its internal data
type (my first was 2 years or so ago), I need to review it carefully.

From what I've seen the main point of interface we're concerned with is
its `contents` member function, which does in fact return a pointer to a
narrow character.
Possibly that needs to be renamed `as_c_string`, and existing uses of
`contents` audited to verify that they really do need a C string there,
or whether they would work just as well dealing with something else.

Our diagnostic message functions (`fatal`, `error`, `warning`, `debug`
and friends) _do_ expect C strings.  I don't see that changing, since
their next stop is the standard error stream.

As part of this I also need to look over the ISO C++98 string class and
see how much sense it makes just to make groff's string class a
basic_string<char32_t>.[1]

A rough sketch of my plan is this:

1.  Ensure that the groff string class is well-encapsulated.
2.  Change the internal type, and change only the constructors and
    output functions to perform the transformation to and from this new
    type.
3.  Verify that nothing broke.  (If I did 1 and 2 correctly, nothing
    will.)
4.  Remap the code points we're squatting on.  Haven't decided yet
    whether to map them to illegal Unicode code points or to the Unicode
    Private Use Area.  With a char32_t we have all the room in the
    world.
5.  Drop code page 1047 support, per recent discussions with Mike Fulton
    of IBM on this list.
6.  Start not merely accepting, but _assuming_ UTF-8 input, because we
    won't misinterpret C1 controls anymore.

If that doesn't sound like enough work--at some point in the above, each
and every preprocessor has to be checked to ensure it isn't screwing up
the input before it gets to the formatter.

I don't see getting rid of preconv(1) in the near term.  It will remain
useful, particularly if I add the couple of small features I had in mind
for it.  It may continue to play a role in getting input into the
correct Unicode Normalization Form (D).  It might make sense to leave
that business out of the formatter proper.

Regards,
Branden

[1] std::u32string is C++11, and thus not available according to the
    portability horizon we have.  But we can make our own
    basic_string<char32_t> with C++98 facilities and gnulib's 'inttypes'
    module.  Hooray, templates!  ;-)
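
To make steps 1 and 2 of the plan concrete, here is a minimal sketch of
a string class that keeps a 32-bit code point type private and touches
UTF-8 only in its constructor and its output function.  It is an
illustration under stated assumptions, not groff's code: the class name
`wide_string`, the member `as_utf8`, the use of std::vector<uint32_t> as
storage (standing in for a hand-rolled basic_string<char32_t>), and the
policy of replacing malformed input with U+FFFD are all hypothetical.

```cpp
// Hypothetical sketch; none of these names come from groff's sources.
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>
#include <stdint.h>

class wide_string {
public:
  // Input edge: decode a NUL-terminated UTF-8 C string.
  explicit wide_string(const char *utf8) { decode(utf8); }
  // Length in code points, not bytes.
  std::size_t length() const { return cps.size(); }
  // Read-only access to one code point.
  uint32_t operator[](std::size_t i) const { return cps[i]; }
  // Output edge: encode the contents as UTF-8.
  std::string as_utf8() const;
private:
  std::vector<uint32_t> cps;   // internal type; callers never see it
  void decode(const char *s);
};

// Simplified decoder: malformed sequences become U+FFFD.  It does not
// reject overlong forms or surrogates, which a real decoder should.
void wide_string::decode(const char *s)
{
  const unsigned char *p = reinterpret_cast<const unsigned char *>(s);
  while (*p) {
    uint32_t cp;
    int extra;                 // continuation bytes still expected
    if (*p < 0x80) { cp = *p; extra = 0; }
    else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
    else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
    else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
    else { cp = 0xFFFD; extra = 0; }   // stray or invalid lead byte
    ++p;
    for (; extra > 0; --extra) {
      if ((*p & 0xC0) != 0x80) { cp = 0xFFFD; break; }  // truncated
      cp = (cp << 6) | (*p & 0x3F);
      ++p;
    }
    if (cp > 0x10FFFF)
      cp = 0xFFFD;             // beyond the Unicode code space
    cps.push_back(cp);
  }
}

std::string wide_string::as_utf8() const
{
  std::string out;
  for (std::size_t i = 0; i < cps.size(); ++i) {
    uint32_t cp = cps[i];
    assert(cp <= 0x10FFFF);    // invariant established by decode()
    if (cp < 0x80) {
      out += static_cast<char>(cp);
    } else if (cp < 0x800) {
      out += static_cast<char>(0xC0 | (cp >> 6));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      out += static_cast<char>(0xE0 | (cp >> 12));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
      out += static_cast<char>(0xF0 | (cp >> 18));
      out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
  }
  return out;
}
```

Because callers only ever see code points going in and UTF-8 bytes
coming out, the storage type could later be swapped without touching
them, which is the property step 3 is meant to verify.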