URL:
  <https://savannah.gnu.org/bugs/?68230>

                 Summary: [libgroff] our `symbol` type is really starting to
get in the way
                   Group: GNU roff
               Submitter: gbranden
               Submitted: Sat 11 Apr 2026 09:29:54 AM UTC
                Category: Core
                Severity: 3 - Normal
              Item Group: Refactoring
                  Status: Confirmed
                 Privacy: Public
             Assigned to: None
             Open/Closed: Open
         Discussion Lock: Unlocked
         Planned Release: None


    _______________________________________________________

Follow-up Comments:


-------------------------------------------------------
Date: Sat 11 Apr 2026 09:29:54 AM UTC By: G. Branden Robinson <gbranden>
From my working copy:


diff --git a/src/roff/troff/env.cpp b/src/roff/troff/env.cpp
index a019f0472..46b5addde 100644
--- a/src/roff/troff/env.cpp
+++ b/src/roff/troff/env.cpp
@@ -3931,7 +3931,7 @@ static const size_t bpbuflen = WORD_MAX + 2 /* leading '-' + '\0' */;
 // excepted).  Otherwise, return `false`; the contents of `word`,
 // `breakpoint_count`, and `breakpoints` are then not meaningful and
 // should not be used.
-static bool read_hyphenation_exception_word(char *word,
+static bool read_hyphenation_exception_word(unsigned char *word,
                                            int *breakpoint_count,
                                            unsigned char *breakpoints)
 {
@@ -4007,7 +4007,7 @@ static void add_hyphenation_exception_words_request() // .hw
     return;
   }
   // C++11: char wordbuf[wordbuflen]{};
-  char wordbuf[wordbuflen];
+  unsigned char wordbuf[wordbuflen];
   (void) memset(wordbuf, 0, wordbuflen);
   // C++11: unsigned char bpbuf[bpbuflen]{};
   unsigned char bpbuf[bpbuflen];
@@ -4027,8 +4027,17 @@ static void add_hyphenation_exception_words_request() // .hw
       }
       (void) memset(tem, 0, ((newbpbuflen) * sizeof(unsigned char)));
       memcpy(tem, bpbuf, newbpbuflen);
+      // XXX: GNU troff's slovenly historical practice of punning `char`
+      // with `unsigned char` (because the deep wisdom of 1990 says that
+      // ISO 8859 ought to be enough code points for anybody) bites us
+      // here.  We read the input stream as a sequence of unsigned chars
+      // but the keys of a `symbol` dictionary are sequences of _plain_
+      // `char`s, likely so that they can be dealt with using C standard
+      // library <string.h> functions.  But these types are not the
+      // same.  --GBR, 2026
+      char *wbuf = reinterpret_cast<char *>(wordbuf);
       tem = static_cast<unsigned char *>
-           (current_language->exceptions.lookup(symbol(wordbuf), tem));
+           (current_language->exceptions.lookup(symbol(wbuf), tem));
       if (tem != 0 /* nullptr */)
        delete[] tem;
     }
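
The type mismatch that the XXX comment describes can be reproduced in isolation. The following is a minimal sketch, not code from _groff_: `key_length`, `as_plain_chars`, `plain_value`, and `unsigned_value` are hypothetical names, with `key_length` standing in for any interface that, like the `symbol` constructor used above, accepts only plain `char` strings. It also shows that the same byte can widen to a different integer depending on which type it is read through.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for an interface that accepts only plain
// `char` strings.
static std::size_t key_length(const char *key)
{
  assert(key != 0 /* nullptr */);
  return std::strlen(key);
}

// View a buffer of `unsigned char` as plain `char`.  `char`,
// `signed char`, and `unsigned char` are three distinct types, so
// neither an implicit conversion nor a `static_cast` between the
// pointer types compiles; only `reinterpret_cast` (or a C-style cast)
// does.
static const char *as_plain_chars(const unsigned char *buf)
{
  return reinterpret_cast<const char *>(buf);
}

// Byte `i` widened to `int` through a plain `char` lvalue.  For bytes
// above 0x7F the result is negative where plain `char` is signed.
static int plain_value(const unsigned char *buf, std::size_t i)
{
  return as_plain_chars(buf)[i];
}

// Byte `i` widened to `int` through `unsigned char`: always 0..255.
static int unsigned_value(const unsigned char *buf, std::size_t i)
{
  return buf[i];
}
```

Given a buffer holding the bytes 'f', 0xE9, 'e', and a terminating zero, `unsigned_value(buf, 1)` is 233 everywhere, whereas `plain_value(buf, 1)` depends on whether the platform's plain `char` is signed.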


This is not an easy hill to climb.

* We could add a constructor for `symbol` that takes a pointer to `unsigned
char` as its first argument, and do the nuclear `reinterpret_cast<>`-ing in a
more hidden way inside _libgroff_.  That doesn't fix ugliness, but merely
conceals it.

* The `symbol` class's only other constructor does a **ton** of work.
https://cgit.git.savannah.gnu.org/cgit/groff.git/tree/src/libs/libgroff/symbol.cpp?h=1.24.1#n86

* [https://isocpp.org/wiki/faq/ctors#init-methods You can't delegate
constructors in C++98], so even if we did that, we'd have to either duplicate
that buttload of code--inviting a hazard of desynchronization and bloating
the "libgroff.a" object that gets compiled into most of _groff_'s
executables--or refactor out much of the logic into an "init method" as that
link discusses.  Any such refactoring has hazards.  (That said,
[https://paste.c-net.org/TummyFloozy I think we can be confident that breakage
would be swiftly caught.])

* Worse, we'd have to worry about the effect of the changed data type on the
hash algorithm.  Maybe it would be fine.  Do we **know**?

* Migrating `symbol` to be `unsigned char`-based takes away our ability to use
traditional standard C string library functions, which the _groff_ codebase
does copiously since there was no C++ standard library at the time Clark wrote
_groff_.  C++ wasn't even standardized until a few years after he retired as
maintainer.

* The return on any time investment in _groff_'s `symbol` class seems low
to me until and unless we simply port the thing to STL containers, enabling us
to discard a lot of sophomore-year CS student data structure management
machinery and manual memory management.  See bug #66672.

* On the bright side, porting `symbol` to use an STL `unordered_map` or
similar, keyed on a `vector` of `char32_t` elements, seems like it should
carry us most of the way to fulfilling a dream of Deri's, which is Unicode
names for registers, strings and macros.  Users in general will notice, too,
and jacked identifiers like this are a common selling point for modern
programming languages.  You can name a variable 🎈 or ☭ in many of them.
It'd be neat to do that in GNU _troff_ as well.
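
The "init method" refactoring and the hashing worry raised in the list above can be sketched together. Everything here is hypothetical: `toy_symbol` and `toy_hash` are illustrative names, and the hash is a toy multiplicative one, not the algorithm in _libgroff_'s "symbol.cpp". The sketch stays within C++98. Both constructors funnel into one private `init()`, which always hashes through `unsigned char`, so the result does not depend on the signedness of plain `char`.

```cpp
#include <cassert>
#include <cstring>

// Toy multiplicative hash, NOT libgroff's.  Because each element is
// converted to `unsigned int`, a negative `signed char` contributes a
// different value than the same byte read as `unsigned char`.
template <typename CharT>
static unsigned int toy_hash(const CharT *s)
{
  unsigned int h = 0;
  for (; *s != 0; s++)
    h = h * 31u + static_cast<unsigned int>(*s);
  return h;
}

// Hypothetical miniature of a `symbol`-like class.
class toy_symbol {
  unsigned int hash_value;
  // The "init method": C++98 constructors cannot delegate to one
  // another, so the shared work lives here, in exactly one place.
  void init(const unsigned char *p)
  {
    assert(p != 0 /* nullptr */);
    hash_value = toy_hash(p);
  }
public:
  explicit toy_symbol(const char *p)
  {
    // The one remaining cast is hidden inside the class.
    init(reinterpret_cast<const unsigned char *>(p));
  }
  explicit toy_symbol(const unsigned char *p)
  {
    init(p);
  }
  unsigned int hash() const { return hash_value; }
};
```

With this toy hash, pure ASCII keys hash identically through either type, but a key containing a byte above 0x7F hashes differently when read as `signed char`. Whether the real hash in `symbol` has the same sensitivity depends on how it widens its input, which this sketch does not establish.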

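As a rough illustration of the STL-based direction in the last item, here is a minimal sketch of a table keyed on a `vector` of `char32_t`. It requires C++11 or later. The names `identifier`, `identifier_hash`, `string_table`, and `make_identifier` are hypothetical, and the FNV-1a hash is an arbitrary illustrative choice: the standard library provides no `std::hash` specialization for `std::vector<char32_t>`, so some hasher has to be supplied.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical identifier type: a sequence of Unicode code points.
typedef std::vector<char32_t> identifier;

// 64-bit FNV-1a over the bytes of each code point, least significant
// byte first.  An illustrative choice only.
struct identifier_hash {
  std::size_t operator()(const identifier &id) const
  {
    std::uint64_t h = 14695981039346656037ULL; // FNV offset basis
    for (std::size_t i = 0; i < id.size(); i++) {
      std::uint32_t cp = id[i];
      for (int shift = 0; shift < 32; shift += 8) {
        h ^= (cp >> shift) & 0xFFu;
        h *= 1099511628211ULL; // FNV prime
      }
    }
    return static_cast<std::size_t>(h);
  }
};

// Hypothetical table mapping identifiers to string contents.
typedef std::unordered_map<identifier, std::string, identifier_hash>
  string_table;

// Build an identifier from a null-terminated UTF-32 literal.
static identifier make_identifier(const char32_t *s)
{
  assert(s != nullptr);
  identifier id;
  while (*s != U'\0')
    id.push_back(*s++);
  return id;
}
```

A name such as `make_identifier(U"\U0001F388")` (the balloon) is then a one-element key like any other. Decoding the input stream into code points is a separate problem that this sketch leaves out.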
