URL: <https://savannah.gnu.org/bugs/?68230>
Summary: [libgroff] our `symbol` type is really starting to
get in the way
Group: GNU roff
Submitter: gbranden
Submitted: Sat 11 Apr 2026 09:29:54 AM UTC
Category: Core
Severity: 3 - Normal
Item Group: Refactoring
Status: Confirmed
Privacy: Public
Assigned to: None
Open/Closed: Open
Discussion Lock: Unlocked
Planned Release: None
_______________________________________________________
Follow-up Comments:
-------------------------------------------------------
Date: Sat 11 Apr 2026 09:29:54 AM UTC By: G. Branden Robinson <gbranden>
From my working copy:
diff --git a/src/roff/troff/env.cpp b/src/roff/troff/env.cpp
index a019f0472..46b5addde 100644
--- a/src/roff/troff/env.cpp
+++ b/src/roff/troff/env.cpp
@@ -3931,7 +3931,7 @@ static const size_t bpbuflen = WORD_MAX + 2 /* leading '-' + '\0' */;
// excepted). Otherwise, return `false`; the contents of `word`,
// `breakpoint_count`, and `breakpoints` are then not meaningful and
// should not be used.
-static bool read_hyphenation_exception_word(char *word,
+static bool read_hyphenation_exception_word(unsigned char *word,
int *breakpoint_count,
unsigned char *breakpoints)
{
@@ -4007,7 +4007,7 @@ static void add_hyphenation_exception_words_request() // .hw
return;
}
// C++11: char wordbuf[wordbuflen]{};
- char wordbuf[wordbuflen];
+ unsigned char wordbuf[wordbuflen];
(void) memset(wordbuf, 0, wordbuflen);
// C++11: unsigned char bpbuf[bpbuflen]{};
unsigned char bpbuf[bpbuflen];
@@ -4027,8 +4027,17 @@ static void add_hyphenation_exception_words_request() // .hw
}
(void) memset(tem, 0, ((newbpbuflen) * sizeof(unsigned char)));
memcpy(tem, bpbuf, newbpbuflen);
+ // XXX: GNU troff's slovenly historical practice of punning `char`
+ // with `unsigned char` (because the deep wisdom of 1990 says that
+ // ISO 8859 ought to be enough code points for anybody) bites us
+ // here. We read the input stream as a sequence of unsigned chars
+ // but the keys of a `symbol` dictionary are sequences of _plain_
+ // `char`s, likely so that they can be dealt with using C standard
+ // library <string.h> functions. But these types are not the
+ // same. --GBR, 2026
+ char *wbuf = reinterpret_cast<char *>(wordbuf);
tem = static_cast<unsigned char *>
- (current_language->exceptions.lookup(symbol(wordbuf), tem));
+ (current_language->exceptions.lookup(symbol(wbuf), tem));
if (tem != 0 /* nullptr */)
delete[] tem;
}
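To make the type mismatch in that last hunk concrete, here is a minimal sketch. Every name in it is hypothetical, and `toy_hash` is emphatically _not_ the hash in "symbol.cpp"; it only shows how a byte-at-a-time hash depends on the element type it reads through. The cast leaves the bytes alone, so <string.h> functions work on the result, but a byte above 0x7F has a different _value_ through a signed plain `char`.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical toy hash, NOT the algorithm in libgroff's symbol.cpp.
// It widens each element to unsigned long before mixing it in, so a
// negative `char` sign-extends and contributes a different value than
// the same byte read as `unsigned char`.
template <typename T>
unsigned long toy_hash(const T *s)
{
  unsigned long h = 0;
  for (; *s != 0; s++)
    h = h * 31 + static_cast<unsigned long>(*s);
  return h;
}

// The same pun as in the diff: the bytes are untouched; only the
// pointer type changes.
inline char *as_plain_chars(unsigned char *s)
{
  return reinterpret_cast<char *>(s);
}

// Whether plain `char` is signed is implementation-defined.
inline bool plain_char_is_signed()
{
  return static_cast<char>(-1) < 0;
}
```

With a key such as the ISO 8859-1 bytes `n a 0xEF v e`, `std::strlen(as_plain_chars(word))` is still 5, and ASCII-only keys hash identically through either type. On a platform where plain `char` is signed, though, `toy_hash` gives different results for the two views of the same non-ASCII key.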
This is not an easy hill to climb.
* We could add a constructor for `symbol` that takes a pointer to `unsigned
char` as its first argument, and do the nuclear `reinterpret_cast<>`-ing in a
more hidden way inside _libgroff_. That doesn't fix the ugliness; it merely
conceals it.
* The `symbol` class's only other constructor does a **ton** of work.
https://cgit.git.savannah.gnu.org/cgit/groff.git/tree/src/libs/libgroff/symbol.cpp?h=1.24.1#n86
* [https://isocpp.org/wiki/faq/ctors#init-methods You can't delegate
constructors in C++98], so even if we did that, we'd have to either duplicate
that buttload of code (inviting a hazard of desynchronization, and bloating
the "libgroff.a" object that gets compiled into most of _groff_'s executables)
or refactor much of the logic out into an "init method" as that link
discusses. Any such refactoring has hazards. (That said,
[https://paste.c-net.org/TummyFloozy I think we can be confident that breakage
would be swiftly caught.])
* Worse, we'd have to worry about the effect of the changed data type on the
hash algorithm. Maybe it would be fine. Do we **know**?
* Migrating `symbol` to be `unsigned char`-based takes away our ability to use
traditional standard C string library functions, which the _groff_ codebase
uses copiously since there was no C++ standard library at the time Clark wrote
_groff_. C++ wasn't even standardized until a few years after he retired as
maintainer.
* The return on any time investment in _groff_'s `symbol` class seems low
to me until and unless we simply port the thing to STL containers, enabling us
to discard a lot of sophomore-year CS student data structure management
machinery and manual memory management. See bug #66672.
* On the bright side, porting `symbol` to use an STL `unordered_map` or
similar, keyed on a `vector` of `char32_t` elements, seems like it should
carry us most of the way to fulfilling a dream of Deri's, which is Unicode
names for registers, strings and macros. Users in general will notice, too,
and jacked identifiers like this are a common selling point for modern
programming languages. You can name a variable 🎈 or ☠ in many of them.
It'd be neat to do that in GNU _troff_ as well.
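
For the last bullet, a rough sketch of what a dictionary keyed on a `vector` of `char32_t` might look like. It assumes C++11 or later, which the "C++11:" comments in the diff suggest _groff_ does not yet require, and every name in it (`uname`, `uname_hash`, `string_table`, `make_name`) is invented for illustration. The one non-obvious detail is that the standard library supplies no `std::hash` for `std::vector<char32_t>`, so the map needs its own hasher; the one here is an FNV-1a-style mix over code points, chosen only because it is short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical names throughout; nothing here exists in groff.
typedef std::vector<char32_t> uname;

// std::hash is not specialized for std::vector<char32_t>, so supply a
// hasher.  This mixes in one code point at a time, FNV-1a style.
struct uname_hash {
  std::size_t operator()(const uname &name) const
  {
    std::size_t h = 2166136261u;
    for (uname::const_iterator p = name.begin(); p != name.end(); ++p) {
      h ^= static_cast<std::size_t>(*p);
      h *= 16777619u;
    }
    return h;
  }
};

// Key equality comes from std::vector's own operator==.
typedef std::unordered_map<uname, std::string, uname_hash> string_table;

// Build a key from a null-terminated UTF-32 string.
inline uname make_name(const char32_t *s)
{
  uname name;
  for (; *s != 0; s++)
    name.push_back(*s);
  return name;
}
```

A table built this way accepts `make_name(U"\U0001F388")` (the balloon) or `make_name(U"\u2620")` (the skull and crossbones) as keys just as readily as `make_name(U"PS")`. Keying on `std::u32string` instead would get a standard `std::hash` specialization for free; the `vector` form is shown only because it is what the bullet above proposes.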
_______________________________________________________
Reply to this item at:
<https://savannah.gnu.org/bugs/?68230>
_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
