Re: groff 1.24.0 bans C0 controls in identifiers (was: groff 1.24.0 released)

G. Branden Robinson Tue, 17 Mar 2026 05:12:00 -0700

Hi John,

At 2026-03-16T18:41:05+1100, John Gardner wrote:
> > I dispute that it wasn't harming anybody. The fact that all the same
> > characters were valid as delimiters as in identifiers, combined with
> > AT&T troff's lack of an "interpolation depth" concept, meant that it
> > was _impossible_ to robustly do things like formatted output
> > comparisons involving string or register identifiers.
> 
> I'm sorry, I still don't quite follow.


Okay.  I'll present some highly concrete demonstrations throughout this
reply.

> You can use alphanumerics as delimiters too,

Not always.  AT&T troff had multiple contexts of delimitation; I think
people tend to use string delimitation, as seen in the `tl` request, as
their only mental model of delimiter usage.

Here's just a glimpse into the exceptions from your stated rule:

* You cannot reliably use decimal digits/numerals to delimit formatted
  output comparisons.  For example...

  .if @xan@xan@ .tm formatted outputs are the same

  ...works, but...

  .if 0xan0xan0 .tm formatted outputs are the same

  ...does not.  The leading digit causes AT&T troff to decide that it's
  testing a numeric expression, and it sees enough (a zero, which is
  "falsy", followed by input that is not valid in a numeric expression)
  to decide that the condition is false.

Illustration:

$ echo '.if @xan@xan@ .tm formatted outputs are the same' \
  | dwb nroff 2>&1 | grep . || echo NO OUTPUT
formatted outputs are the same
$ echo '.if @xan@xan@ .tm formatted outputs are the same' \
  | 9 nroff 2>&1 | grep . || echo NO OUTPUT
formatted outputs are the same
$ echo '.if @xan@xan@ .tm formatted outputs are the same' \
  | heirloom nroff 2>&1 | grep . || echo NO OUTPUT
formatted outputs are the same

$ echo '.if 0xan0xan0 .tm formatted outputs are the same' \
  | dwb nroff 2>&1 | grep . || echo NO OUTPUT
NO OUTPUT
$ echo '.if 0xan0xan0 .tm formatted outputs are the same' \
  | 9 nroff 2>&1 | grep . || echo NO OUTPUT
NO OUTPUT
$ echo '.if 0xan0xan0 .tm formatted outputs are the same' \
  | heirloom nroff 2>&1 | grep . || echo NO OUTPUT
NO OUTPUT

  (Exercise: What happens if you change the "0" to "1"?  Why?)

* Similarly, you can't use the alphabetics `e`, `o`, `n`, or `t`
  to delimit formatted output comparisons.

  (Exercise: Verify this claim.  Hint: verifiability and reliability of
  outcomes depend on factors other than the input stream alone.
  Why?)
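
One way to try both exercises is to generate the same comparison with
several candidate delimiters and feed each to a formatter.  The
following is a minimal sketch, not part of groff's test suite: the
helper name `mkcond` is made up for illustration, and `nroff` stands for
whichever formatter is under test.  It prints whatever the formatter
emits and makes no claim about the results.

```shell
#!/bin/sh
# Build one ".if" formatted-output comparison that uses $1 as delimiter.
# (mkcond is a hypothetical helper name, not a groff facility.)
mkcond() {
  printf '.if %sxan%sxan%s .tm delimiter %s: outputs are the same\n' \
    "$1" "$1" "$1" "$1"
}

# Assumption: a command named "nroff" is on PATH; substitute the
# formatter you want to examine.
if command -v nroff >/dev/null 2>&1; then
  for d in @ 0 1 e o n t; do
    printf '%s => ' "$d"
    # Show diagnostics and text alike; report when nothing is emitted.
    mkcond "$d" | nroff 2>&1 | grep . || echo NO OUTPUT
  done
else
  echo "nroff not found; skipping" >&2
fi
```

Running the loop against more than one implementation shows whether
they agree on each candidate.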

> and other commonly-used, non-C0 delimiter characters (like
> apostrophes) are legal in identifiers just like C0 control characters
> are:
> 
> .ds ' Apostrophe
> .if B\*'BApostropheB \{\
> . nop U+0027 is an \*' character.
> .\}

Yes, but I have neither proposed nor implemented withdrawal of support
for that idiom.

For more than you ever wanted to know about delimiter syntax and
portability, see a test script that is new to groff 1.24.0.

https://cgit.git.savannah.gnu.org/cgit/groff.git/tree/src/roff/groff/tests/check-delimiter-validity.sh?h=1.24.0

> > And in fact the premier use of the five control characters at issue:
> > […]
> > ...was historically seen only in delimiters.
> 
> So this is how we determine which parts of CSTR-54 are to be honoured
> and upheld, and which parts are open to interpretation?

CSTR #54 is not a formal grammar.  Every word of it is open to human
interpretation.

We don't even have a _semi_-formal BNF-style grammar for *roff.

Maybe some day I will know enough to write one down.  I worry that it's
strictly impossible, because the ability to change the control, no-break
control, and escape characters might make the language formally
undecidable (as is the case with the POSIX shell thanks to the "alias"
feature[1]).  But I don't know if a language's undecidability is a
barrier to writing a BNF-style grammar for it.  (It might not be
_strictly_ irreconcilable, but I can imagine a hazard remaining such
that the resulting grammar would be useless in practice.)

If I were more of a computer scientist, maybe I'd have answers.

> By gauging historical usage? Do independent macro authors and general
> Roff hacking enthusiasts not count?

All three are important.  But more importantly, CSTR #54 is not inerrant
scripture.  It has errors and ambiguities.

See "CSTR #54 errata" in the following.

https://www.gnu.org/software/groff/manual/groff.html.node/Concept-Index.html

> > I'm sure--but who has ever actually done this? Can you point me to
> > examples?
> 
> I have. I'm putting my cards on the table here, but I have a macro
> package named Mono that's designed to work with as many Troff
> implementations (both current and historic), gracefully degrading in
> the absence of more modern approaches.[1]

I've seen its GitHub page before, but didn't look at it closely enough
to perceive this element of its design.

> To this end, it sticks to a 2-character namespace, and to limit the
> risk of clobbering an existing 2-character string or register
> definition, throwaway registers/strings defined within the body of a
> macro to hold temporary values use the prickliest,
> least-likely-to-be-used names possible (which are removed before the
> macro's end).

But you're still playing probabilities with that approach.  You're still
at the mercy of an input document author making the same decision to try
and stay out of _your package's_ way, not knowing of your strategy.

It's like the old problem of two cars arriving simultaneously at a
four-way stop.  If there's no protocol for tie-breaking, the risk of
a collision is non-negligible.

A two-character namespace using only Basic Latin code points limits
*roff to 9,025 identifiers.[2]  Grabbing hold of the only 5 additional
control characters ^B, ^C, ^E, ^F, ^G that you can rely on to survive
input processing raises that to only 10,000.  I'll concede that that's
a nice round number, but that is all.  It's about a 10% expansion of the
name space.  Outside of performance measurement, in computing we tend to
hand-wave away improvements (or even sometimes detriments) of less than
_an order of magnitude_.
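
The arithmetic can be checked with a few lines of shell.  This sketch
assumes, as the figures above do, that only names of exactly two
characters are counted; `ns_size` is a hypothetical helper.

```shell
#!/bin/sh
# Number of distinct names of exactly two characters drawn from an
# alphabet of $1 usable characters.  (ns_size is a made-up name.)
ns_size() {
  echo $(( $1 * $1 ))
}

base=$(ns_size 95)     # 94 graphic characters plus DEL
wider=$(ns_size 100)   # the same, plus ^B ^C ^E ^F ^G

# Growth in tenths of a percent, using integer arithmetic only.
growth=$(( (wider - base) * 1000 / base ))

echo "without controls: $base"
echo "with controls:    $wider"
echo "growth:           $(( growth / 10 )).$(( growth % 10 ))%"
```

It reports 9025 and 10000 names, a growth of 10.8%.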

> Here's an example of where this is done: Mono's .``
> <https://github.com/Alhadis/Mono/blob/25765171fbf676b623a4bcbf3d9f93384ef83040/ono.tmac#L250-L302>
> macro, used to format a code-block (specifically, my current terminal
> session at the moment…):
> 
> .`` shell-session 4m
> $ groff -Tutf8 -M ~/Labs/Mono -mono test.roff 2>&1 | trimend | uniq
> troff:/Users/Alhadis/Labs/Mono/ono.tmac:863: error: character code 5 is not
> allowed in an identifier
> troff:test.roff:3: error: character code 2 is not allowed in an identifier
>  (printed ×3)
> troff:test.roff:12: error: character code 2 is not allowed in an identifier
> (printed ×3)
> .``

I see.

> When the opening .`` is called, a temporary macro named .`^B is
> defined that restores the settings (indent, font, hyphenation,
> fill-mode, etc) that were changed by the macro, whose original values
> were "baked" into .`^B by interpolating their values with one less “\”
> than would normally be used.
> 
> But enough of the rambling advertisement. Do I expect you (or anybody
> else, really) to give a two shits about my half-finished
> macro-package? No. Do I expect you, as maintainer, to sanction
> breaking changes to Groff's syntax *if and only if* there's a
> *convincing, compelling, and practical reason for doing so?* Yes.

I think I've made it.

> > I would guess, though I don't know, that's exactly why DWB 3.3 took
> > the decision to ban C0 controls from use in identifiers.
> 
> If *I* had to guess, it had more to do with DWB's implementation of
> terminal driving tables
> <https://github.com/n-t-roff/DWB3.3/blob/2dcc55026310cccef38dcd549b140bf15dfd3f0f/text/nterm/README>
> which need to reserve certain control characters that map to functions
> and physical motions in hardware teletypes. Here's the driving table
> for a Model 37
> <https://github.com/n-t-roff/DWB3.3/blob/094c0be1f5ec0b63cd3383a0c670bc53d4af40e3/text/nterm/tab.37>.

I don't think that can be the reason, because DWB 3.3 is not unique in
its use of "driving tables".  All Kernighan-descended troffs use them,
so you find them in System V/Solaris troff, Plan 9 troff, and Heirloom
Doctools troff.

> But please, enough with the DWB whataboutisms already. This isn't
> about portability to historical Troff versions, *it's about
> portability between Groff versions.* And you know that as well as I
> do.

I can't reconcile this reduction in scope with your self-imposed
restriction to two-character identifiers when implementing your Mono
package.  groff has allowed identifiers of arbitrary length from day
one, or every day of the past 35 years at any rate.

> > It also happens to be the case that when you view [documents] in,
> > say, the Firefox web browser when visiting the TUHS "Unix Tree",
> > these control characters simply aren't visible.
> 
>    1. *TUHS can always use an embedded webfont to display “invisible”
>    control characters.* Find an example CSS snippet with a data
>    URI-encoded WOFF2 font I hacked together just now as a POC.

Okay.  I offered TUHS as an exemplar of a common problem.  Why don't we
permit control characters in the names of shell variables or C
identifiers?

>    2. *The issue of visibility affects delimiters as well.* So it's
>    not as though banning C0 characters in identifiers is adding any
>    rendering issues that TUHS doesn't already suffer from.

That's true, and I think it's a good idea to avoid them in those
contexts as well.  In groff, you never need them.  AT&T troff presents a
thornier problem.  To avoid the issue, one needs to implement GNU
troff's concept of "input level" or "interpolation depth", or something
like it.

Further reading:

https://www.gnu.org/software/groff/manual/groff.html.node/Delimiters.html
https://www.gnu.org/software/groff/manual/groff.html.node/Compatibility-Mode.html

>    3. *You can configure Firefox to show such control characters in
>       text.* Open about:config and enable the following flags:
>       - layout.css.control-characters.visible
>         <https://searchfox.org/firefox-main/rev/7b08fa00f500ed877b16983a6a77d2c852aad1d0/modules/libpref/init/StaticPrefList.yaml#10350>
>       - layout.css.moz-control-character-visibility.enabled
>         <https://searchfox.org/firefox-main/rev/7b08fa00f500ed877b16983a6a77d2c852aad1d0/modules/libpref/init/StaticPrefList.yaml#10210>
>    Note that Firefox already displays control characters in pages
>    displayed as plain-text (e.g., *file://localhost/tmp/foo.txt*, or
>    ono.tmac <https://raw.github.com/Alhadis/Mono/HEAD/ono.tmac>).[2]

Interesting!  I'll check that out.  I use Firefox and I wonder why it
doesn't surmount TUHS's problem for me by default.  Oh, wait, it's due
to the final factor you mentioned.  Warren has the site set up to serve
historical Unix files as HTML, not plain text.

[rearranged]

> *Manually inserted footnotes, BiblioBrandenᵀᴹ style:*
> 
> [1]: E.g., the C/A/T phototypesetter didn't support true line-drawing,

Huh!  I didn't know that.  I thought that was just a hangover from the
Teletype Model 37 and similar.  But it's making more sense to me now;
just this week Clem Cole sent me in private mail a bit of a brain-dump
about working with the C/A/T.  If I understood him correctly, then yeah,
I don't see how the thing could draw arbitrary lines.

> so Mono's .UL macro employs \l'…' to print repeated underscores, and a
> proper \D'l …' for more modern Troffs. For text-based output (.if n
> …), .cu is used to underline spaces (though I still need to fix the
> issue of line-wrapping…)

Coincidentally, I've turned my attention to `ul` and `cu` this past week
as well.

https://cgit.git.savannah.gnu.org/cgit/groff.git/commit/?id=7214ba10ad2c92d352a693739295e785edda8afb
https://cgit.git.savannah.gnu.org/cgit/groff.git/commit/?id=ba78e9f7422ca4f3b725e3ad8a05fee9c44c0026

> I don't know how else to conclude this e-mail, so I'll end with the
> only topically-appropriate way I know how.

Okay, here's the promised lengthy (but concrete) exploration of the
practical and portability problems motivating the change of which you
complain.

$ cat simp.tmac
.\" macro package to demonstrate hazards of control chars in identifiers
.de HE
.  tl '\\*(LH'\\*(CH'\\*(RH'
.  sp 0.5i
..
.de FT
.  tl \\*(LF\\*(CF\\*(RF
'  bp
..
.wh 0 HE
.wh -0.5i-1u FT
$ cat -v demo1.simp
.so simp.tmac
.ds LH Bunyan
.ds CH The Pilgrim's Progress
.ds RH \\n%
.ds LF DRAFT
.ds CF evil^Binput\" string contents may be weirder than they appear
.ds RF 2026-03-15
.sp
Hello, CoE dissidents!
.sp
Thank you for reading my novel,
banned by over 50 vicars in Hertfordshire alone.
$ dwb nroff demo1.simp |cat -s
Bunyan                     The Pilgrim                 s Progress

Hello, CoE dissidents!

Thank you for reading my novel,  banned  by  over  50  vicars  in
Hertfordshire alone.

DRAFT                         evil                          input

$ 9 nroff demo1.simp |cat -s
Bunyan                     The Pilgrim                 s Progress

Hello, CoE dissidents!

Thank you for reading my novel,  banned  by  over  50  vicars  in
Hertfordshire alone.

DRAFT                         evil                          input

$ heirloom nroff demo1.simp |cat -s
Bunyan                     The Pilgrim                 s Progress

Hello, CoE dissidents!

Thank you for reading my novel, banned by over 50 vicars in Hert-
fordshire alone.

DRAFT                         evil                          input

$ ~/groff-1.23.0/bin/nroff demo1.simp |cat -s
Bunyan                The Pilgrim’s Progress                    1

Hello, CoE dissidents!

Thank you for reading my novel, banned by over 50 vicars in Hert‐
fordshire alone.

DRAFT                       evilinput                 2026‐03‐15

$ ~/groff-1.24.0/bin/nroff demo1.simp |cat -s
Bunyan                The Pilgrim’s Progress                    1

Hello, CoE dissidents!

Thank you for reading my novel, banned by over 50 vicars in Hert‐
fordshire alone.

DRAFT                       evilinput                 2026‐03‐15


We can get consistent output among troffs by just not using an
apostrophe in our center header.  A better solution would be for the
macro package author to use a control character as a delimiter.
However, we _still_ have a problem if the document author elects to use
the same delimiter in their string contents.

$ cat -v demo2.simp
.so simp.tmac
.ds LH Bunyan
.ds CH The Pilgrims Progress
.ds RH \\n%
.ds LF DRAFT
.ds CF evil^Binput\" string contents may be weirder than they appear
.ds RF 2026-03-15
.sp
Hello, CoE dissidents!
.sp
Thank you for reading my novel,
banned by over 50 vicars in Hertfordshire alone.
$ dwb nroff demo2.simp |cat -s
Bunyan                The Pilgrims Progress                     1

Hello, CoE dissidents!

Thank you for reading my novel,  banned  by  over  50  vicars  in
Hertfordshire alone.

DRAFT                         evil                          input

$ 9 nroff demo2.simp |cat -s
Bunyan                The Pilgrims Progress                     1

Hello, CoE dissidents!

Thank you for reading my novel,  banned  by  over  50  vicars  in
Hertfordshire alone.

DRAFT                         evil                          input

$ heirloom nroff demo2.simp |cat -s
Bunyan                The Pilgrims Progress                     1

Hello, CoE dissidents!

Thank you for reading my novel, banned by over 50 vicars in Hert-
fordshire alone.

DRAFT                         evil                          input

$ ~/groff-1.23.0/bin/nroff demo2.simp |cat -s
Bunyan                The Pilgrims Progress                     1

Hello, CoE dissidents!

Thank you for reading my novel, banned by over 50 vicars in Hert‐
fordshire alone.

DRAFT                       evilinput                 2026‐03‐15

$ ~/groff-1.24.0/bin/nroff demo2.simp |cat -s
Bunyan                The Pilgrims Progress                     1

Hello, CoE dissidents!

Thank you for reading my novel, banned by over 50 vicars in Hert‐
fordshire alone.

DRAFT                       evilinput                 2026‐03‐15


We can see that hazards remain, and only GNU troff overcomes them, but
nevertheless that's not squarely the issue to which you object.  So
here's an alternative version of this macro package, one which
deliberately uses control characters in identifier names.  Unlike in
your Mono project, these names are "exposed" via the notional "API",
though that doesn't make any real difference as far as I can see.

$ cat -v altsimp.tmac
.\" macro package to demonstrate hazards of control chars in identifiers
.\"
.\" let's use clever names for our macro and string names so as to
.\" not step on likely user selections.
.de H^C
.  tl '\\*(L^E'\\*(C^E'\\*(R^E'
.  sp 0.5i
..
.de F^C
.  tl ^B\\*(L^F^B\\*(C^F^B\\*(R^F^B
'  bp
..
.wh 0 H^C
.wh -0.5i-1u F^C
$ cat -v demo3.simp
.so altsimp.tmac
.ds L^E Bunyan
.ds C^E The Pilgrims Progress
.ds R^E \\n%
.ds L^F DRA^FFT
.ds C^F evil^Binput\" string contents may be weirder than they appear
.ds R^F 2026-03-15
.sp
Hello, CoE dissidents!
.sp
Thank you for reading my novel,
banned by over 50 vicars in Hertfordshire alone.
$ dwb nroff demo3.simp |cat -s
 DRAFT                      evilinput                  2026-03-15

Hello, CoE dissidents!

Thank you for reading my novel,  banned  by  over  50  vicars  in
Hertfordshire alone.

 DRAFT                         evil                         input

$ 9 nroff demo3.simp |cat -s
Bunyan                The Pilgrims Progress                     1

Hello, CoE dissidents!

Thank you for reading my novel,  banned  by  over  50  vicars  in
Hertfordshire alone.

DRAFT                         evil                          input

$ heirloom nroff demo3.simp |cat -s
Bunyan                The Pilgrims Progress                     1

Hello, CoE dissidents!

Thank you for reading my novel, banned by over 50 vicars in Hert-
fordshire alone.

DRAFT                         evil                          input

$ ~/groff-1.23.0/bin/nroff demo3.simp |cat -s
Bunyan                The Pilgrims Progress                     1

Hello, CoE dissidents!

Thank you for reading my novel, banned by over 50 vicars in Hert‐
fordshire alone.

DRAFT                      evilinput                 2026‐03‐15

$ ~/groff-1.24.0/bin/nroff demo3.simp |cat -s
troff:altsimp.tmac:5: error: character code 3 is not allowed in an identifier
troff:altsimp.tmac:9: error: character code 3 is not allowed in an identifier
troff:altsimp.tmac:13: error: character code 3 is not allowed in an identifier
troff:altsimp.tmac:14: error: character code 3 is not allowed in an identifier
troff:demo3.simp:2: error: character code 5 is not allowed in an identifier
troff:demo3.simp:3: error: character code 5 is not allowed in an identifier
troff:demo3.simp:4: error: character code 5 is not allowed in an identifier
troff:demo3.simp:5: error: character code 6 is not allowed in an identifier
troff:demo3.simp:6: error: character code 6 is not allowed in an identifier
troff:demo3.simp:7: error: character code 6 is not allowed in an identifier
\*(L                         \*(C                         \*(R

\*(L                         \*(C                         \*(R

Hello, CoE dissidents!

Thank you for reading my novel, banned by over 50 vicars in Hert‐
fordshire alone.


As expected, we observe inconsistent behavior between DWB 3.3 nroff on
the one hand and Plan 9 and Heirloom Doctools nroffs on the other.  (Yet
all use "driving tables".)

I submit that these matters are subtle land mines that are sure to
explode beneath the unwary--where "the unwary" comprises every person
who is not a world-class expert in *roff implementations.
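
A cheap defense is to audit sources for these characters before
formatting them.  The sketch below is a helper of my own invention, not
a groff tool: `find_c0` is a made-up name, and it assumes `grep` and
`cat -v` behave as they do on common Unix systems.

```shell
#!/bin/sh
# List the lines of the named files that contain ^B, ^C, ^E, ^F, or ^G,
# with line numbers, rendering the controls visibly via cat -v.
# (find_c0 is a hypothetical helper name.)
find_c0() {
  # Bracket expression holding the five raw control characters.
  pattern=$(printf '[\002\003\005\006\007]')
  # The C locale makes grep match bytes rather than multibyte characters.
  LC_ALL=C grep -n "$pattern" "$@" | cat -v
}
```

Usage would be along the lines of `find_c0 simp.tmac demo1.simp`; lines
that come back are the ones worth a second look.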

My argument to you is that the confusion created by availability of
those control characters in identifiers is not worth the ~10% increase
in name space size you obtain by using them.

If you need more than 9,025 distinct identifiers, use groff's extended
syntax of "long name" support, which gives you multiple orders of
magnitude more name space capacity--and great flexibility in selecting
your identifier names.  And thanks to the "input level"/"interpolation
depth" concept, you need not worry about collisions with delimiters
anyway.
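
For illustration, here is what a throwaway string with a long,
package-prefixed name might look like.  The name `mono*saved-indent` is
invented for this sketch and is not taken from Mono; the bracketed
`\*[...]` form is GNU troff syntax and will not work in AT&T-descended
troffs.

```shell
#!/bin/sh
# Emit a tiny roff document that defines, uses, and removes a string
# with a long name.  (The function and string names are hypothetical.)
long_name_demo() {
  cat <<'EOF'
.\" A temporary string whose name is unlikely to collide with a user's.
.ds mono*saved-indent 4m
.tm saved indent: \*[mono*saved-indent]
.rm mono*saved-indent
EOF
}

# Assumption: groff is installed.  -z discards formatted output, so only
# the "tm" message on the standard error stream is shown.
if command -v groff >/dev/null 2>&1; then
  long_name_demo | groff -z
else
  echo "groff not found; skipping" >&2
fi
```

The prefix plays the role that an obscure second character plays in a
two-character name, without resorting to invisible input.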

I'm not happy to have broken your application of groff.  How do you
measure the value of the advantage you claim to have achieved by
employing control characters in identifiers when voluntarily contorting
yourself into a straitjacket by using only two-character identifiers?

You suggest that consideration of DWB/Plan 9/Heirloom Doctools *roffs is
a distraction because your objection is founded on "*[] portability
between Groff versions[]*".  If that's true, why have you limited your
macro package's use of name space for compatibility with AT&T troff?

Regards,
Branden

[1] https://archive.fosdem.org/2018/schedule/event/code_parsing_posix_s_hell/

[2] A space U+0020 is unusable in identifiers in all troffs.  We count
    95 thanks to the 94 graphical code points of Basic Latin plus the
    DEL (delete) character, U+007F.
