Feedback wanted: substrings, invalid indices, and incompatibility mode

G. Branden Robinson Sat, 21 Mar 2026 11:31:27 -0700

Hi folks,

I wanted to solicit opinions on what to do in a weird case.


Historically, when GNU troff's `substring` request is given start and
end indices that are _both_ out of the range of the string being
sampled, the result is an empty string.

$ ~/groff-1.24.1/bin/groff -ww
.ds s abc
.pm s
{"name": "s", "file name": "<standard input>", "starting line number": 1, 
"length": 3, "contents": "abc", "node list": [ ]}
.substring s 10 20
troff:<standard input>:3: warning: start and end index of substring out of range
.pm s
{"name": "s", "file name": "<standard input>", "starting line number": 1, 
"length": 0}

This seems okay.  Throwing an error (not a warning) and leaving the
string unmodified seems like just as good an approach to me; I have no
strong preference between the two.

Compatibility mode makes things a little trickier.

Review groff(7):
     .as1 ident
                As “.as ident”.
     .as1 str contents
                As “as”, with compatibility mode disabled when the
                appendment to string str is interpreted.
     .ds1 ident
     .ds1 ident contents
                As ds, with compatibility mode disabled when the string
                is interpreted.

In GNU troff, strings (and macros) can have portions that are
interpreted in AT&T compatibility mode and portions that are not.  How
does this work?  The program embeds internally generated tokens that
"push" compatibility or groff mode, and "pop" either one.
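To make the bracketing concrete, here is a minimal Python sketch.  It is
a toy model, not GNU troff's implementation: the function names are
hypothetical, and the only token values used are the two that appear in
the `pm` output in this message (\u0089 opening a groff-mode portion and
\u008B closing it).

```python
# Toy model of how ds1/as1 bracket their contents with mode tokens.
# The names below are illustrative, not taken from the groff sources.
PUSH_GROFF_MODE = "\u0089"  # opens a portion read with compatibility mode off
POP_MODE = "\u008b"         # closes that portion


def ds(contents):
    """Model `.ds`: store the contents unchanged."""
    return contents


def ds1(contents):
    """Model `.ds1`: bracket the whole string with push/pop tokens."""
    return PUSH_GROFF_MODE + contents + POP_MODE


def as1(existing, contents):
    """Model `.as1`: append a bracketed portion to an existing string."""
    return existing + PUSH_GROFF_MODE + contents + POP_MODE


whole = ds1("abc")
mixed = as1(ds("abc"), "def")
print(len(whole), ascii(whole))
print(len(mixed), ascii(mixed))
```

The stored lengths in this model (5 and 8) include the tokens, which is
also how `pm` reports "length" in the transcripts in this message.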

https://cgit.git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.h?h=1.24.1#n57

$ ~/groff-1.24.1/bin/groff -ww
.ds1 s abc
.pm s
{"name": "s", "file name": "<standard input>", "starting line number": 1, 
"length": 5, "contents": "\u0089abc\u008B", "node list": [ ]}
.substring s 10 20
troff:<standard input>:3: warning: start and end index of substring out of range
.pm s
{"name": "s", "file name": "<standard input>", "starting line number": 1, 
"length": 0}

You might see the problem.  It's all well and good to say "fine, if the
slice the user wants taken of the string lies wholly outside of that
string's contents, just replace that string with an empty one."

_However_, we've _also_ now _converted_ a string that would have been
interpreted in AT&T compatibility mode into one that will not.

That's not documented.  It also doesn't seem intuitive.

It's also not clear to me what would be intuitive in nutty situations
like the following.

$ ~/groff-1.24.1/bin/groff -ww
.ds s abc
.as1 s def
.pm s
{"name": "s", "file name": "<standard input>", "starting line number": 1, 
"length": 8, "contents": "abc\u0089def\u008B", "node list": [ ]}
.substring s 4 10
troff:<standard input>:4: warning: end index of substring out of range, set to 
string length
.pm s
{"name": "s", "file name": "<standard input>", "starting line number": 4, 
"length": 2, "contents": "ef", "node list": [ ]}

We started by defining a "groff mode" string with contents "abc".

We then _appended_ to that string the letters "def"--to be interpreted
in AT&T compatibility mode.  In the foregoing example, that makes no
difference, but it will if you embed an escaped string interpolation
inside it.  So let me illustrate that.

$ cat ATTIC/string-subsetting.groff
.ds foo FOO
.ds t \\*[foo]
.ds s abc
.as1 s \\*[t]
.cp 1
.pm s
.tm \*s
.do substring s 3 10
.pm s
.tm \*s
$ ~/groff-1.24.1/bin/groff ATTIC/string-subsetting.groff
{"name": "s", "file name": "ATTIC\/string-subsetting.groff", "starting line 
number": 3, "length": 10, "contents": "abc\u0089\\*[t]\u008B", "node list": [ ]}
abcFOO
{"name": "s", "file name": "ATTIC\/string-subsetting.groff", "starting line 
number": 8, "length": 5, "contents": "\\*[t]", "node list": [ ]}
t]

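Notice how the last line says "t]" instead of "FOO".  In compatibility
mode, `\*[` is read as an interpolation of a string named "[", which is
undefined here, so only the literal "t]" remains.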

So the string that was appended to with `as1`:

     .as1 str contents
                As “as”, with compatibility mode disabled when the
                appendment to string str is interpreted.

...and then got subsetted, _lost_ its ability to interpolate itself with
compatibility mode disabled.
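The behavior seen in the transcripts can be summarized with a small
Python model.  This is a sketch inferred from the `pm` output shown in
this message, not a reading of the troff sources: it assumes
non-negative indices with start <= end, and it knows only the two token
values visible above.

```python
# Toy model of the observed `.substring` behavior: mode tokens are not
# counted when indexing, and they are not restored in the result.
MODE_TOKENS = {"\u0089", "\u008b"}  # the two values seen in the pm output


def substring(stored, start, end):
    """Return the slice start..end (inclusive) of the visible characters."""
    visible = [c for c in stored if c not in MODE_TOKENS]
    last = len(visible) - 1
    if start > last and end > last:
        return ""             # both indices out of range: empty string
    end = min(end, last)      # only the end out of range: clamp it
    return "".join(visible[start:end + 1])


print(repr(substring("\u0089abc\u008b", 10, 20)))
print(repr(substring("abc\u0089def\u008b", 4, 10)))
print(repr(substring("abc\u0089\\*[t]\u008b", 3, 10)))
```

In all three cases the result carries no tokens at all, so whatever mode
the selected characters were bracketed with is gone.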

What should be done about this?

Should we reject `substring` operations that cross compatibility mode
token boundaries?

Or just an odd number of such boundaries?

Should we reject out of range indices as errors?

Should we expose the compatibility mode tokens to surgery, counting them
as part of the length and expecting GNU troff users to deal with them?

All of these would be compatibility breaks.

But the existing behavior is undocumented and seems hard to justify.
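As an illustration of the "expose the tokens to surgery" option, here is
a hedged Python sketch.  Both functions are hypothetical and exist only
for this example; nothing like them is claimed to be in GNU troff.  It
shows what users would have to watch for if indices counted the tokens.

```python
# Hypothetical variant in which substring indices count mode tokens too.
PUSH, POP = "\u0089", "\u008b"  # the two values seen in the pm output


def raw_substring(stored, start, end):
    """Slice start..end (inclusive) of the stored form, tokens included."""
    end = min(end, len(stored) - 1)
    return stored[start:end + 1]


def is_balanced(stored):
    """True if every pop closes an earlier push and no push stays open."""
    depth = 0
    for c in stored:
        if c == PUSH:
            depth += 1
        elif c == POP:
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


s = "abc" + PUSH + "def" + POP
piece = raw_substring(s, 4, 10)
print(ascii(piece), is_balanced(piece))
```

With the same indices as the `.substring s 4 10` example, this variant
keeps "def" plus the closing token but drops the opening one, so the
user would be left holding an unbalanced string.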

Thoughts?

Regards,
Branden
