Update of bug #66919 (group groff):
Status: None => Need Info
Assigned to: None => barx
_______________________________________________________
Follow-up Comment #24:
[comment #23:]
> Now we get to where my conceptual groundwork of comment #20 starts
> interacting with concrete examples.
Good, 'cause I have some too! :D
First let me start with this illustrator.
$ printf '.ll 3n\ndomain\n' | groff -a -Wbreak
<beginning of page>
do<hy>
main
GNU _troff_ will break at the same place any word that has a letter equivalent
to "o" in that position.
$ printf '.ll 3n\nd\[`o]main\n' | groff -a -Wbreak
<beginning of page>
d<`o><hy>
main
Recalling from our discussion in bug #66112, and my selection of your first
suggestion over your second, o-with-tilde-accent is _not_ equivalent to "o" in
English, so it shouldn't break...
$ printf '.ll 3n\nd\[~o]main\n' | groff -a -Wbreak
<beginning of page>
d<~o>main
...and indeed it doesn't.
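To make the idea of "equivalent letters" concrete, here is a minimal sketch. It is not _groff_ source; the glyph names, `HcodeTable`, `hyphenation_key()`, and `english_like_table()` are all made up for illustration. It assumes that hyphenation patterns are looked up by hyphenation code rather than by character, and (as a simplification) that a word containing a glyph with no hyphenation code is not a candidate at all.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical model, not groff source: each glyph is named by a string
// ("d", "`o", "~o", ...) and may have a hyphenation code assigned.
using HcodeTable = std::map<std::string, char>;

// Build the string that would be looked up in the hyphenation patterns.
// Returns an empty string if any glyph lacks a hyphenation code, since
// in this simplified model such a word is not a hyphenation candidate.
std::string hyphenation_key(const std::vector<std::string> &glyphs,
                            const HcodeTable &hcode)
{
  std::string key;
  for (const std::string &g : glyphs) {
    auto it = hcode.find(g);
    if (it == hcode.end() || it->second == 0)
      return "";
    key += it->second;
  }
  return key;
}

// Table mirroring the examples above: "`o" shares the code of "o",
// while "~o" has no code assigned.
HcodeTable english_like_table()
{
  return {{"d", 'd'}, {"o", 'o'}, {"m", 'm'}, {"a", 'a'},
          {"i", 'i'}, {"n", 'n'}, {"`o", 'o'}};
}
```

In this model, "domain" and "d\[`o]main" produce the same lookup key, so they break alike, while "d\[~o]main" produces no key.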
That established...
> That's a fair statement. But even though I'm running groff with its default
> startup (English) files, the behavior I'm talking about in this ticket is in
> the formatter, not in any startup files. What I'm talking about has nothing
> to do with the input _language_ and everything to do with input _encoding_.
I agree!
> (You'll notice that I'm not providing any sample input with any English
> words. The two words I've used, lanteronial, and lanterõnial--and then only
> to work around the lack of .pchar in older groffs--aren't part of any
> language that I'm aware of. So I'm talking about general formatter behavior,
> independent of any language setting.)
But you're not talking about _general_ formatter behavior, you're talking
about formatter behavior **after the "latin1.tmac" file is loaded**.
Observe.
$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-HEAD/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-1.23.0/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.4/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.3/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
Let's try it with the "raw" o with tilde accent character, Latin-1 245 decimal
(365 octal).
$ printf '.ll 3n\nd\365main\n' | ~/groff-HEAD/bin/troff -Ra -Wbreak
<beginning of page>
/home/branden/groff-HEAD/bin/troff:<standard input>:2: warning: character with input code 245 not defined
dmain
$ printf '.ll 3n\nd\365main\n' | ~/groff-1.23.0/bin/troff -Ra -Wbreak
<beginning of page>
/home/branden/groff-1.23.0/bin/troff:<standard input>:2: warning: character with input code 245 not defined
dmain
$ printf '.ll 3n\nd\365main\n' | ~/groff-1.22.4/bin/troff -Ra -Wbreak
<beginning of page>
/home/branden/groff-1.22.4/bin/troff: <standard input>:2: warning: can't find character with input code 245
dmain
$ printf '.ll 3n\nd\365main\n' | ~/groff-1.22.3/bin/troff -Ra -Wbreak
<beginning of page>
<standard input>:2: warning: can't find character with input code 245
dmain
This makes sense, because in all released versions of _groff_, the formatter
doesn't yet know, before loading startup files, whether it's going to be
operating in a Latin-1 or EBCDIC (code page 1047) environment. (Well,
technically it *can* know just by checking the character code of, say, "a",
but it stays as agnostic as it can and lets macro files do most of the
lifting.)
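For the curious, the "check the character code of 'a'" trick could look something like the sketch below. This is not _groff_ code; `guess_charset()` and `native_charset()` are hypothetical names. It relies only on the fact that "a" is 0x61 in ASCII and Latin-1 but 0x81 in EBCDIC.

```cpp
#include <string>

// Hypothetical helper, not groff code: classify a character set by the
// code it assigns to the letter 'a'.
std::string guess_charset(unsigned char code_of_a)
{
  if (code_of_a == 0x61)
    return "ASCII-compatible (such as Latin-1)";
  if (code_of_a == 0x81)
    return "EBCDIC";
  return "unknown";
}

// What the running program itself sees, according to its own
// execution character set.
std::string native_charset()
{
  return guess_charset(static_cast<unsigned char>('a'));
}
```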
Let's macro-load "latin1.tmac" in our examples and see if that changes
anything.
$ printf '.mso latin1.tmac\n.ll 3n\nd\365main\n' | ~/groff-HEAD/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso latin1.tmac\n.ll 3n\nd\365main\n' | ~/groff-1.23.0/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso latin1.tmac\n.ll 3n\nd\365main\n' | ~/groff-1.22.4/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso latin1.tmac\n.ll 3n\nd\365main\n' | ~/groff-1.22.3/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
The character code is now recognized, and translated on input (`trin`) to the
special character `~o`. But it still doesn't hyphenate.
For completeness, let's see if explicitly specifying the special character
changes behavior.
$ printf '.mso latin1.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-HEAD/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso latin1.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-1.23.0/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso latin1.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.4/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso latin1.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.3/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
Still no. Finally let's load "en.tmac", which didn't exist prior to 1.23.0.
$ printf '.mso en.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-HEAD/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso en.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-1.23.0/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso en.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.4/bin/troff -Ra -Wbreak
/home/branden/groff-1.22.4/bin/troff: <standard input>:1: warning: can't find macro file 'en.tmac'
<beginning of page>
d<~o>main
$ printf '.mso en.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.3/bin/troff -Ra -Wbreak
<standard input>:1: warning: can't find macro file `en.tmac'
<beginning of page>
d<~o>main
So here are a bunch more cases where formatter behavior doesn't change, all
using the same special character you've chosen.
$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-HEAD/bin/groff -a -Wbreak
<beginning of page>
d<~o>main
$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-1.23.0/bin/groff -a -Wbreak
<beginning of page>
d<~o>main
$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.4/bin/groff -a -Wbreak
<beginning of page>
d<~o>main
$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.3/bin/groff -a -Wbreak
<beginning of page>
d<~o>main
Why does "lanteronial" (not an English word) hyphenate differently from
"domain" (definitely an English word)? To answer that requires a source dive,
which is coming shortly. But first, I must ask:
So the hyphenation of a non-English word using a letter that doesn't exist in
the English alphabet has changed from _groff_ 1.23.0 to (what will become)
1.24.0. Is it fair to call that a regression?
I think you've identified a relatively dusty crevice in a corner case, and
that it arises solely due to a presumption that was being made in
`set_hyphenation_codes()` for many years.
So why did commit a52141ac46eef95dd1f85e4c2e0a336affa9bcc9 change things?
Let's look at the diff again.
diff --git a/src/roff/troff/input.cpp b/src/roff/troff/input.cpp
index cc7d9dd71..946b93570 100644
--- a/src/roff/troff/input.cpp
+++ b/src/roff/troff/input.cpp
@@ -7309,25 +7309,26 @@ static void set_hyphenation_codes()
error("cannot use the hyphenation code of a numeral");
break;
}
- unsigned char new_code = 0; // TODO: int
+ unsigned char new_code = 0;
charinfo *cisrc = tok.get_char();
- if (csrc != 0)
- new_code = csrc;
- else {
+ if (cisrc != 0 /* nullptr */)
+ // Common case: assign destination character the hyphenation code
+ // of the source character.
+ new_code = cisrc->get_hyphenation_code();
+ if (0 == csrc) {
if (0 /* nullptr */ == cisrc) {
error("expected ordinary or special character, got %1",
tok.description());
break;
}
- // source character is special
- if (0 == cisrc->get_hyphenation_code()) {
- error("second member of hyphenation code pair must be an"
- " ordinary character, or a special character already"
- " assigned a hyphenation code");
- break;
- }
new_code = cisrc->get_hyphenation_code();
}
+ else {
+ // If assigning a ordinary character's hyphenation code to itself,
+ // use its character code point as the value.
+ if (csrc == cdst)
+ new_code = tok.ch();
+ }
cidst->set_hyphenation_code(new_code);
if (cidst->get_translation()
&& cidst->get_translation()->get_translation_input())
...and at your test case (the UTF-8 version for readability in Savannah,
**not** bug-reproducibility).
$ cat EXPERIMENTS/lanteronial-utf8.groff
.ll 1n
lanteronial
lanter\[~o]nial
.hcode \[~o] õ
lanter\[~o]nial
You've only got the one `hcode` invocation, so that's good.
What was its path through the old code?
https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?id=89623d044a207c1321bdf106d5f8d5d9e59b7ca1#n7278
Well, we have a bunch of validity checking/error handling first.
Eventually, if we've got two (mostly) valid arguments, we end up on line
7312.
If `csrc` is not zero, the source character is "ordinary". (If it _is_ zero,
it could be anything, like a horizontal motion escape sequence. But in valid
cases, if it's zero it's a special or indexed character.) And so that branch
should be taken for the "lanteronial" file. `new_code` becomes the value of
`csrc` (7315) and we skip to 7331, where the hyphenation code in the
destination character's `charinfo` is set to that value.
We then worry about whether the destination character is "translated" (which I
**think** refers to `tr` translation but I haven't ruled out `trin` or `trnt`
translations instead, because it seems that no good item of terminology should
be permitted to apply to only one concept in a program). If it is, that new
code is immediately superseded by that of its translation (7334).
Then the function ends.
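Boiled down, the old decision can be modeled roughly as follows. This is a simplified sketch, not the real function: `CharInfo` is a stand-in for `charinfo` with only the hyphenation code modeled, `old_new_code()` is a hypothetical name, and the argument validation and the translation step are left out.

```cpp
// Simplified stand-in for groff's charinfo; only the hyphenation code
// is modeled.
struct CharInfo {
  unsigned char hcode = 0;
};

// Model of the pre-commit decision.  `csrc` is the source character's
// code (0 if it is not an ordinary character); `cisrc` is its charinfo.
// Returns false where the real function reported an error.
bool old_new_code(unsigned char csrc, const CharInfo &cisrc,
                  unsigned char &new_code)
{
  if (csrc != 0) {
    // Ordinary character: presume its character code is usable as a
    // hyphenation code, whatever code it has actually been assigned.
    new_code = csrc;
    return true;
  }
  if (cisrc.hcode == 0)
    return false;  // special character with no code assigned yet
  new_code = cisrc.hcode;
  return true;
}
```

With an ordinary source character of 245 and no hyphenation code assigned to it, this model still yields a `new_code` of 245.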
Okay, what about _after_ the "bad commit"?
https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?id=a52141ac46eef95dd1f85e4c2e0a336affa9bcc9#n7278
We start off again at line 7312.
We don't make a decision about `csrc` right away. Instead we gather the
source character's hyphenation code immediately, if it has one (7314-7317),
then if the source character is special, we proceed as before (7319-7324).
But in this case, the source character is ordinary, so we check to see if the
character is being assigned to itself, and if so apply this "reflexive case"
(7329-7330). But we won't take that branch either because the test on line
7329 will fail: `csrc` is 245 decimal, but `cdst` is 0 because it's a special
character. We then hit line 7332 where we assign `new_code` to `cidst`. But
remember line 7317. `cisrc`'s hyphenation code would be zero, because
that's the value it has when the formatter starts up ("troff -R"), and neither
"en.tmac" nor "latin1.tmac" ever assigned it a hyphenation code.
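The new decision, in the same simplified spirit, might be modeled like this. Again this is a sketch rather than the real code: `CharInfo` and `new_new_code()` are hypothetical names, the source `charinfo` is assumed to be valid (the real function errors out otherwise), and the translation step is omitted.

```cpp
// Simplified stand-in for groff's charinfo; only the hyphenation code
// is modeled.
struct CharInfo {
  unsigned char hcode = 0;
};

// Model of the post-commit decision.  `csrc` and `cdst` are the source
// and destination character codes (0 if not an ordinary character);
// `cisrc` is the source character's charinfo.
unsigned char new_new_code(unsigned char csrc, unsigned char cdst,
                           const CharInfo &cisrc)
{
  // Common case: use whatever hyphenation code the source character
  // already has, which may be zero.
  unsigned char new_code = cisrc.hcode;
  // Reflexive case: an ordinary character assigned to itself gets its
  // own character code.
  if (csrc != 0 && csrc == cdst)
    new_code = csrc;
  return new_code;
}
```

For the test case, `csrc` is 245, `cdst` is 0, and the source has no code assigned, so this model yields 0 where the old logic yielded 245.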
The bottom line is that there _is_ a logic change. Before "bad commit",
`new_code` got populated presumptively with the character code of the source
character, **if the character was ordinary**.
https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?id=89623d044a207c1321bdf106d5f8d5d9e59b7ca1#n7315
In the new logic, it doesn't. It didn't occur to me that that assumption was
warranted. The character code might not be meaningful as a hyphenation code
in the language. `set_hyphenation_codes()` has, for many years, been
aggressively assuming that it was, if you had the audacity to use an ordinary
character as the source character (second argument) in an `hcode` request.
I'd say the "bad commit" is a bug fix.
So we might retitle this ticket "[troff] behavior change in some .hcode calls
when an ordinary character is the second argument", and you can guess what my
proposed resolution is.
But I want to hear your take.
_______________________________________________________
Reply to this item at:
<https://savannah.gnu.org/bugs/?66919>
_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
