URL: <https://savannah.gnu.org/bugs/?67757>
Summary: [{ja,zh}.tmac] incorrect character flags for \C'[CJKpostpunct]'?
Group: GNU roff
Submitter: gbranden
Submitted: Tue 02 Dec 2025 07:30:57 AM UTC
Category: Macro package - others/general
Severity: 3 - Normal
Item Group: Incorrect behaviour
Status: Need Info
Privacy: Public
Assigned to: cjwatson
Open/Closed: Open
Discussion Lock: Unlocked
Planned Release: None
_______________________________________________________
Follow-up Comments:
-------------------------------------------------------
Date: Tue 02 Dec 2025 07:30:57 AM UTC By: G. Branden Robinson <gbranden>
Looping in Colin Watson, Fumitoshi UKAI, and Darcy SHEN, who contributed these
files to _groff_ about 10 years ago.
From a [https://lists.gnu.org/archive/html/groff/2025-12/msg00003.html recent
mail of mine to the _groff_ mailing list]:
...
groff_diff(7):
.cflags n c1 c2 ...
Assign properties encoded by the number n to characters c1,
c2, and so on. Characters, whether ordinary, special, or
indexed, have certain associated properties. The first
argument is the sum of the desired flags and the remaining
arguments are the characters to be assigned those
properties. Spaces need not separate the cn arguments.
Any argument cn can be a character class defined with the
class request rather than an individual character.
The non‐negative integer n is the sum of any of the
following. Some combinations are nonsensical, such as “33”
(1 + 32).
The remaining values were implemented for East Asian
language support; those who use alphabetic scripts
exclusively can disregard them.
128 Prohibit a break before the character, but allow a
break after the character. This works only in
combination with values 256 and 512 and has no
effect otherwise. Initially, no characters have
this property.
256 Prohibit a break after the character, but allow a
break before the character. This works only in
combination with values 128 and 512 and has no
effect otherwise. Initially, no characters have
this property.
512 Allow a break before or after the character. This
works only in combination with values 128 and 256
and has no effect otherwise. Initially, no
characters have this property.
In contrast to values 2 and 4, the values 128, 256, and 512
work pairwise. If, for example, the left character has
value 512, and the right character 128, no break will be
automatically inserted between them. If we use value 6
instead for the left character, a break after the character
can’t be suppressed since the neighboring character on the
right doesn’t get examined.
...
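To make the pairwise wording above concrete, here is a toy model of the rule as the man page excerpt describes it. It is a sketch under assumptions, not groff's implementation: the constant names and the function `break_allowed_between` are hypothetical, and the model covers only what the quoted text says about values 128, 256, and 512.

```python
# Toy model of the documented pairwise rule for cflags 128/256/512.
# Names are hypothetical; this is not taken from the groff sources.
NO_BREAK_BEFORE = 128    # prohibit a break before the character
NO_BREAK_AFTER = 256     # prohibit a break after the character
INTER_CHAR_BREAK = 512   # allow a break before or after the character
CJK_MASK = NO_BREAK_BEFORE | NO_BREAK_AFTER | INTER_CHAR_BREAK


def break_allowed_between(left, right):
    """Decide whether a break may fall between two adjacent characters.

    Returns None when the pairwise rule does not apply, that is, when
    either neighbor carries none of the values 128, 256, or 512.
    """
    if not (left & CJK_MASK and right & CJK_MASK):
        return None
    if left & NO_BREAK_AFTER:
        return False
    if right & NO_BREAK_BEFORE:
        return False
    return True


# The man page's own example: left 512, right 128 gives no break.
print(break_allowed_between(512, 128))
# With value 6 on the left, the pairwise rule is not consulted at all.
print(break_allowed_between(6, 128))
```

Under this model a character flagged 266 behaves like one flagged 256 as far as the pairwise rule goes; the extra bits 8 and 2 would act through the older, non-pairwise flags.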
Here's the sum total of `cflags` matches in the
parts of our source that contain macro packages.
$ git grep -w cflags contrib tmac
...
tmac/ja.tmac:.cflags 128 \C'[CJKprepunct]'
tmac/ja.tmac:.cflags 266 \C'[CJKpostpunct]'
tmac/ja.tmac:.cflags 512 \C'[CJKnormal]'
...
tmac/zh.tmac:.cflags 128 \C'[CJKprepunct]'
tmac/zh.tmac:.cflags 266 \C'[CJKpostpunct]'
tmac/zh.tmac:.cflags 512 \C'[CJKnormal]'
(Do I spot a bug in the Chinese and Japanese "postpunct" flags here?[3])
...
[3] Shouldn't these CJK glyphs be "256" instead of "266"?
Factored into powers of 2, 266=256+8+2.
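The factoring can be checked mechanically. The sketch below splits a `cflags` argument into its powers of two and labels only the values quoted in this report; the helper name `decompose` and the table `QUOTED_FLAGS` are mine, for illustration.

```python
# Labels for the flag values quoted in this report; other bits stay unnamed.
QUOTED_FLAGS = {
    2: "allows breaks before the character",
    8: "overlaps copies of itself horizontally",
    128: "prohibit break before, allow break after",
    256: "prohibit break after, allow break before",
    512: "allow break before or after",
}


def decompose(n):
    """Return the powers of two that sum to n, largest first."""
    bits = [1 << i for i in range(n.bit_length()) if n & (1 << i)]
    return sorted(bits, reverse=True)


for value in (266, 256):
    parts = decompose(value)
    print(value, "=", " + ".join(str(p) for p in parts))
    for p in parts:
        print("   ", p, QUOTED_FLAGS.get(p, "(not quoted in this report)"))
```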
Let's use groff Git HEAD to ask the Chinese macro package what
characters are in this class.
$ ./build/test-groff -Tutf8 -mzh
.pchar \C'[CJKpostpunct]'
character class '[CJKpostpunct]'
defined at: file name: "/home/branden/src/GIT/groff/build/../tmac/zh.tmac", line number: 39
contains ranges: U+201C U+3008 U+300A U+300C U+300E U+3010 U+FF08
contains nested classes: (none)
Should these CJK post-punctuation glyphs have the flags 2 ("allows
breaks before the character") and 8 ("overlaps copies of itself
horizontally")? Really? For U+201C?
$ for c in 201C 3008 300A 300C 300E 3010 FF08; do unicode U+$c | head -n 3; done
U+201C LEFT DOUBLE QUOTATION MARK
UTF-8: e2 80 9c UTF-16BE: 201c Decimal: &#8220; Octal: \020034
“
U+3008 LEFT ANGLE BRACKET
UTF-8: e3 80 88 UTF-16BE: 3008 Decimal: &#12296; Octal: \030010
〈
U+300A LEFT DOUBLE ANGLE BRACKET
UTF-8: e3 80 8a UTF-16BE: 300a Decimal: &#12298; Octal: \030012
《
U+300C LEFT CORNER BRACKET
UTF-8: e3 80 8c UTF-16BE: 300c Decimal: &#12300; Octal: \030014
「
U+300E LEFT WHITE CORNER BRACKET
UTF-8: e3 80 8e UTF-16BE: 300e Decimal: &#12302; Octal: \030016
『
U+3010 LEFT BLACK LENTICULAR BRACKET
UTF-8: e3 80 90 UTF-16BE: 3010 Decimal: &#12304; Octal: \030020
【
U+FF08 FULLWIDTH LEFT PARENTHESIS
UTF-8: ef bc 88 UTF-16BE: ff08 Decimal: &#65288; Octal: \0177410
（
I see yet another bug here. How about you?
...
[6] After checking the commit history and the message of Werner's commit
that introduced the CJK-motivated character flags, I'm increasingly
confident that a typo snuck in.
commit 38e6049d0d1ad035e6a562c285dc6017530f5745
Author: Werner LEMBERG <[email protected]>
Date: Sat Dec 18 09:13:18 2010 +0000
Improve CJK support with new values for `.cflags'.
This patch introduces three new values to `.cflags':
don't break before character: 128
don't break after character: 256
allow inter-character break: 512
...
-.cflags 2 \C'[CJKprepunct]'
-.cflags 4 \C'[CJKpostpunct]'
-.cflags 66 \C'[CJKnormal]'
+.cflags 128 \C'[CJKprepunct]'
+.cflags 266 \C'[CJKpostpunct]'
+.cflags 512 \C'[CJKnormal]'
...
...because '66' and '266' look more similar in decimal than in binary.
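The decimal-versus-binary point can be shown by printing the three values side by side. This is only an illustration of the typo hypothesis, not evidence for it; the helper `differing_positions` is a name I made up.

```python
def differing_positions(a, b):
    """Count positions at which two equal-length strings differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))


# Old value (66), documented new value (256), and committed value (266).
for value in (66, 256, 266):
    print(f"{value:>3} = {value:010b}")

print("decimal digits differing, 266 vs 256:",
      differing_positions("266", "256"))
print("binary digits differing, 266 vs 256:",
      differing_positions(f"{266:010b}", f"{256:010b}"))
```

In decimal, 266 is a single digit away from 256 and shares its trailing "66" with the old value; in binary, 266 carries two extra set bits (8 and 2) that 256 lacks.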
It looks to me like these '266' values in the "ja.tmac" and "zh.tmac" files
should be '256'.
If I'm wrong, can someone explain how/why?
Thanks for any light you can shed.