[bug #67757] [{ja, zh}.tmac] incorrect character flags for \C'[CJKpostpunct]'?

G. Branden Robinson Mon, 01 Dec 2025 23:31:30 -0800

URL:
  <https://savannah.gnu.org/bugs/?67757>


                 Summary: [{ja,zh}.tmac] incorrect character flags for \C'[CJKpostpunct]'?
                   Group: GNU roff
               Submitter: gbranden
               Submitted: Tue 02 Dec 2025 07:30:57 AM UTC
                Category: Macro package - others/general
                Severity: 3 - Normal
              Item Group: Incorrect behaviour
                  Status: Need Info
                 Privacy: Public
             Assigned to: cjwatson
             Open/Closed: Open
         Discussion Lock: Unlocked
         Planned Release: None


    _______________________________________________________

Follow-up Comments:


-------------------------------------------------------
Date: Tue 02 Dec 2025 07:30:57 AM UTC By: G. Branden Robinson <gbranden>
Looping in Colin Watson, Fumitoshi UKAI, and Darcy SHEN, who contributed these
files to _groff_ about 10 years ago.

From a [https://lists.gnu.org/archive/html/groff/2025-12/msg00003.html recent
mail of mine to the _groff_ mailing list]:


...
groff_diff(7):
     .cflags n c1 c2 ...
             Assign properties encoded by the number n to characters c1,
             c2, and so on.  Characters, whether ordinary, special, or
             indexed, have certain associated properties.  The first
             argument is the sum of the desired flags and the remaining
             arguments are the characters to be assigned those
             properties.  Spaces need not separate the cn arguments.
             Any argument cn can be a character class defined with the
             class request rather than an individual character.

             The non‐negative integer n is the sum of any of the
             following.  Some combinations are nonsensical, such as “33”
             (1 + 32).
             The remaining values were implemented for East Asian
             language support; those who use alphabetic scripts
             exclusively can disregard them.

             128    Prohibit a break before the character, but allow a
                    break after the character.  This works only in
                    combination with values 256 and 512 and has no
                    effect otherwise.  Initially, no characters have
                    this property.

             256    Prohibit a break after the character, but allow a
                    break before the character.  This works only in
                    combination with values 128 and 512 and has no
                    effect otherwise.  Initially, no characters have
                    this property.

             512    Allow a break before or after the character.  This
                    works only in combination with values 128 and 256
                    and has no effect otherwise.  Initially, no
                    characters have this property.

             In contrast to values 2 and 4, the values 128, 256, and 512
             work pairwise.  If, for example, the left character has
             value 512, and the right character 128, no break will be
             automatically inserted between them.  If we use value 6
             instead for the left character, a break after the character
             can’t be suppressed since the neighboring character on the
             right doesn’t get examined.
...
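
One way to read the pairwise rule in the excerpt above is sketched here in
Python.  This is a toy model of the documented behaviour only, not groff's
implementation.  The function name `break_allowed` and the exact rule (both
neighbours must carry one of 128, 256, or 512, and a break is then refused
when the left character has 256 or the right character has 128) are
assumptions drawn from the man page text.

```python
# Values quoted from groff_diff(7); the constant names are illustrative only.
NO_BREAK_BEFORE = 128
NO_BREAK_AFTER = 256
INTER_CHAR_BREAK = 512
CJK_MASK = NO_BREAK_BEFORE | NO_BREAK_AFTER | INTER_CHAR_BREAK


def break_allowed(left, right):
    """Say whether the CJK flags alone would permit an automatic break
    between two adjacent characters with cflags values left and right."""
    # The values "work only in combination": both neighbours need a CJK flag.
    if not (left & CJK_MASK and right & CJK_MASK):
        return False
    # Left character prohibits a break after itself.
    if left & NO_BREAK_AFTER:
        return False
    # Right character prohibits a break before itself.
    if right & NO_BREAK_BEFORE:
        return False
    return True


# The man page's example: left has 512, right has 128, so no break.
print(break_allowed(512, 128))
# Two "normal" CJK characters: a break is permitted.
print(break_allowed(512, 512))
```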
Here's the sum total of `cflags` matches in the
parts of our source that contain macro packages.

$ git grep -w cflags contrib tmac
...
tmac/ja.tmac:.cflags 128 \C'[CJKprepunct]'
tmac/ja.tmac:.cflags 266 \C'[CJKpostpunct]'
tmac/ja.tmac:.cflags 512 \C'[CJKnormal]'
...
tmac/zh.tmac:.cflags 128 \C'[CJKprepunct]'
tmac/zh.tmac:.cflags 266 \C'[CJKpostpunct]'
tmac/zh.tmac:.cflags 512 \C'[CJKnormal]'

(Do I spot a bug in the Chinese and Japanese "postpunct" flags here?[3])
...
[3] Shouldn't these CJK glyphs be "256" instead of "266"?

Factored into powers of 2, 266=256+8+2.
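
The factoring can be checked mechanically.  Here is a minimal Python sketch,
assuming only the flag meanings quoted in this report; the table
`KNOWN_FLAGS` and the helper `decompose` are illustrative names, not part of
groff.

```python
# Flag meanings as quoted in this report; other cflags values are omitted.
KNOWN_FLAGS = {
    2: "allows breaks before the character",
    8: "overlaps copies of itself horizontally",
    128: "prohibit a break before the character",
    256: "prohibit a break after the character",
    512: "allow a break before or after the character",
}


def decompose(n):
    """Return the powers of 2 that sum to n, largest first."""
    return [1 << bit
            for bit in range(n.bit_length() - 1, -1, -1)
            if n & (1 << bit)]


for value in (266, 256):
    parts = decompose(value)
    print(value, "=", " + ".join(str(p) for p in parts))
    for p in parts:
        print("   ", p, KNOWN_FLAGS.get(p, "(not described in this report)"))
```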

Let's use groff Git HEAD to ask the Chinese macro package what
characters are in this class.

$ ./build/test-groff -Tutf8 -mzh
.pchar \C'[CJKpostpunct]'
character class '[CJKpostpunct]'
  defined at: file name: "/home/branden/src/GIT/groff/build/../tmac/zh.tmac", line number: 39
  contains ranges: U+201C U+3008 U+300A U+300C U+300E U+3010 U+FF08
  contains nested classes: (none)

Should these CJK post-punctuation glyphs have the flags 2 ("allows
breaks before the character") and 8 ("overlaps copies of itself
horizontally")?  Really?  For U+201C?

$ for c in 201C 3008 300A 300C 300E 3010 FF08; do unicode U+$c | head -n 3; done
U+201C LEFT DOUBLE QUOTATION MARK
UTF-8: e2 80 9c UTF-16BE: 201c Decimal: &#8220; Octal: \020034
“
U+3008 LEFT ANGLE BRACKET
UTF-8: e3 80 88 UTF-16BE: 3008 Decimal: &#12296; Octal: \030010
〈
U+300A LEFT DOUBLE ANGLE BRACKET
UTF-8: e3 80 8a UTF-16BE: 300a Decimal: &#12298; Octal: \030012
《
U+300C LEFT CORNER BRACKET
UTF-8: e3 80 8c UTF-16BE: 300c Decimal: &#12300; Octal: \030014
「
U+300E LEFT WHITE CORNER BRACKET
UTF-8: e3 80 8e UTF-16BE: 300e Decimal: &#12302; Octal: \030016
『
U+3010 LEFT BLACK LENTICULAR BRACKET
UTF-8: e3 80 90 UTF-16BE: 3010 Decimal: &#12304; Octal: \030020
【
U+FF08 FULLWIDTH LEFT PARENTHESIS
UTF-8: ef bc 88 UTF-16BE: ff08 Decimal: &#65288; Octal: \0177410
（

I see yet another bug here.  How about you?
...
[6] Checking commit history and Werner's commit message after
    introducing the CJK-motivated character flags, I'm increasingly
    confident that a typo snuck in.

commit 38e6049d0d1ad035e6a562c285dc6017530f5745
Author: Werner LEMBERG <[email protected]>
Date:   Sat Dec 18 09:13:18 2010 +0000

    Improve CJK support with new values for `.cflags'.

    This patch introduces three new values to `.cflags':

      don't break before character: 128
      don't break after character:  256
      allow inter-character break:  512

...
-.cflags 2 \C'[CJKprepunct]'
-.cflags 4 \C'[CJKpostpunct]'
-.cflags 66 \C'[CJKnormal]'
+.cflags 128 \C'[CJKprepunct]'
+.cflags 266 \C'[CJKpostpunct]'
+.cflags 512 \C'[CJKnormal]'
...

...because '66' and '266' look more similar in decimal than binary.


It looks to me like these '266' values in the "ja.tmac" and "zh.tmac" files
should be '256'.

If I'm wrong, can someone explain how/why?

Thanks for any light you can shed.







    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?67757>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
