Hi,

Jules Bertholet wrote:
> Makes two changes to the set of characters considered nonspacing:
> 
> - Makes `Prepended_Concatenation_Mark`s no longer nonspacing.
>   This matches the Unicode spec (which specifies these as taking up space
>   in front of the characters they modify), and also aligns with
>   glibc `wcwidth()`.
> - Makes `Default_Ignorable_Code_Point`s other than U+115F HANGUL CHOSEONG FILLER
>   nonspacing. Unicode specifies (https://www.unicode.org/faq/unsup_char.html#3)
>   that these "should be rendered as completely invisible (and non advancing,
>   i.e. “zero width”), if not explicitly supported in rendering." U+115F is
>   exempted because it is expected to be combined with other jamo to form a
>   width-2 Hangul syllable block.

Thank you for the suggestions.

Regarding the Prepended_Concatenation_Mark characters, I agree, and I am making
the changes; see below.

Regarding the Default_Ignorable_Code_Point characters: Making all of them
non-spacing would assign width 0 to the characters
  U+115F HANGUL CHOSEONG FILLER
  U+3164 HANGUL FILLER
  U+FFA0 HALFWIDTH HANGUL FILLER
But this does not make sense to me:

  * You exclude U+115F from your consideration, but the justification is weak:
    Hangul composition of 3 characters in the range U+11xx creates a Hangul
    syllable, and widths don't add up: 1 + 1 + 1 != 2 in the general case.

  * The name of U+FFA0, "HALFWIDTH HANGUL FILLER", suggests that
    "HANGUL FILLER" traditionally has width 2 and "HALFWIDTH HANGUL FILLER"
    traditionally has width 1. If both had width 0, there would not be a need
    for the HALFWIDTH one.

  * glibc's wcwidth() function returns nonzero for these characters:
================================================================================
#define _GNU_SOURCE 1
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main ()
{
  setlocale(LC_ALL,"");
  printf ("%d %d %d\n", wcwidth(0x115F), wcwidth(0x3164), wcwidth(0xFFA0));
  printf ("%d %d %d %d %d %d %d %d %d %d %d %d %d\n",
          wcwidth(0x0600), wcwidth(0x0601),  wcwidth(0x0602), wcwidth(0x0603),
          wcwidth(0x0604), wcwidth(0x0605),  wcwidth(0x06DD), wcwidth(0x070F),
          wcwidth(0x0890), wcwidth(0x0891),  wcwidth(0x08E2), wcwidth(0x110BD),
          wcwidth(0x110CD));
}
================================================================================
    produces:
    $ LC_ALL=en_US.UTF-8 ./a.out 
    2 2 1
    1 1 1 1 1 1 1 1 1 1 1 1 1

  * Your argument based on an FAQ is weak, since FAQs tend to simplify
    things so that they become easier to state or to understand.

Bruno


2024-02-12  Bruno Haible  <br...@clisp.org>

        uniwidth/width: Assign width 1 to prepended concatenation marks.
        Suggested by Jules Bertholet <julesbertho...@quoi.xyz> in
        <https://lists.gnu.org/archive/html/bug-gnulib/2024-02/msg00093.html>.
        * lib/gen-uni-tables.c (is_nonspacing): For characters with property
        Prepended_Concatenation_Mark, return false instead of true.
        * lib/uniwidth/width0.h: Regenerated. This assigns width 1 to the
        characters U+0600..U+0605, U+06DD, U+070F, U+0890..U+0891, U+08E2,
        U+110BD, U+110CD.
        * modules/uniwidth/width (configure.ac): Bump required libunistring
        version.
        * modules/uniwidth/u8-width (configure.ac): Likewise.
        * modules/uniwidth/u8-strwidth (configure.ac): Likewise.
        * modules/uniwidth/u16-width (configure.ac): Likewise.
        * modules/uniwidth/u16-strwidth (configure.ac): Likewise.
        * modules/uniwidth/u32-width (configure.ac): Likewise.
        * modules/uniwidth/u32-strwidth (configure.ac): Likewise.

diff --git a/lib/gen-uni-tables.c b/lib/gen-uni-tables.c
index bc228105b4..c73ce06d64 100644
--- a/lib/gen-uni-tables.c
+++ b/lib/gen-uni-tables.c
@@ -6669,8 +6669,13 @@ fill_width (const char *width_filename)
 /* The non-spacing attribute table consists of:
    * Non-spacing characters; generated from PropList.txt or
      "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
-   * Format control characters; generated from
-     "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
+   * Format control characters, except for characters with property
+     Prepended_Concatenation_Mark; generated from
+     "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt" and from
+     "grep Prepended_Concatenation_Mark PropList.txt".
+     Rationale for the Prepended_Concatenation_Mark exception:
+     The Unicode standard says "Unlike most other format characters,
+     however, they should be rendered with a visible glyph".
    * Zero width characters; generated from
      "grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt"
    * Hangul Jamo characters that have conjoining behaviour:
@@ -6695,7 +6700,9 @@ is_nonspacing (unsigned int ch)
 {
   return (unicode_attributes[ch].name != NULL
           && (get_bidi_category (ch) == UC_BIDI_NSM
-              || is_category_Cc (ch) || is_category_Cf (ch)
+              || is_category_Cc (ch)
+              || (is_category_Cf (ch)
+                  && !is_property_prepended_concatenation_mark (ch))
               || strncmp (unicode_attributes[ch].name, "ZERO WIDTH ", 11) == 0
               || (ch >= 0x1160 && ch <= 0x11A7) || (ch >= 0xD7B0 && ch <= 0xD7C6) /* jungseong */
               || (ch >= 0x11A8 && ch <= 0x11FF) || (ch >= 0xD7CB && ch <= 0xD7FB) /* jongseong */
diff --git a/lib/uniwidth/width0.h b/lib/uniwidth/width0.h
index 77954eb4d8..6cc35536ad 100644
--- a/lib/uniwidth/width0.h
+++ b/lib/uniwidth/width0.h
@@ -46,19 +46,19 @@ static const unsigned char nonspacing_table_data[48*64] = {
   0x00, 0x00, 0xfe, 0xff, 0xff, 0xff, 0xff, 0xbf, /* 0x0580-0x05bf */
   0xb6, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x05c0-0x05ff */
   /* 0x0600-0x07ff */
-  0x3f, 0x00, 0xff, 0x17, 0x00, 0x00, 0x00, 0x00, /* 0x0600-0x063f */
+  0x00, 0x00, 0xff, 0x17, 0x00, 0x00, 0x00, 0x00, /* 0x0600-0x063f */
   0x00, 0xf8, 0xff, 0xff, 0x00, 0x00, 0x01, 0x00, /* 0x0640-0x067f */
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x0680-0x06bf */
-  0x00, 0x00, 0xc0, 0xbf, 0x9f, 0x3d, 0x00, 0x00, /* 0x06c0-0x06ff */
-  0x00, 0x80, 0x02, 0x00, 0x00, 0x00, 0xff, 0xff, /* 0x0700-0x073f */
+  0x00, 0x00, 0xc0, 0x9f, 0x9f, 0x3d, 0x00, 0x00, /* 0x06c0-0x06ff */
+  0x00, 0x00, 0x02, 0x00, 0x00, 0x00, 0xff, 0xff, /* 0x0700-0x073f */
   0xff, 0x07, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x0740-0x077f */
   0x00, 0x00, 0x00, 0x00, 0xc0, 0xff, 0x01, 0x00, /* 0x0780-0x07bf */
   0x00, 0x00, 0x00, 0x00, 0x00, 0xf8, 0x0f, 0x20, /* 0x07c0-0x07ff */
   /* 0x0800-0x09ff */
   0x00, 0x00, 0xc0, 0xfb, 0xef, 0x3e, 0x00, 0x00, /* 0x0800-0x083f */
   0x00, 0x00, 0x00, 0x0e, 0x00, 0x00, 0x00, 0x00, /* 0x0840-0x087f */
-  0x00, 0x00, 0x03, 0xff, 0x00, 0x00, 0x00, 0x00, /* 0x0880-0x08bf */
-  0x00, 0xfc, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x08c0-0x08ff */
+  0x00, 0x00, 0x00, 0xff, 0x00, 0x00, 0x00, 0x00, /* 0x0880-0x08bf */
+  0x00, 0xfc, 0xff, 0xff, 0xfb, 0xff, 0xff, 0xff, /* 0x08c0-0x08ff */
   0x07, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x14, /* 0x0900-0x093f */
   0xfe, 0x21, 0xfe, 0x00, 0x0c, 0x00, 0x00, 0x00, /* 0x0940-0x097f */
   0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x10, /* 0x0980-0x09bf */
@@ -273,8 +273,8 @@ static const unsigned char nonspacing_table_data[48*64] = {
   /* 0x11000-0x111ff */
   0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, /* 0x11000-0x1103f */
   0x7f, 0x00, 0x00, 0x00, 0x00, 0x00, 0x19, 0x80, /* 0x11040-0x1107f */
-  0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x78, 0x26, /* 0x11080-0x110bf */
-  0x04, 0x20, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x110c0-0x110ff */
+  0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x78, 0x06, /* 0x11080-0x110bf */
+  0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x110c0-0x110ff */
   0x07, 0x00, 0x00, 0x00, 0x80, 0xef, 0x1f, 0x00, /* 0x11100-0x1113f */
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00, /* 0x11140-0x1117f */
   0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0xc0, 0x7f, /* 0x11180-0x111bf */
diff --git a/modules/uniwidth/u16-strwidth b/modules/uniwidth/u16-strwidth
index 1a4ea001e9..f7ceb9272c 100644
--- a/modules/uniwidth/u16-strwidth
+++ b/modules/uniwidth/u16-strwidth
@@ -10,7 +10,7 @@ uniwidth/u16-width
 unistr/u16-strlen
 
 configure.ac:
-gl_LIBUNISTRING_MODULE([1.1], [uniwidth/u16-strwidth])
+gl_LIBUNISTRING_MODULE([1.2], [uniwidth/u16-strwidth])
 
 Makefile.am:
 if LIBUNISTRING_COMPILE_UNIWIDTH_U16_STRWIDTH
diff --git a/modules/uniwidth/u16-width b/modules/uniwidth/u16-width
index 161898c93e..dfd08e3fec 100644
--- a/modules/uniwidth/u16-width
+++ b/modules/uniwidth/u16-width
@@ -10,7 +10,7 @@ uniwidth/width
 unistr/u16-mbtouc-unsafe
 
 configure.ac:
-gl_LIBUNISTRING_MODULE([1.1], [uniwidth/u16-width])
+gl_LIBUNISTRING_MODULE([1.2], [uniwidth/u16-width])
 
 Makefile.am:
 if LIBUNISTRING_COMPILE_UNIWIDTH_U16_WIDTH
diff --git a/modules/uniwidth/u32-strwidth b/modules/uniwidth/u32-strwidth
index 9c36df422a..a13836f1cc 100644
--- a/modules/uniwidth/u32-strwidth
+++ b/modules/uniwidth/u32-strwidth
@@ -10,7 +10,7 @@ uniwidth/u32-width
 unistr/u32-strlen
 
 configure.ac:
-gl_LIBUNISTRING_MODULE([1.1], [uniwidth/u32-strwidth])
+gl_LIBUNISTRING_MODULE([1.2], [uniwidth/u32-strwidth])
 
 Makefile.am:
 if LIBUNISTRING_COMPILE_UNIWIDTH_U32_STRWIDTH
diff --git a/modules/uniwidth/u32-width b/modules/uniwidth/u32-width
index 34f25fea76..d90d9b9a20 100644
--- a/modules/uniwidth/u32-width
+++ b/modules/uniwidth/u32-width
@@ -9,7 +9,7 @@ uniwidth/base
 uniwidth/width
 
 configure.ac:
-gl_LIBUNISTRING_MODULE([1.1], [uniwidth/u32-width])
+gl_LIBUNISTRING_MODULE([1.2], [uniwidth/u32-width])
 
 Makefile.am:
 if LIBUNISTRING_COMPILE_UNIWIDTH_U32_WIDTH
diff --git a/modules/uniwidth/u8-strwidth b/modules/uniwidth/u8-strwidth
index 303ee5c010..26857ae4b0 100644
--- a/modules/uniwidth/u8-strwidth
+++ b/modules/uniwidth/u8-strwidth
@@ -10,7 +10,7 @@ uniwidth/u8-width
 unistr/u8-strlen
 
 configure.ac:
-gl_LIBUNISTRING_MODULE([1.1], [uniwidth/u8-strwidth])
+gl_LIBUNISTRING_MODULE([1.2], [uniwidth/u8-strwidth])
 
 Makefile.am:
 if LIBUNISTRING_COMPILE_UNIWIDTH_U8_STRWIDTH
diff --git a/modules/uniwidth/u8-width b/modules/uniwidth/u8-width
index 36df3afe3f..46f5e4e014 100644
--- a/modules/uniwidth/u8-width
+++ b/modules/uniwidth/u8-width
@@ -10,7 +10,7 @@ uniwidth/width
 unistr/u8-mbtouc-unsafe
 
 configure.ac:
-gl_LIBUNISTRING_MODULE([1.1], [uniwidth/u8-width])
+gl_LIBUNISTRING_MODULE([1.2], [uniwidth/u8-width])
 
 Makefile.am:
 if LIBUNISTRING_COMPILE_UNIWIDTH_U8_WIDTH
diff --git a/modules/uniwidth/width b/modules/uniwidth/width
index 30973445b0..dc028317a1 100644
--- a/modules/uniwidth/width
+++ b/modules/uniwidth/width
@@ -13,7 +13,7 @@ uniwidth/base
 streq
 
 configure.ac:
-gl_LIBUNISTRING_MODULE([1.1], [uniwidth/width])
+gl_LIBUNISTRING_MODULE([1.2], [uniwidth/width])
 
 Makefile.am:
 if LIBUNISTRING_COMPILE_UNIWIDTH_WIDTH



