Hi,

Jules Bertholet wrote:
> Makes two changes to the set of characters considered nonspacing:
>
> - Makes `Prepended_Concatenation_Mark`s no longer nonspacing.
>   This matches the Unicode spec (which specifies these as taking up space
>   in front of the characters they modify), and also aligns with
>   glibc `wcwidth()`.
> - Makes `Default_Ignorable_Code_Point`s other than U+115F HANGUL CHOSEONG
>   FILLER nonspacing. Unicode specifies
>   (https://www.unicode.org/faq/unsup_char.html#3)
>   that these "should be rendered as completely invisible (and non advancing,
>   i.e. “zero width”), if not explicitly supported in rendering." U+115F is
>   exempted because it is expected to be combined with other jamo to form a
>   width-2 Hangul syllable block.

Thank you for the suggestions.

Regarding the Prepended_Concatenation_Mark characters, I agree, and I am
making the changes; see below.

Regarding the Default_Ignorable_Code_Point characters: Making all of them
non-spacing would assign width 0 to the characters
  U+115F HANGUL CHOSEONG FILLER
  U+3164 HANGUL FILLER
  U+FFA0 HALFWIDTH HANGUL FILLER
But this does not make sense to me:

  * You exclude U+115F from your consideration, but the justification is
    weak: Hangul composition of 3 characters in the range U+11xx creates
    a Hangul syllable, and widths don't add up: 1 + 1 + 1 != 2 in the
    general case.

  * The name of U+FFA0 is "HALFWIDTH HANGUL FILLER", which suggests that
    "HANGUL FILLER" traditionally has width 2 and "HALFWIDTH HANGUL FILLER"
    traditionally has width 1. If both had width 0, there would be no need
    for the HALFWIDTH one.

  * glibc's wcwidth() function returns nonzero for these characters:

================================================================================
#define _GNU_SOURCE 1
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main (void)
{
  /* wcwidth() depends on the character set of the current locale.  */
  setlocale (LC_ALL, "");
  /* The three Hangul filler characters.  */
  printf ("%d %d %d\n", wcwidth (0x115F), wcwidth (0x3164), wcwidth (0xFFA0));
  /* The 13 Prepended_Concatenation_Mark characters.  */
  printf ("%d %d %d %d %d %d %d %d %d %d %d %d %d\n",
          wcwidth (0x0600), wcwidth (0x0601), wcwidth (0x0602),
          wcwidth (0x0603), wcwidth (0x0604), wcwidth (0x0605),
          wcwidth (0x06DD), wcwidth (0x070F), wcwidth (0x0890),
          wcwidth (0x0891), wcwidth (0x08E2), wcwidth (0x110BD),
          wcwidth (0x110CD));
  return 0;
}
================================================================================

    produces:

    $ LC_ALL=en_US.UTF-8 ./a.out
    2 2 1
    1 1 1 1 1 1 1 1 1 1 1 1 1

  * Your argument from an FAQ is weak, since FAQs tend to simplify things
    so that they become easier to state or to understand.

Bruno


2024-02-12  Bruno Haible  <br...@clisp.org>

	uniwidth/width: Assign width 1 to prepended concatenation marks.
	Suggested by Jules Bertholet <julesbertho...@quoi.xyz> in
	<https://lists.gnu.org/archive/html/bug-gnulib/2024-02/msg00093.html>.
	* lib/gen-uni-tables.c (is_nonspacing): For characters with property
	Prepended_Concatenation_Mark, return false instead of true.
	* lib/uniwidth/width0.h: Regenerated. This assigns width 1 to the
	characters U+0600..U+0605, U+06DD, U+070F, U+0890..U+0891, U+08E2,
	U+110BD, U+110CD.
	* modules/uniwidth/width (configure.ac): Bump required libunistring
	version.
	* modules/uniwidth/u8-width (configure.ac): Likewise.
	* modules/uniwidth/u8-strwidth (configure.ac): Likewise.
	* modules/uniwidth/u16-width (configure.ac): Likewise.
	* modules/uniwidth/u16-strwidth (configure.ac): Likewise.
	* modules/uniwidth/u32-width (configure.ac): Likewise.
	* modules/uniwidth/u32-strwidth (configure.ac): Likewise.

diff --git a/lib/gen-uni-tables.c b/lib/gen-uni-tables.c
index bc228105b4..c73ce06d64 100644
--- a/lib/gen-uni-tables.c
+++ b/lib/gen-uni-tables.c
@@ -6669,8 +6669,13 @@ fill_width (const char *width_filename)
 /* The non-spacing attribute table consists of:
    * Non-spacing characters; generated from PropList.txt or
      "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
-   * Format control characters; generated from
-     "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
+   * Format control characters, except for characters with property
+     Prepended_Concatenation_Mark; generated from
+     "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt" and from
+     "grep Prepended_Concatenation_Mark PropList.txt".
+     Rationale for the Prepended_Concatenation_Mark exception:
+     The Unicode standard says "Unlike most other format characters,
+     however, they should be rendered with a visible glyph".
    * Zero width characters; generated from
      "grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt"
    * Hangul Jamo characters that have conjoining behaviour:
@@ -6695,7 +6700,9 @@ is_nonspacing (unsigned int ch)
 {
   return (unicode_attributes[ch].name != NULL
           && (get_bidi_category (ch) == UC_BIDI_NSM
-              || is_category_Cc (ch) || is_category_Cf (ch)
+              || is_category_Cc (ch)
+              || (is_category_Cf (ch)
+                  && !is_property_prepended_concatenation_mark (ch))
               || strncmp (unicode_attributes[ch].name, "ZERO WIDTH ", 11) == 0
               || (ch >= 0x1160 && ch <= 0x11A7) || (ch >= 0xD7B0 && ch <= 0xD7C6) /* jungseong */
               || (ch >= 0x11A8 && ch <= 0x11FF) || (ch >= 0xD7CB && ch <= 0xD7FB) /* jongseong */
diff --git a/lib/uniwidth/width0.h b/lib/uniwidth/width0.h
index 77954eb4d8..6cc35536ad 100644
--- a/lib/uniwidth/width0.h
+++ b/lib/uniwidth/width0.h
@@ -46,19 +46,19 @@ static const unsigned char nonspacing_table_data[48*64] = {
   0x00, 0x00, 0xfe, 0xff, 0xff, 0xff, 0xff, 0xbf, /* 0x0580-0x05bf */
   0xb6, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x05c0-0x05ff */
   /* 0x0600-0x07ff */
-  0x3f, 0x00, 0xff, 0x17, 0x00, 0x00, 0x00, 0x00, /* 0x0600-0x063f */
+  0x00, 0x00, 0xff, 0x17, 0x00, 0x00, 0x00, 0x00, /* 0x0600-0x063f */
   0x00, 0xf8, 0xff, 0xff, 0x00, 0x00, 0x01, 0x00, /* 0x0640-0x067f */
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x0680-0x06bf */
-  0x00, 0x00, 0xc0, 0xbf, 0x9f, 0x3d, 0x00, 0x00, /* 0x06c0-0x06ff */
-  0x00, 0x80, 0x02, 0x00, 0x00, 0x00, 0xff, 0xff, /* 0x0700-0x073f */
+  0x00, 0x00, 0xc0, 0x9f, 0x9f, 0x3d, 0x00, 0x00, /* 0x06c0-0x06ff */
+  0x00, 0x00, 0x02, 0x00, 0x00, 0x00, 0xff, 0xff, /* 0x0700-0x073f */
   0xff, 0x07, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x0740-0x077f */
   0x00, 0x00, 0x00, 0x00, 0xc0, 0xff, 0x01, 0x00, /* 0x0780-0x07bf */
   0x00, 0x00, 0x00, 0x00, 0x00, 0xf8, 0x0f, 0x20, /* 0x07c0-0x07ff */
   /* 0x0800-0x09ff */
   0x00, 0x00, 0xc0, 0xfb, 0xef, 0x3e, 0x00, 0x00, /* 0x0800-0x083f */
   0x00, 0x00, 0x00, 0x0e, 0x00, 0x00, 0x00, 0x00, /* 0x0840-0x087f */
-  0x00, 0x00, 0x03, 0xff, 0x00, 0x00, 0x00, 0x00, /* 0x0880-0x08bf */
-  0x00, 0xfc, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x08c0-0x08ff */
+  0x00, 0x00, 0x00, 0xff, 0x00, 0x00, 0x00, 0x00, /* 0x0880-0x08bf */
+  0x00, 0xfc, 0xff, 0xff, 0xfb, 0xff, 0xff, 0xff, /* 0x08c0-0x08ff */
   0x07, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x14, /* 0x0900-0x093f */
   0xfe, 0x21, 0xfe, 0x00, 0x0c, 0x00, 0x00, 0x00, /* 0x0940-0x097f */
   0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x10, /* 0x0980-0x09bf */
@@ -273,8 +273,8 @@ static const unsigned char nonspacing_table_data[48*64] = {
   /* 0x11000-0x111ff */
   0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, /* 0x11000-0x1103f */
   0x7f, 0x00, 0x00, 0x00, 0x00, 0x00, 0x19, 0x80, /* 0x11040-0x1107f */
-  0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x78, 0x26, /* 0x11080-0x110bf */
-  0x04, 0x20, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x110c0-0x110ff */
+  0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x78, 0x06, /* 0x11080-0x110bf */
+  0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x110c0-0x110ff */
   0x07, 0x00, 0x00, 0x00, 0x80, 0xef, 0x1f, 0x00, /* 0x11100-0x1113f */
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00, /* 0x11140-0x1117f */
   0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0xc0, 0x7f, /* 0x11180-0x111bf */
diff --git a/modules/uniwidth/u16-strwidth b/modules/uniwidth/u16-strwidth
index 1a4ea001e9..f7ceb9272c 100644
--- a/modules/uniwidth/u16-strwidth
+++ b/modules/uniwidth/u16-strwidth
@@ -10,7 +10,7 @@ uniwidth/u16-width
 unistr/u16-strlen
 
 configure.ac:
-gl_LIBUNISTRING_MODULE([1.1], [uniwidth/u16-strwidth])
+gl_LIBUNISTRING_MODULE([1.2], [uniwidth/u16-strwidth])
 
 Makefile.am:
 if LIBUNISTRING_COMPILE_UNIWIDTH_U16_STRWIDTH
diff --git a/modules/uniwidth/u16-width b/modules/uniwidth/u16-width
index 161898c93e..dfd08e3fec 100644
--- a/modules/uniwidth/u16-width
+++ b/modules/uniwidth/u16-width
@@ -10,7 +10,7 @@ uniwidth/width
 unistr/u16-mbtouc-unsafe
 
 configure.ac:
-gl_LIBUNISTRING_MODULE([1.1], [uniwidth/u16-width])
+gl_LIBUNISTRING_MODULE([1.2], [uniwidth/u16-width])
 
 Makefile.am:
 if LIBUNISTRING_COMPILE_UNIWIDTH_U16_WIDTH
diff --git a/modules/uniwidth/u32-strwidth b/modules/uniwidth/u32-strwidth
index 9c36df422a..a13836f1cc 100644
--- a/modules/uniwidth/u32-strwidth
+++ b/modules/uniwidth/u32-strwidth
@@ -10,7 +10,7 @@ uniwidth/u32-width
 unistr/u32-strlen
 
 configure.ac:
-gl_LIBUNISTRING_MODULE([1.1], [uniwidth/u32-strwidth])
+gl_LIBUNISTRING_MODULE([1.2], [uniwidth/u32-strwidth])
 
 Makefile.am:
 if LIBUNISTRING_COMPILE_UNIWIDTH_U32_STRWIDTH
diff --git a/modules/uniwidth/u32-width b/modules/uniwidth/u32-width
index 34f25fea76..d90d9b9a20 100644
--- a/modules/uniwidth/u32-width
+++ b/modules/uniwidth/u32-width
@@ -9,7 +9,7 @@ uniwidth/base
 uniwidth/width
 
 configure.ac:
-gl_LIBUNISTRING_MODULE([1.1], [uniwidth/u32-width])
+gl_LIBUNISTRING_MODULE([1.2], [uniwidth/u32-width])
 
 Makefile.am:
 if LIBUNISTRING_COMPILE_UNIWIDTH_U32_WIDTH
diff --git a/modules/uniwidth/u8-strwidth b/modules/uniwidth/u8-strwidth
index 303ee5c010..26857ae4b0 100644
--- a/modules/uniwidth/u8-strwidth
+++ b/modules/uniwidth/u8-strwidth
@@ -10,7 +10,7 @@ uniwidth/u8-width
 unistr/u8-strlen
 
 configure.ac:
-gl_LIBUNISTRING_MODULE([1.1], [uniwidth/u8-strwidth])
+gl_LIBUNISTRING_MODULE([1.2], [uniwidth/u8-strwidth])
 
 Makefile.am:
 if LIBUNISTRING_COMPILE_UNIWIDTH_U8_STRWIDTH
diff --git a/modules/uniwidth/u8-width b/modules/uniwidth/u8-width
index 36df3afe3f..46f5e4e014 100644
--- a/modules/uniwidth/u8-width
+++ b/modules/uniwidth/u8-width
@@ -10,7 +10,7 @@ uniwidth/width
 unistr/u8-mbtouc-unsafe
 
 configure.ac:
-gl_LIBUNISTRING_MODULE([1.1], [uniwidth/u8-width])
+gl_LIBUNISTRING_MODULE([1.2], [uniwidth/u8-width])
 
 Makefile.am:
 if LIBUNISTRING_COMPILE_UNIWIDTH_U8_WIDTH
diff --git a/modules/uniwidth/width b/modules/uniwidth/width
index 30973445b0..dc028317a1 100644
--- a/modules/uniwidth/width
+++ b/modules/uniwidth/width
@@ -13,7 +13,7 @@ uniwidth/base
 streq
 
 configure.ac:
-gl_LIBUNISTRING_MODULE([1.1], [uniwidth/width])
+gl_LIBUNISTRING_MODULE([1.2], [uniwidth/width])
 
 Makefile.am:
 if LIBUNISTRING_COMPILE_UNIWIDTH_WIDTH
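
As a cross-check of the regenerated width0.h, the changed bytes can be
decoded by hand. The sketch below is not part of the patch. It assumes that
each byte of nonspacing_table_data covers 8 consecutive code points, least
significant bit first; this layout is inferred from the row comments (8 bytes
per 64 code points), not taken from gen-uni-tables.c. The names changed_byte,
changed_bytes, cleared_code_points and print_cleared_code_points are
hypothetical, and the base values are computed from the row comments under
that same assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One changed byte of nonspacing_table_data, copied from the width0.h hunks.
   'base' is the first code point the byte is assumed to cover.  */
struct changed_byte
{
  unsigned int base;
  unsigned char old_value;
  unsigned char new_value;
};

static const struct changed_byte changed_bytes[] = {
  { 0x0600,  0x3f, 0x00 },  /* row 0x0600-0x063f,   byte 0 */
  { 0x06D8,  0xbf, 0x9f },  /* row 0x06c0-0x06ff,   byte 3 */
  { 0x0708,  0x80, 0x00 },  /* row 0x0700-0x073f,   byte 1 */
  { 0x0890,  0x03, 0x00 },  /* row 0x0880-0x08bf,   byte 2 */
  { 0x08E0,  0xff, 0xfb },  /* row 0x08c0-0x08ff,   byte 4 */
  { 0x110B8, 0x26, 0x06 },  /* row 0x11080-0x110bf, byte 7 */
  { 0x110C8, 0x20, 0x00 },  /* row 0x110c0-0x110ff, byte 1 */
};

/* Stores into out[] the code points whose nonspacing bit was cleared,
   in table order, and returns how many were found.  At most out_size
   entries are written.  */
size_t
cleared_code_points (unsigned int *out, size_t out_size)
{
  size_t count = 0;
  size_t n = sizeof (changed_bytes) / sizeof (changed_bytes[0]);
  for (size_t i = 0; i < n; i++)
    {
      const struct changed_byte *b = &changed_bytes[i];
      /* The patch only clears bits; it never sets a new one.  */
      assert ((b->new_value & ~b->old_value & 0xff) == 0);
      unsigned int cleared = b->old_value & ~b->new_value & 0xff;
      for (unsigned int bit = 0; bit < 8; bit++)
        if ((cleared >> bit) & 1)
          {
            if (count < out_size)
              out[count] = b->base + bit;
            count++;
          }
    }
  return count;
}

/* Prints the cleared code points, one per line, in U+XXXX notation.  */
void
print_cleared_code_points (void)
{
  unsigned int cps[16];
  size_t n = cleared_code_points (cps, 16);
  for (size_t i = 0; i < n && i < 16; i++)
    printf ("U+%04X\n", cps[i]);
}
```

Calling print_cleared_code_points () from a main function lists 13 code
points, the same ones named in the ChangeLog entry: U+0600..U+0605, U+06DD,
U+070F, U+0890..U+0891, U+08E2, U+110BD, U+110CD.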