Hi,
Jules Bertholet wrote:
> Makes two changes to the set of characters considered nonspacing:
>
> - Makes `Prepended_Concatenation_Mark`s no longer nonspacing.
> This matches the Unicode spec (which specifies these as taking up space
> in front of the characters they modify), and also aligns with
> glibc `wcwidth()`.
> - Makes `Default_Ignorable_Code_Point`s other than U+115F HANGUL CHOSEONG
> FILLER nonspacing. Unicode specifies
> (https://www.unicode.org/faq/unsup_char.html#3)
> that these "should be rendered as completely invisible (and non advancing,
> i.e. “zero width”), if not explicitly supported in rendering." U+115F is
> exempted because it is expected to be combined with other jamo to form a
> width-2 Hangul syllable block.
Thank you for the suggestions.
Regarding the Prepended_Concatenation_Mark characters, I agree, and I am making
the changes; see below.
Regarding the Default_Ignorable_Code_Point characters: Making all of them
non-spacing would assign width 0 to the characters
  U+115F HANGUL CHOSEONG FILLER
  U+3164 HANGUL FILLER
  U+FFA0 HALFWIDTH HANGUL FILLER
But this does not make sense to me:
* You exclude U+115F from your consideration, but the justification is weak:
Hangul composition of 3 characters in the range U+11xx creates a Hangul
syllable, and widths don't add up: 1 + 1 + 1 != 2 in the general case.
* The name of U+FFA0 being "HALFWIDTH HANGUL FILLER" suggests that
"HANGUL FILLER" traditionally has width 2 and "HALFWIDTH HANGUL FILLER"
traditionally has width 1. If both had width 0, there would be no need
for the HALFWIDTH one.
* glibc's wcwidth() function returns nonzero for these characters:
================================================================================
#define _GNU_SOURCE 1
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
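int main ()
{
  /* Take the locale from the environment, so that wcwidth() knows the
     character set in use.  */
  setlocale (LC_ALL, "");
  /* The three Hangul fillers.  */
  printf ("%d %d %d\n", wcwidth (0x115F), wcwidth (0x3164), wcwidth (0xFFA0));
  /* The 13 characters with property Prepended_Concatenation_Mark.  */
  printf ("%d %d %d %d %d %d %d %d %d %d %d %d %d\n",
          wcwidth (0x0600), wcwidth (0x0601), wcwidth (0x0602), wcwidth (0x0603),
          wcwidth (0x0604), wcwidth (0x0605), wcwidth (0x06DD), wcwidth (0x070F),
          wcwidth (0x0890), wcwidth (0x0891), wcwidth (0x08E2), wcwidth (0x110BD),
          wcwidth (0x110CD));
  return 0;
}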
================================================================================
produces:
$ LC_ALL=en_US.UTF-8 ./a.out
2 2 1
1 1 1 1 1 1 1 1 1 1 1 1 1
* Your argument based on an FAQ is weak, since FAQs tend to simplify
things so that they become easier to state or to understand.
Bruno
2024-02-12 Bruno Haible <[email protected]>
uniwidth/width: Assign width 1 to prepended concatenation marks.
Suggested by Jules Bertholet <[email protected]> in
<https://lists.gnu.org/archive/html/bug-gnulib/2024-02/msg00093.html>.
* lib/gen-uni-tables.c (is_nonspacing): For characters with property
Prepended_Concatenation_Mark, return false instead of true.
* lib/uniwidth/width0.h: Regenerated. This assigns width 1 to the
characters U+0600..U+0605, U+06DD, U+070F, U+0890..U+0891, U+08E2,
U+110BD, U+110CD.
* modules/uniwidth/width (configure.ac): Bump required libunistring
version.
* modules/uniwidth/u8-width (configure.ac): Likewise.
* modules/uniwidth/u8-strwidth (configure.ac): Likewise.
* modules/uniwidth/u16-width (configure.ac): Likewise.
* modules/uniwidth/u16-strwidth (configure.ac): Likewise.
* modules/uniwidth/u32-width (configure.ac): Likewise.
* modules/uniwidth/u32-strwidth (configure.ac): Likewise.
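For readers who want to experiment with the is_nonspacing change outside of
gen-uni-tables.c, here is a minimal, self-contained sketch of its Cf clause.
It is not the gnulib code: the names pcm_ranges, is_pcm_sketch and
cf_is_nonspacing_sketch are hypothetical, the range table is hard-coded from
the code points listed in the ChangeLog entry above instead of being read from
PropList.txt, and the caller is assumed to supply the "general category is Cf"
bit.

```c
#include <stdbool.h>
#include <stddef.h>

/* Inclusive code point ranges, copied from the ChangeLog entry above.
   Hypothetical table; the real generator reads PropList.txt instead.  */
struct range { unsigned int first; unsigned int last; };

static const struct range pcm_ranges[] =
{
  { 0x0600, 0x0605 },
  { 0x06DD, 0x06DD },
  { 0x070F, 0x070F },
  { 0x0890, 0x0891 },
  { 0x08E2, 0x08E2 },
  { 0x110BD, 0x110BD },
  { 0x110CD, 0x110CD },
};

/* Returns true if CH is one of the listed Prepended_Concatenation_Mark
   characters.  */
bool
is_pcm_sketch (unsigned int ch)
{
  for (size_t i = 0; i < sizeof pcm_ranges / sizeof pcm_ranges[0]; i++)
    if (ch >= pcm_ranges[i].first && ch <= pcm_ranges[i].last)
      return true;
  return false;
}

/* Simplified stand-in for the Cf clause only: a format control character
   counts as nonspacing unless it is a prepended concatenation mark.
   IS_CF is assumed to be provided by the caller.  */
bool
cf_is_nonspacing_sketch (unsigned int ch, bool is_cf)
{
  return is_cf && !is_pcm_sketch (ch);
}
```

With this sketch, cf_is_nonspacing_sketch (0x06DD, true) is false, while a
format control character outside the table still yields true.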
diff --git a/lib/gen-uni-tables.c b/lib/gen-uni-tables.c
index bc228105b4..c73ce06d64 100644
--- a/lib/gen-uni-tables.c
+++ b/lib/gen-uni-tables.c
@@ -6669,8 +6669,13 @@ fill_width (const char *width_filename)
/* The non-spacing attribute table consists of:
* Non-spacing characters; generated from PropList.txt or
"grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
- * Format control characters; generated from
- "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
+ * Format control characters, except for characters with property
+ Prepended_Concatenation_Mark; generated from
+ "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt" and from
+ "grep Prepended_Concatenation_Mark PropList.txt".
+ Rationale for the Prepended_Concatenation_Mark exception:
+ The Unicode standard says "Unlike most other format characters,
+ however, they should be rendered with a visible glyph".
* Zero width characters; generated from
"grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt"
* Hangul Jamo characters that have conjoining behaviour:
@@ -6695,7 +6700,9 @@ is_nonspacing (unsigned int ch)
{
return (unicode_attributes[ch].name != NULL
&& (get_bidi_category (ch) == UC_BIDI_NSM
- || is_category_Cc (ch) || is_category_Cf (ch)
+ || is_category_Cc (ch)
+ || (is_category_Cf (ch)
+ && !is_property_prepended_concatenation_mark (ch))
|| strncmp (unicode_attributes[ch].name, "ZERO WIDTH ", 11) == 0
|| (ch >= 0x1160 && ch <= 0x11A7) || (ch >= 0xD7B0 && ch <= 0xD7C6) /* jungseong */
|| (ch >= 0x11A8 && ch <= 0x11FF) || (ch >= 0xD7CB && ch <= 0xD7FB) /* jongseong */
diff --git a/lib/uniwidth/width0.h b/lib/uniwidth/width0.h
index 77954eb4d8..6cc35536ad 100644
--- a/lib/uniwidth/width0.h
+++ b/lib/uniwidth/width0.h
@@ -46,19 +46,19 @@ static const unsigned char nonspacing_table_data[48*64] = {
0x00, 0x00, 0xfe, 0xff, 0xff, 0xff, 0xff, 0xbf, /* 0x0580-0x05bf */
0xb6, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x05c0-0x05ff */
/* 0x0600-0x07ff */
- 0x3f, 0x00, 0xff, 0x17, 0x00, 0x00, 0x00, 0x00, /* 0x0600-0x063f */
+ 0x00, 0x00, 0xff, 0x17, 0x00, 0x00, 0x00, 0x00, /* 0x0600-0x063f */
0x00, 0xf8, 0xff, 0xff, 0x00, 0x00, 0x01, 0x00, /* 0x0640-0x067f */
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x0680-0x06bf */
- 0x00, 0x00, 0xc0, 0xbf, 0x9f, 0x3d, 0x00, 0x00, /* 0x06c0-0x06ff */
- 0x00, 0x80, 0x02, 0x00, 0x00, 0x00, 0xff, 0xff, /* 0x0700-0x073f */
+ 0x00, 0x00, 0xc0, 0x9f, 0x9f, 0x3d, 0x00, 0x00, /* 0x06c0-0x06ff */
+ 0x00, 0x00, 0x02, 0x00, 0x00, 0x00, 0xff, 0xff, /* 0x0700-0x073f */
0xff, 0x07, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x0740-0x077f */
0x00, 0x00, 0x00, 0x00, 0xc0, 0xff, 0x01, 0x00, /* 0x0780-0x07bf */
0x00, 0x00, 0x00, 0x00, 0x00, 0xf8, 0x0f, 0x20, /* 0x07c0-0x07ff */
/* 0x0800-0x09ff */
0x00, 0x00, 0xc0, 0xfb, 0xef, 0x3e, 0x00, 0x00, /* 0x0800-0x083f */
0x00, 0x00, 0x00, 0x0e, 0x00, 0x00, 0x00, 0x00, /* 0x0840-0x087f */
- 0x00, 0x00, 0x03, 0xff, 0x00, 0x00, 0x00, 0x00, /* 0x0880-0x08bf */
- 0x00, 0xfc, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, /* 0x08c0-0x08ff */
+ 0x00, 0x00, 0x00, 0xff, 0x00, 0x00, 0x00, 0x00, /* 0x0880-0x08bf */
+ 0x00, 0xfc, 0xff, 0xff, 0xfb, 0xff, 0xff, 0xff, /* 0x08c0-0x08ff */
0x07, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x14, /* 0x0900-0x093f */
0xfe, 0x21, 0xfe, 0x00, 0x0c, 0x00, 0x00, 0x00, /* 0x0940-0x097f */
0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x10, /* 0x0980-0x09bf */
@@ -273,8 +273,8 @@ static const unsigned char nonspacing_table_data[48*64] = {
/* 0x11000-0x111ff */
0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, /* 0x11000-0x1103f */
0x7f, 0x00, 0x00, 0x00, 0x00, 0x00, 0x19, 0x80, /* 0x11040-0x1107f */
- 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x78, 0x26, /* 0x11080-0x110bf */
- 0x04, 0x20, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x110c0-0x110ff */
+ 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x78, 0x06, /* 0x11080-0x110bf */
+ 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x110c0-0x110ff */
0x07, 0x00, 0x00, 0x00, 0x80, 0xef, 0x1f, 0x00, /* 0x11100-0x1113f */
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00, /* 0x11140-0x1117f */
0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0xc0, 0x7f, /* 0x11180-0x111bf */
diff --git a/modules/uniwidth/u16-strwidth b/modules/uniwidth/u16-strwidth
index 1a4ea001e9..f7ceb9272c 100644
--- a/modules/uniwidth/u16-strwidth
+++ b/modules/uniwidth/u16-strwidth
@@ -10,7 +10,7 @@ uniwidth/u16-width
unistr/u16-strlen
configure.ac:
-gl_LIBUNISTRING_MODULE([1.1], [uniwidth/u16-strwidth])
+gl_LIBUNISTRING_MODULE([1.2], [uniwidth/u16-strwidth])
Makefile.am:
if LIBUNISTRING_COMPILE_UNIWIDTH_U16_STRWIDTH
diff --git a/modules/uniwidth/u16-width b/modules/uniwidth/u16-width
index 161898c93e..dfd08e3fec 100644
--- a/modules/uniwidth/u16-width
+++ b/modules/uniwidth/u16-width
@@ -10,7 +10,7 @@ uniwidth/width
unistr/u16-mbtouc-unsafe
configure.ac:
-gl_LIBUNISTRING_MODULE([1.1], [uniwidth/u16-width])
+gl_LIBUNISTRING_MODULE([1.2], [uniwidth/u16-width])
Makefile.am:
if LIBUNISTRING_COMPILE_UNIWIDTH_U16_WIDTH
diff --git a/modules/uniwidth/u32-strwidth b/modules/uniwidth/u32-strwidth
index 9c36df422a..a13836f1cc 100644
--- a/modules/uniwidth/u32-strwidth
+++ b/modules/uniwidth/u32-strwidth
@@ -10,7 +10,7 @@ uniwidth/u32-width
unistr/u32-strlen
configure.ac:
-gl_LIBUNISTRING_MODULE([1.1], [uniwidth/u32-strwidth])
+gl_LIBUNISTRING_MODULE([1.2], [uniwidth/u32-strwidth])
Makefile.am:
if LIBUNISTRING_COMPILE_UNIWIDTH_U32_STRWIDTH
diff --git a/modules/uniwidth/u32-width b/modules/uniwidth/u32-width
index 34f25fea76..d90d9b9a20 100644
--- a/modules/uniwidth/u32-width
+++ b/modules/uniwidth/u32-width
@@ -9,7 +9,7 @@ uniwidth/base
uniwidth/width
configure.ac:
-gl_LIBUNISTRING_MODULE([1.1], [uniwidth/u32-width])
+gl_LIBUNISTRING_MODULE([1.2], [uniwidth/u32-width])
Makefile.am:
if LIBUNISTRING_COMPILE_UNIWIDTH_U32_WIDTH
diff --git a/modules/uniwidth/u8-strwidth b/modules/uniwidth/u8-strwidth
index 303ee5c010..26857ae4b0 100644
--- a/modules/uniwidth/u8-strwidth
+++ b/modules/uniwidth/u8-strwidth
@@ -10,7 +10,7 @@ uniwidth/u8-width
unistr/u8-strlen
configure.ac:
-gl_LIBUNISTRING_MODULE([1.1], [uniwidth/u8-strwidth])
+gl_LIBUNISTRING_MODULE([1.2], [uniwidth/u8-strwidth])
Makefile.am:
if LIBUNISTRING_COMPILE_UNIWIDTH_U8_STRWIDTH
diff --git a/modules/uniwidth/u8-width b/modules/uniwidth/u8-width
index 36df3afe3f..46f5e4e014 100644
--- a/modules/uniwidth/u8-width
+++ b/modules/uniwidth/u8-width
@@ -10,7 +10,7 @@ uniwidth/width
unistr/u8-mbtouc-unsafe
configure.ac:
-gl_LIBUNISTRING_MODULE([1.1], [uniwidth/u8-width])
+gl_LIBUNISTRING_MODULE([1.2], [uniwidth/u8-width])
Makefile.am:
if LIBUNISTRING_COMPILE_UNIWIDTH_U8_WIDTH
diff --git a/modules/uniwidth/width b/modules/uniwidth/width
index 30973445b0..dc028317a1 100644
--- a/modules/uniwidth/width
+++ b/modules/uniwidth/width
@@ -13,7 +13,7 @@ uniwidth/base
streq
configure.ac:
-gl_LIBUNISTRING_MODULE([1.1], [uniwidth/width])
+gl_LIBUNISTRING_MODULE([1.2], [uniwidth/width])
Makefile.am:
if LIBUNISTRING_COMPILE_UNIWIDTH_WIDTH
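
The regenerated bytes in lib/uniwidth/width0.h can be cross-checked against
the ChangeLog entry by hand. The following sketch does that mechanically. It
rests on an assumption inferred from the diff itself, not from the generator:
each row comment names the first code point of a 64-code-point row, and bit k
of byte j in that row stands for code point row_start + 8 * j + k. The names
changed_byte, changes, cleared_code_points and print_cleared_code_points are
hypothetical; the byte values are copied from the hunks above.

```c
#include <stddef.h>
#include <stdio.h>

/* One changed byte of nonspacing_table_data, as it appears in the diff.
   INDEX is the position of the byte within its 8-byte row.  */
struct changed_byte
{
  unsigned int row_start;
  unsigned int index;
  unsigned char old_value;
  unsigned char new_value;
};

static const struct changed_byte changes[] =
{
  { 0x0600,  0, 0x3f, 0x00 },
  { 0x06c0,  3, 0xbf, 0x9f },
  { 0x0700,  1, 0x80, 0x00 },
  { 0x0880,  2, 0x03, 0x00 },
  { 0x08c0,  4, 0xff, 0xfb },
  { 0x11080, 7, 0x26, 0x06 },
  { 0x110c0, 1, 0x20, 0x00 },
};

/* Stores in OUT (at most MAX entries) the code points whose bit went from
   1 to 0, i.e. that are no longer marked nonspacing.  Returns the count.
   Assumed layout: bit k of byte j stands for row_start + 8 * j + k.  */
size_t
cleared_code_points (unsigned int *out, size_t max)
{
  size_t n = 0;
  for (size_t i = 0; i < sizeof changes / sizeof changes[0]; i++)
    {
      unsigned char cleared = changes[i].old_value & ~changes[i].new_value;
      for (unsigned int bit = 0; bit < 8; bit++)
        if (((cleared >> bit) & 1) && n < max)
          out[n++] = changes[i].row_start + 8 * changes[i].index + bit;
    }
  return n;
}

/* Prints the cleared code points, one per line, in U+XXXX notation.  */
void
print_cleared_code_points (void)
{
  unsigned int cps[64];
  size_t n = cleared_code_points (cps, 64);
  for (size_t i = 0; i < n; i++)
    printf ("U+%04X\n", cps[i]);
}
```

Calling print_cleared_code_points () from a main function lists thirteen code
points, U+0600..U+0605, U+06DD, U+070F, U+0890..U+0891, U+08E2, U+110BD and
U+110CD, which is the set named in the ChangeLog entry.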