From wchar_t to char32_t

Bruno Haible Mon, 19 Jun 2023 11:05:49 -0700

For many years, processing multibyte strings required the mbrtowc family of
functions and the 'wchar_t' type.


The major limitation of this API is that on Windows platforms (Cygwin as well
as native Windows) and in 32-bit mode on AIX, a 'wchar_t' is limited to 16 bits,
and this causes all sorts of bugs with characters outside the Unicode BMP.
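
To make the 16-bit limitation concrete: a code point outside the BMP needs
two UTF-16 units (a surrogate pair), so it cannot be stored in a single
16-bit 'wchar_t'. The following is only a minimal sketch; utf16_encode is a
hypothetical helper written for this illustration, not a function from any
library.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Store the UTF-16 encoding of the Unicode code point CP in OUT and
   return the number of 16-bit units used (1 or 2).
   Hypothetical helper, for illustration only.  */
static size_t
utf16_encode (char32_t cp, char16_t out[2])
{
  /* CP must be a Unicode scalar value: in range and not a surrogate.  */
  assert (cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

  if (cp < 0x10000)
    {
      /* Inside the BMP: one unit is enough.  */
      out[0] = (char16_t) cp;
      return 1;
    }

  /* Outside the BMP: split the 20 remaining bits into a surrogate pair.  */
  cp -= 0x10000;
  out[0] = (char16_t) (0xD800 + (cp >> 10));    /* high surrogate */
  out[1] = (char16_t) (0xDC00 + (cp & 0x3FF));  /* low surrogate */
  return 2;
}
```

For example, U+20AC (EURO SIGN) fits in one unit, whereas U+1F600 becomes
the pair 0xD83D 0xDE00. Code that looks at one 16-bit 'wchar_t' at a time
only ever sees half of such a character.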

Before 2010, we thought that this would only impact rarely used Chinese
characters. But nowadays, emoticons are in Unicode, outside the BMP, and
are frequently used on the web. So, supporting characters outside the BMP
has become more important.

In 2011, ISO C added the 'char32_t' type as a "32-bit wide character" type.
By now, many OSes provide this type and the corresponding mbrtoc32 function.
Elements of this type are actual Unicode code points. ISO C 11 only hinted
at this, but ISO C 23 actually requires it. All platforms that have the
mbrtoc32 function fulfil this requirement, and Gnulib's substitute
(module 'mbrtoc32') does so as well.
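
As an illustration of how mbrtoc32 is typically called, here is a small
sketch that counts the characters in a byte range. It assumes a platform
with <uchar.h> and that the caller has already selected the locale (for
example with setlocale (LC_ALL, "")); count_chars is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>
#include <wchar.h>   /* mbstate_t */

/* Count the characters in the N bytes starting at S, decoding them
   according to the current LC_CTYPE locale.
   Return (size_t) -1 if the input is invalid or ends in the middle of
   a character.  Hypothetical helper, for illustration only.  */
static size_t
count_chars (const char *s, size_t n)
{
  mbstate_t state = { 0 };   /* initial conversion state */
  size_t count = 0;

  assert (s != NULL || n == 0);

  while (n > 0)
    {
      char32_t c;
      size_t ret = mbrtoc32 (&c, s, n, &state);

      if (ret == (size_t) -1 || ret == (size_t) -2)
        return (size_t) -1;  /* invalid or incomplete sequence */
      if (ret == (size_t) -3)
        {
          /* One more char32_t produced from earlier input;
             no bytes were consumed by this call.  */
          count++;
          continue;
        }
      if (ret == 0)
        ret = 1;             /* a null character occupies one byte */

      s += ret;
      n -= ret;
      count++;
    }
  return count;
}
```

Since every char32_t result is one Unicode code point, a character outside
the BMP counts once here, without any surrogate handling in the caller.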

In particular, on glibc systems: since glibc 2.24, mbrtoc32 is identical to
mbrtowc. And the Gnulib convenience functions for char32_t characters
just delegate to the corresponding glibc functions for wchar_t wide characters.
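
Such a delegation could look roughly like the sketch below. This is not
Gnulib's actual code; sketch_c32tolower is a hypothetical name, and the
sketch assumes a platform where 'wchar_t' can hold every Unicode code
point, as on glibc.

```c
#include <assert.h>
#include <uchar.h>
#include <wchar.h>
#include <wctype.h>

/* This sketch only makes sense where wchar_t is wide enough for all of
   Unicode.  Refuse to compile where it is not (e.g. 16-bit wchar_t).  */
static_assert (WCHAR_MAX >= 0x10FFFF, "wchar_t too narrow for this sketch");

/* Lowercase mapping for a 32-bit wide character, implemented by simply
   calling the wchar_t function.  Hypothetical helper, not Gnulib code.  */
static char32_t
sketch_c32tolower (char32_t c)
{
  return (char32_t) towlower ((wint_t) c);
}
```

On platforms with a 16-bit 'wchar_t' this shortcut is not available, which
is where a real implementation has to do its own work.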

So, we are now in a position to support characters outside the BMP across
GNU programs, in a portable and maintainable way.

I added some documentation a month ago:
https://www.gnu.org/software/gnulib/manual/html_node/Strings-and-Characters.html

The migration from wchar_t to char32_t can be done by writing 'char32_t'
instead of 'wchar_t', and replacing function names according to this table:

  wchar_t             char32_t
  -------             --------
  7.31.2
  *wprintf            --             rarely used
  *wscanf             --             rarely used
  7.31.3
  fgetwc              --             rarely used, see "The wchar_t mess"
  fputwc              --             rarely used
  7.31.4.1
  wcsto{f,d,ld}       --             rarely used
  wcsto{l,ll,ul,ull}  --             rarely used
  7.31.4.2
  wcscpy              u32_strcpy
  wcsncpy             u32_strncpy
  wmemcpy             u32_cpy
  wmemmove            u32_move
  7.31.4.3
  wcscat              u32_strcat
  wcsncat             u32_strncat
  7.31.4.4
  wcscmp              u32_strcmp
  wcscoll             u32_strcoll
  wcsncmp             u32_strncmp
  wcsxfrm             --             rarely used
  wmemcmp             u32_cmp
  7.31.4.5/6
  wcschr              u32_strchr
  wcscspn             u32_strcspn
  wcspbrk             u32_strpbrk
  wcsrchr             u32_strrchr
  wcsspn              u32_strspn
  wcsstr              u32_strstr
  wcstok              u32_strtok
  wmemchr             u32_chr
  7.31.4.7
  wcslen              u32_strlen
  wmemset             u32_set
  7.31.5
  wcsftime            --             rarely used
  7.31.6.1
  btowc               btoc32
  wctob               c32tob
  7.31.6.2
  mbsinit             mbsinit
  7.31.6.3
  mbrlen              --             rarely used, use mbrtoc32 instead
  mbrtowc             mbrtoc32
  wcrtomb             c32rtomb
  7.31.6.4
  mbsrtowcs           mbsrtoc32s
  wcsrtombs           c32srtombs
  7.32.2.1
  iswalnum            c32isalnum
  iswalpha            c32isalpha
  iswblank            c32isblank
  iswcntrl            c32iscntrl
  iswdigit            c32isdigit
  iswgraph            c32isgraph
  iswlower            c32islower
  iswprint            c32isprint
  iswpunct            c32ispunct
  iswspace            c32isspace
  iswupper            c32isupper
  iswxdigit           c32isxdigit
  7.32.2.2
  iswctype            --             rarely used
  wctype              --             rarely used
  7.32.3.1
  towlower            c32tolower
  towupper            c32toupper
  7.32.3.2
  towctrans           --             rarely used
  wctrans             --             rarely used
  POSIX
  wcwidth             c32width
  wcswidth            c32swidth
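
As a small example of the table in use, the sketch below copies a string
one character at a time through 'char32_t', with mbrtoc32 and c32rtomb in
the places where wchar_t-based code would call mbrtowc and wcrtomb. Only
ISO C functions are used; copy_via_c32 is a made-up name, and treating a
(size_t) -3 result as an error is a simplification of this sketch.

```c
#include <assert.h>
#include <limits.h>   /* MB_LEN_MAX */
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>    /* mbstate_t */

/* Copy the NUL-terminated multibyte string SRC into DST (capacity
   DSTSIZE bytes), converting each character to char32_t and back.
   Return the number of bytes written, not counting the NUL, or
   (size_t) -1 on invalid input or insufficient space.
   Hypothetical helper, for illustration only.  */
static size_t
copy_via_c32 (char *dst, size_t dstsize, const char *src)
{
  mbstate_t in = { 0 };
  mbstate_t out = { 0 };
  size_t srclen;
  size_t written = 0;

  assert (dst != NULL && src != NULL);
  if (dstsize == 0)
    return (size_t) -1;
  srclen = strlen (src);

  while (srclen > 0)
    {
      char32_t c;
      char buf[MB_LEN_MAX];
      size_t inbytes;
      size_t outbytes;

      inbytes = mbrtoc32 (&c, src, srclen, &in);      /* was: mbrtowc */
      /* (size_t) -1, -2 and -3 are all greater than srclen.  */
      if (inbytes == 0 || inbytes > srclen)
        return (size_t) -1;

      outbytes = c32rtomb (buf, c, &out);             /* was: wcrtomb */
      /* Keep one byte free for the terminating NUL.  */
      if (outbytes == (size_t) -1 || outbytes >= dstsize - written)
        return (size_t) -1;

      memcpy (dst + written, buf, outbytes);
      written += outbytes;
      src += inbytes;
      srclen -= inbytes;
    }
  dst[written] = '\0';
  return written;
}
```

The shape of the loop is the same as with 'wchar_t'; only the type and the
two function names change.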


Paul has already started this migration, in diffutils:
https://git.savannah.gnu.org/gitweb/?p=diffutils.git;a=commitdiff;h=a2e301b52cc5bdb44540aa66860dc59fa1fa5a89

In Gnulib, the following areas will need migration:

* lib/mbchar.h
  lib/mbiter.h
  lib/mbuiter.h
  Draft patch attached.

* lib/dfa.c
  lib/localeinfo.h
  lib/localeinfo.c
  Needs to be done carefully, so as not to break gawk.

* lib/regcomp.c
  lib/regexec.c
  lib/regex_internal.h
  lib/regex_internal.c
  Needs to be done in a way that is acceptable to glibc upstream.

* lib/fnmatch.c
  Likewise.

* lib/exclude.c

* lib/nstrftime.c

* lib/quotearg.c
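
A side note on the draft patch for lib/mbchar.h: the comparison macros no
longer subtract the two values after casting them to 'int', but use
_GL_CMP. With an unsigned 32-bit type, a conversion to 'int' is
implementation-defined for values above INT_MAX, so a three-way comparison
that avoids the subtraction is the safer form. The sketch below assumes
that _GL_CMP follows the usual (a > b) - (a < b) idiom; cmp_c32 is a
hypothetical stand-in, not the macro itself.

```c
#include <assert.h>
#include <uchar.h>

/* char32_t is an unsigned type, so the difference of two values can
   never be negative by itself.  */
static_assert ((char32_t) -1 > 0, "char32_t is expected to be unsigned");

/* Three-way comparison: return -1, 0 or 1 depending on whether A is
   less than, equal to, or greater than B.
   Hypothetical stand-in for the _GL_CMP macro used in the patch.  */
static int
cmp_c32 (char32_t a, char32_t b)
{
  return (a > b) - (a < b);
}
```

This gives the right sign for any pair of values, including ones that
would not fit in an 'int'.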


Bruno

diff --git a/doc/strings.texi b/doc/strings.texi
index aa0830f1a5..73cb56120f 100644
--- a/doc/strings.texi
+++ b/doc/strings.texi
@@ -44,7 +44,7 @@
 functions, standardized by ISO C and POSIX, that assume this
 representation of strings.
 
-An @emph{character encoding}, or @emph{encoding} for short, describes
+A @emph{character encoding}, or @emph{encoding} for short, describes
 how the elements of a character set are represented as a sequence of
 bytes.  For example, in the @code{ASCII} encoding, the UNDERSCORE
 character is represented by a single byte, with value 0x5F.  As another
diff --git a/lib/exclude.c b/lib/exclude.c
index 7bd0ec8c71..af204cd300 100644
--- a/lib/exclude.c
+++ b/lib/exclude.c
@@ -209,10 +209,10 @@ string_hasher_ci (void const *data, size_t n_buckets)
   for (mbui_init (iter, p); mbui_avail (iter); mbui_advance (iter))
     {
       mbchar_t m = mbui_cur (iter);
-      wchar_t wc;
+      char32_t wc;
 
       if (m.wc_valid)
-        wc = towlower (m.wc);
+        wc = c32tolower (m.wc);
       else
         wc = *m.ptr;
 
diff --git a/lib/mbchar.h b/lib/mbchar.h
index a2ff1d8b21..c183772cc6 100644
--- a/lib/mbchar.h
+++ b/lib/mbchar.h
@@ -17,10 +17,10 @@
 /* Written by Bruno Haible <[email protected]>.  */
 
 /* A multibyte character is a short subsequence of a char* string,
-   representing a single wide character.
+   representing a single 32-bit wide character.
 
-   We use multibyte characters instead of wide characters because of
-   the following goals:
+   We use multibyte characters instead of 32-bit wide characters because
+   of the following goals:
    1) correct multibyte handling, i.e. operate according to the LC_CTYPE
       locale,
    2) ease of maintenance, i.e. the maintainer needs not know all details
@@ -28,8 +28,7 @@
    3) don't fail grossly if the input is not in the encoding set by the
       locale, because often different encodings are in use in the same
       countries (ISO-8859-1/UTF-8, EUC-JP/Shift_JIS, ...),
-   4) fast in the case of ASCII characters,
-   5) portability, i.e. don't make unportable assumptions about wchar_t.
+   4) fast in the case of ASCII characters.
 
    Multibyte characters are only accessed through the mb* macros.
 
@@ -150,8 +149,7 @@
 #endif
 
 #include <string.h>
-#include <wchar.h>
-#include <wctype.h>
+#include <uchar.h>
 
 _GL_INLINE_HEADER_BEGIN
 #ifndef MBCHAR_INLINE
@@ -164,8 +162,8 @@ struct mbchar
 {
   const char *ptr;      /* pointer to current character */
   size_t bytes;         /* number of bytes of current character, > 0 */
-  bool wc_valid;        /* true if wc is a valid wide character */
-  wchar_t wc;           /* if wc_valid: the current character */
+  bool wc_valid;        /* true if wc is a valid 32-bit wide character */
+  char32_t wc;          /* if wc_valid: the current character */
   char buf[MBCHAR_BUF_SIZE]; /* room for the bytes, used for file input only */
 };
 
@@ -184,7 +182,7 @@ typedef struct mbchar mbchar_t;
 #define mb_cmp(mbc1, mbc2) \
   ((mbc1).wc_valid                                                      \
    ? ((mbc2).wc_valid                                                   \
-      ? (int) (mbc1).wc - (int) (mbc2).wc                               \
+      ? _GL_CMP ((mbc1).wc, (mbc2).wc)                                  \
       : -1)                                                             \
    : ((mbc2).wc_valid                                                   \
       ? 1                                                               \
@@ -196,7 +194,7 @@ typedef struct mbchar mbchar_t;
 #define mb_casecmp(mbc1, mbc2) \
   ((mbc1).wc_valid                                                      \
    ? ((mbc2).wc_valid                                                   \
-      ? (int) towlower ((mbc1).wc) - (int) towlower ((mbc2).wc)         \
+      ? _GL_CMP (c32tolower ((mbc1).wc), c32tolower ((mbc2).wc))        \
       : -1)                                                             \
    : ((mbc2).wc_valid                                                   \
       ? 1                                                               \
@@ -212,25 +210,25 @@ typedef struct mbchar mbchar_t;
      && memcmp ((mbc1).ptr, (mbc2).ptr, (mbc1).bytes) == 0)
 #define mb_caseequal(mbc1, mbc2) \
   ((mbc1).wc_valid && (mbc2).wc_valid                                   \
-   ? towlower ((mbc1).wc) == towlower ((mbc2).wc)                       \
+   ? c32tolower ((mbc1).wc) == c32tolower ((mbc2).wc)                   \
    : (mbc1).bytes == (mbc2).bytes                                       \
      && memcmp ((mbc1).ptr, (mbc2).ptr, (mbc1).bytes) == 0)
 
 /* <ctype.h>, <wctype.h> classification.  */
 #define mb_isascii(mbc) \
   ((mbc).wc_valid && (mbc).wc >= 0 && (mbc).wc <= 127)
-#define mb_isalnum(mbc) ((mbc).wc_valid && iswalnum ((mbc).wc))
-#define mb_isalpha(mbc) ((mbc).wc_valid && iswalpha ((mbc).wc))
-#define mb_isblank(mbc) ((mbc).wc_valid && iswblank ((mbc).wc))
-#define mb_iscntrl(mbc) ((mbc).wc_valid && iswcntrl ((mbc).wc))
-#define mb_isdigit(mbc) ((mbc).wc_valid && iswdigit ((mbc).wc))
-#define mb_isgraph(mbc) ((mbc).wc_valid && iswgraph ((mbc).wc))
-#define mb_islower(mbc) ((mbc).wc_valid && iswlower ((mbc).wc))
-#define mb_isprint(mbc) ((mbc).wc_valid && iswprint ((mbc).wc))
-#define mb_ispunct(mbc) ((mbc).wc_valid && iswpunct ((mbc).wc))
-#define mb_isspace(mbc) ((mbc).wc_valid && iswspace ((mbc).wc))
-#define mb_isupper(mbc) ((mbc).wc_valid && iswupper ((mbc).wc))
-#define mb_isxdigit(mbc) ((mbc).wc_valid && iswxdigit ((mbc).wc))
+#define mb_isalnum(mbc) ((mbc).wc_valid && c32isalnum ((mbc).wc))
+#define mb_isalpha(mbc) ((mbc).wc_valid && c32isalpha ((mbc).wc))
+#define mb_isblank(mbc) ((mbc).wc_valid && c32isblank ((mbc).wc))
+#define mb_iscntrl(mbc) ((mbc).wc_valid && c32iscntrl ((mbc).wc))
+#define mb_isdigit(mbc) ((mbc).wc_valid && c32isdigit ((mbc).wc))
+#define mb_isgraph(mbc) ((mbc).wc_valid && c32isgraph ((mbc).wc))
+#define mb_islower(mbc) ((mbc).wc_valid && c32islower ((mbc).wc))
+#define mb_isprint(mbc) ((mbc).wc_valid && c32isprint ((mbc).wc))
+#define mb_ispunct(mbc) ((mbc).wc_valid && c32ispunct ((mbc).wc))
+#define mb_isspace(mbc) ((mbc).wc_valid && c32isspace ((mbc).wc))
+#define mb_isupper(mbc) ((mbc).wc_valid && c32isupper ((mbc).wc))
+#define mb_isxdigit(mbc) ((mbc).wc_valid && c32isxdigit ((mbc).wc))
 
 /* Extra <wchar.h> function.  */
 
@@ -238,12 +236,12 @@ typedef struct mbchar mbchar_t;
 #define MB_UNPRINTABLE_WIDTH 1
 
 MBCHAR_INLINE int
-mb_width_aux (wint_t wc)
+mb_width_aux (char32_t wc)
 {
-  int w = wcwidth (wc);
+  int w = c32width (wc);
   /* For unprintable characters, arbitrarily return 0 for control characters
      and MB_UNPRINTABLE_WIDTH otherwise.  */
-  return (w >= 0 ? w : iswcntrl (wc) ? 0 : MB_UNPRINTABLE_WIDTH);
+  return (w >= 0 ? w : c32iscntrl (wc) ? 0 : MB_UNPRINTABLE_WIDTH);
 }
 
 #define mb_width(mbc) \
diff --git a/lib/mbfile.h b/lib/mbfile.h
index 3482f394b9..7c6d70fcae 100644
--- a/lib/mbfile.h
+++ b/lib/mbfile.h
@@ -110,7 +110,7 @@ mbfile_multi_getc (struct mbchar *mbc, struct mbfile_multi *mbf)
     {
       /* These characters are part of the basic character set.  ISO C 99
          guarantees that their wide character code is identical to their
-         char code.  */
+         char code.  The 32-bit wide character code is the same as well.  */
       mbc->wc = mbc->buf[0] = mbf->buf[0];
       mbc->wc_valid = true;
       mbc->ptr = &mbc->buf[0];
@@ -136,7 +136,7 @@ mbfile_multi_getc (struct mbchar *mbc, struct mbfile_multi *mbf)
          behaviour will clobber it.  */
       mbstate_t backup_state = mbf->state;
 
-      bytes = mbrtowc (&mbc->wc, &mbf->buf[0], mbf->bufcount, &mbf->state);
+      bytes = mbrtoc32 (&mbc->wc, &mbf->buf[0], mbf->bufcount, &mbf->state);
 
       if (bytes == (size_t) -1)
         {
@@ -178,7 +178,7 @@ mbfile_multi_getc (struct mbchar *mbc, struct mbfile_multi *mbf)
         {
           if (bytes == 0)
             {
-              /* A null wide character was encountered.  */
+              /* A null 32-bit wide character was encountered.  */
               bytes = 1;
               assert (mbf->buf[0] == '\0');
               assert (mbc->wc == 0);
diff --git a/lib/mbiter.h b/lib/mbiter.h
index 7b41870b55..93bad990a1 100644
--- a/lib/mbiter.h
+++ b/lib/mbiter.h
@@ -90,7 +90,7 @@
 #include <assert.h>
 #include <stddef.h>
 #include <string.h>
-#include <wchar.h>
+#include <uchar.h>
 
 #include "mbchar.h"
 
@@ -106,11 +106,11 @@ struct mbiter_multi
   mbstate_t state;      /* if in_shift: current shift state */
   bool next_done;       /* true if mbi_avail has already filled the following */
   struct mbchar cur;    /* the current character:
-        const char *cur.ptr             pointer to current character
+        const char *cur.ptr          pointer to current character
         The following are only valid after mbi_avail.
-        size_t cur.bytes                number of bytes of current character
-        bool cur.wc_valid               true if wc is a valid wide character
-        wchar_t cur.wc                  if wc_valid: the current character
+        size_t cur.bytes             number of bytes of current character
+        bool cur.wc_valid            true if wc is a valid 32-bit wide character
+        char32_t cur.wc              if wc_valid: the current character
         */
 };
 
@@ -136,8 +136,8 @@ mbiter_multi_next (struct mbiter_multi *iter)
       assert (mbsinit (&iter->state));
       iter->in_shift = true;
     with_shift:
-      iter->cur.bytes = mbrtowc (&iter->cur.wc, iter->cur.ptr,
-                                 iter->limit - iter->cur.ptr, &iter->state);
+      iter->cur.bytes = mbrtoc32 (&iter->cur.wc, iter->cur.ptr,
+                                  iter->limit - iter->cur.ptr, &iter->state);
       if (iter->cur.bytes == (size_t) -1)
         {
           /* An invalid multibyte sequence was encountered.  */
diff --git a/lib/mbscasestr.c b/lib/mbscasestr.c
index d92b847ba7..0753aeb864 100644
--- a/lib/mbscasestr.c
+++ b/lib/mbscasestr.c
@@ -64,7 +64,7 @@ knuth_morris_pratt_multibyte (const char *haystack, const char *needle,
       {
         mb_copy (&needle_mbchars[j], &mbui_cur (iter));
         if (needle_mbchars[j].wc_valid)
-          needle_mbchars[j].wc = towlower (needle_mbchars[j].wc);
+          needle_mbchars[j].wc = c32tolower (needle_mbchars[j].wc);
       }
   }
 
@@ -152,7 +152,7 @@ knuth_morris_pratt_multibyte (const char *haystack, const char *needle,
 
         mb_copy (&c, &mbui_cur (phaystack));
         if (c.wc_valid)
-          c.wc = towlower (c.wc);
+          c.wc = c32tolower (c.wc);
         if (mb_equal (needle_mbchars[j], c))
           {
             j++;
@@ -237,7 +237,7 @@ mbscasestr (const char *haystack, const char *needle)
 
           mb_copy (&b, &mbui_cur (iter_needle));
           if (b.wc_valid)
-            b.wc = towlower (b.wc);
+            b.wc = c32tolower (b.wc);
 
           mbui_init (iter_haystack, haystack);
           for (;; mbui_advance (iter_haystack))
@@ -279,7 +279,7 @@ mbscasestr (const char *haystack, const char *needle)
               comparison_count++;
               mb_copy (&c, &mbui_cur (iter_haystack));
               if (c.wc_valid)
-                c.wc = towlower (c.wc);
+                c.wc = c32tolower (c.wc);
               if (mb_equal (c, b))
                 /* The first character matches.  */
                 {
diff --git a/lib/mbuiter.h b/lib/mbuiter.h
index 7a619f19e1..632def10c5 100644
--- a/lib/mbuiter.h
+++ b/lib/mbuiter.h
@@ -114,11 +114,11 @@ struct mbuiter_multi
   mbstate_t state;      /* if in_shift: current shift state */
   bool next_done;       /* true if mbui_avail has already filled the following */
   struct mbchar cur;    /* the current character:
-        const char *cur.ptr             pointer to current character
+        const char *cur.ptr          pointer to current character
         The following are only valid after mbui_avail.
-        size_t cur.bytes                number of bytes of current character
-        bool cur.wc_valid               true if wc is a valid wide character
-        wchar_t cur.wc                  if wc_valid: the current character
+        size_t cur.bytes             number of bytes of current character
+        bool cur.wc_valid            true if wc is a valid 32-bit wide character
+        char32_t cur.wc              if wc_valid: the current character
         */
 };
 
@@ -144,9 +144,9 @@ mbuiter_multi_next (struct mbuiter_multi *iter)
       assert (mbsinit (&iter->state));
       iter->in_shift = true;
     with_shift:
-      iter->cur.bytes = mbrtowc (&iter->cur.wc, iter->cur.ptr,
-                                 strnlen1 (iter->cur.ptr, MB_CUR_MAX),
-                                 &iter->state);
+      iter->cur.bytes = mbrtoc32 (&iter->cur.wc, iter->cur.ptr,
+                                  strnlen1 (iter->cur.ptr, MB_CUR_MAX),
+                                  &iter->state);
       if (iter->cur.bytes == (size_t) -1)
         {
           /* An invalid multibyte sequence was encountered.  */
diff --git a/modules/exclude b/modules/exclude
index 841dd826cd..93bfdaf4cf 100644
--- a/modules/exclude
+++ b/modules/exclude
@@ -7,6 +7,7 @@ lib/exclude.c
 
 Depends-on:
 assert-h
+c32tolower
 filename
 fnmatch
 fopen-gnu
diff --git a/modules/mbchar b/modules/mbchar
index b1fa0fa4ac..51a1c8e1b9 100644
--- a/modules/mbchar
+++ b/modules/mbchar
@@ -10,12 +10,21 @@ Depends-on:
 extensions
 extern-inline
 stdbool
-wchar
-wctype-h
-iswblank
-iswdigit
-iswxdigit
-wcwidth
+uchar
+c32isalnum
+c32isalpha
+c32isblank
+c32iscntrl
+c32isdigit
+c32isgraph
+c32islower
+c32isprint
+c32ispunct
+c32isspace
+c32isupper
+c32isxdigit
+c32tolower
+c32width
 memcmp
 
 configure.ac:
diff --git a/modules/mbiter b/modules/mbiter
index 42305d62cd..082afd42f2 100644
--- a/modules/mbiter
+++ b/modules/mbiter
@@ -10,9 +10,9 @@ m4/mbrtowc.m4
 Depends-on:
 extern-inline
 mbchar
-mbrtowc
+mbrtoc32
 mbsinit
-wchar
+uchar
 stdbool
 
 configure.ac:
diff --git a/modules/mbscasestr b/modules/mbscasestr
index 2892c2fc2b..672cac8960 100644
--- a/modules/mbscasestr
+++ b/modules/mbscasestr
@@ -11,6 +11,7 @@ stdbool
 string
 mbslen
 malloca
+c32tolower
 strnlen
 
 configure.ac:
diff --git a/modules/mbuiter b/modules/mbuiter
index b9e41031d5..63a11ff2f5 100644
--- a/modules/mbuiter
+++ b/modules/mbuiter
@@ -10,9 +10,9 @@ m4/mbrtowc.m4
 Depends-on:
 extern-inline
 mbchar
-mbrtowc
+mbrtoc32
 mbsinit
-wchar
+uchar
 stdbool
 strnlen1
