Re: [PATCH] mbrtowc: remove a redundant condition

Bruno Haible Mon, 22 Mar 2021 05:49:25 -0700

Hi Benno,

> By staring at the code and trying to understand whether I am missing
> something, why that c == 0xf4 is there.


I see. For consistency with the other occurrences of this idiom

lib/unistr/u8-check.c:66:                  && (c < 0xf4 || (c == 0xf4 && s[1] < 0x90)))
lib/unistr/u8-mblen.c:56:                  && (c < 0xf4 || (c == 0xf4 && s[1] < 0x90)))
lib/unistr/u8-mbtouc-aux.c:98:                              && (c < 0xf4 || (c == 0xf4 && s[1] < 0x90)))
lib/unistr/u8-mbtouc-unsafe-aux.c:98:                              && (c < 0xf4 || (c == 0xf4 && s[1] < 0x90))
lib/unistr/u8-mbtouc-unsafe.c:109:                              && (c < 0xf4 || (c == 0xf4 && s[1] < 0x90))
lib/unistr/u8-mbtouc.c:108:                              && (c < 0xf4 || (c == 0xf4 && s[1] < 0x90)))
lib/unistr/u8-mbtoucr.c:95:                  && (c < 0xf4 || (c == 0xf4 && s[1] < 0x90)))
lib/unistr/u8-prev.c:68:                            && (c_4 < 0xf4 || (c_4 == 0xf4 && c_3 < 0x90)))
lib/unistr/u8-strmblen.c:51:              && (c < 0xf4 || (c == 0xf4 && s[1] < 0x90)))
lib/unistr/u8-strmbtouc.c:63:              && (c < 0xf4 || (c == 0xf4 && s[1] < 0x90)))

I prefer to keep the test in, but commented out. I'm applying the patch below.
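
For the record, the redundancy can be checked by brute force. The following is only a sketch, not code from Gnulib: the function names are made up for illustration, and the only fact it relies on is the one stated in the ChangeLog entry, namely that c <= 0xf4 is already known at this point.

```c
#include <stdbool.h>

/* Full form of the test, as written in the lib/unistr/ occurrences.  */
bool
upper_bound_full (unsigned int c, unsigned int c2)
{
  return c < 0xf4 || (c == 0xf4 && c2 < 0x90);
}

/* Reduced form; equivalent only when c <= 0xf4 is already established.  */
bool
upper_bound_reduced (unsigned int c, unsigned int c2)
{
  return c < 0xf4 || (/* c == 0xf4 && */ c2 < 0x90);
}

/* Count the (c, c2) pairs with c <= 0xf4 on which the two forms differ.  */
int
count_mismatches (void)
{
  int mismatches = 0;
  for (unsigned int c = 0; c <= 0xf4; c++)
    for (unsigned int c2 = 0; c2 <= 0xff; c2++)
      if (upper_bound_full (c, c2) != upper_bound_reduced (c, c2))
        mismatches++;
  return mismatches;
}
```

count_mismatches () returns 0. Without the c <= 0xf4 precondition the two forms do differ, for example at c = 0xf5, c2 = 0x80, which is why the commented-out test is worth keeping as a reminder.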

> I was looking at the code to see how conversion from multibyte to
> wide character is done, whether it's worth it to give nano its own
> copy.  There is quite some overhead at the beginning, and nano doesn't
> need to do the checks on m.  Also, I want nano's version to return -1
> for codes beyond U+10FFFF.  The mbtowc from glibc returns 4 for codes
> from U+110000 to U+1FFFFF, but those are not valid UTF-8 sequences,

Indeed, it's registered as a glibc bug:
  https://sourceware.org/bugzilla/show_bug.cgi?id=2373
  https://sourceware.org/bugzilla/show_bug.cgi?id=26034

> so nano would need to do another check afterward.

Yup. In GNU gettext, I've had abort()s due to the fact that some multibyte-to-
Unicode conversion allowed characters up to U+7FFFFFFF and a subsequent check
for a valid Unicode character (<= U+0010FFFF) failed.
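
Such an "extra check afterward" could look roughly like this. It is a minimal sketch under my own assumptions, not nano or Gnulib code: checked_mbrtowc and wc_in_unicode_range are hypothetical names, and only the upper bound U+10FFFF is tested (surrogates are not handled here).

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: true if WC is at most U+10FFFF.
   A negative wchar_t value converts to a huge unsigned value and is
   therefore rejected as well.  */
bool
wc_in_unicode_range (wchar_t wc)
{
  return (unsigned long) wc <= 0x10FFFFUL;
}

/* Hypothetical wrapper: like mbrtowc, but fails with EILSEQ and returns
   (size_t) -1 when the converted character lies beyond U+10FFFF.
   Note that the conversion state *PS has already been advanced by
   mbrtowc when the range check rejects the character.  */
size_t
checked_mbrtowc (wchar_t *pwc, const char *s, size_t n, mbstate_t *ps)
{
  wchar_t wc = 0;
  size_t ret = mbrtowc (&wc, s, n, ps);

  /* Pass through "invalid" and "incomplete" unchanged.  */
  if (ret == (size_t) -1 || ret == (size_t) -2)
    return ret;

  if (!wc_in_unicode_range (wc))
    {
      errno = EILSEQ;
      return (size_t) -1;
    }

  if (pwc != NULL)
    *pwc = wc;
  return ret;
}
```

A caller uses it exactly like mbrtowc and treats (size_t) -1 as an invalid sequence, so no separate validity check is needed at each call site.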

You're right, this is worth documenting.

Bruno


2021-03-22  Benno Schulenberg  <[email protected]>  (tiny change)

        mbrtowc: Remove a redundant condition.
        * lib/mbrtowc-impl-utf8.h: There is no need to check for c == 0xf4
        when !(c < 0xf4), as ten lines earlier c <= 0xf4 was established.

diff --git a/lib/mbrtowc-impl-utf8.h b/lib/mbrtowc-impl-utf8.h
index 4f3bbb3..58006d3 100644
--- a/lib/mbrtowc-impl-utf8.h
+++ b/lib/mbrtowc-impl-utf8.h
@@ -96,7 +96,7 @@
 
                     if ((c2 ^ 0x80) < 0x40
                         && (c >= 0xf1 || c2 >= 0x90)
-                        && (c < 0xf4 || (c == 0xf4 && c2 < 0x90)))
+                        && (c < 0xf4 || (/* c == 0xf4 && */ c2 < 0x90)))
                       {
                         if (m == 2)
                           goto incomplete;


2021-03-22  Bruno Haible  <[email protected]>

        doc: Mention an open glibc bug.
        * doc/posix-functions/mbrtowc.texi: Mention the possible out-of-range
        wchar_t values returned by this function on glibc.
        * doc/posix-functions/mbtowc.texi: Likewise.

diff --git a/doc/posix-functions/mbrtowc.texi b/doc/posix-functions/mbrtowc.texi
index 897e4da..291207e 100644
--- a/doc/posix-functions/mbrtowc.texi
+++ b/doc/posix-functions/mbrtowc.texi
@@ -44,6 +44,12 @@ Solaris 9.
 Portability problems not fixed by Gnulib:
 @itemize
 @item
+In UTF-8 locales, this function may return wide characters up to 0x7FFFFFFF
+(that is, beyond 0x0010FFFF) on some platforms:
+@c https://sourceware.org/bugzilla/show_bug.cgi?id=2373
+@c https://sourceware.org/bugzilla/show_bug.cgi?id=26034
+glibc 2.33.
+@item
 On Windows and 32-bit AIX platforms, @code{wchar_t} is a 16-bit type and
 therefore cannot accommodate all Unicode characters.
 However, the ISO C11 function @code{mbrtoc32}, provided by Gnulib module
diff --git a/doc/posix-functions/mbtowc.texi b/doc/posix-functions/mbtowc.texi
index e7e5376..855f825 100644
--- a/doc/posix-functions/mbtowc.texi
+++ b/doc/posix-functions/mbtowc.texi
@@ -16,6 +16,12 @@ Android 4.4.
 Portability problems not fixed by Gnulib:
 @itemize
 @item
+In UTF-8 locales, this function may return wide characters up to 0x7FFFFFFF
+(that is, beyond 0x0010FFFF) on some platforms:
+@c https://sourceware.org/bugzilla/show_bug.cgi?id=2373
+@c https://sourceware.org/bugzilla/show_bug.cgi?id=26034
+glibc 2.33.
+@item
 This function accumulates hidden state on some platforms:
 glibc 2.8 (see @url{https://sourceware.org/bugzilla/show_bug.cgi?id=9674}).
 @item
