Hi Benno,

> By staring at the code and trying to understand whether I am missing
> something, why that c == 0xf4 is there.

I see. For consistency with the other occurrences of this idiom

lib/unistr/u8-check.c:66:              && (c < 0xf4 || (c == 0xf4 && s[1] < 0x90)))
lib/unistr/u8-mblen.c:56:              && (c < 0xf4 || (c == 0xf4 && s[1] < 0x90)))
lib/unistr/u8-mbtouc-aux.c:98:         && (c < 0xf4 || (c == 0xf4 && s[1] < 0x90)))
lib/unistr/u8-mbtouc-unsafe-aux.c:98:  && (c < 0xf4 || (c == 0xf4 && s[1] < 0x90))
lib/unistr/u8-mbtouc-unsafe.c:109:     && (c < 0xf4 || (c == 0xf4 && s[1] < 0x90))
lib/unistr/u8-mbtouc.c:108:            && (c < 0xf4 || (c == 0xf4 && s[1] < 0x90)))
lib/unistr/u8-mbtoucr.c:95:            && (c < 0xf4 || (c == 0xf4 && s[1] < 0x90)))
lib/unistr/u8-prev.c:68:               && (c_4 < 0xf4 || (c_4 == 0xf4 && c_3 < 0x90)))
lib/unistr/u8-strmblen.c:51:           && (c < 0xf4 || (c == 0xf4 && s[1] < 0x90)))
lib/unistr/u8-strmbtouc.c:63:          && (c < 0xf4 || (c == 0xf4 && s[1] < 0x90)))

I prefer to keep the test in, but commented out. I apply the patch below.

> I was looking at the code to see how conversion from multibyte to
> wide character is done, whether it's worth it to give nano its own
> copy. There is quite some overhead at the beginning, and nano doesn't
> need to do the checks on m. Also, I want nano's version to return -1
> for codes beyond U+10FFFF. The mbtowc from glibc returns 4 for codes
> from U+110000 to U+1FFFFF, but those are not valid UTF-8 sequences,

Indeed, it's registered as a glibc bug:
https://sourceware.org/bugzilla/show_bug.cgi?id=2373
https://sourceware.org/bugzilla/show_bug.cgi?id=26034

> so nano would need to do another check afterward.

Yup. In GNU gettext, I've had abort()s due to the fact that some multibyte-to-
Unicode conversion allowed characters up to U+7FFFFFFF and a subsequent check
for a valid Unicode character (<= U+0010FFFF) failed.

You're right, this is worth documenting.

Bruno


2021-03-22  Benno Schulenberg  <[email protected]>  (tiny change)

	mbrtowc: Remove a redundant condition.
	* lib/mbrtowc-impl-utf8.h: There is no need to check for c == 0xf4 when
	!(c < 0xf4), as ten lines earlier c <= 0xf4 was established.

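
The redundancy argument in the ChangeLog entry above can be checked
exhaustively. The following is only a sketch: the function names cond_old,
cond_new and count_mismatches are hypothetical and do not appear in Gnulib;
only the two conditions themselves are copied from the patch. It compares
both forms of the test for every lead byte c up to a given bound and every
second byte c2.

```c
#include <stdbool.h>

/* The condition as written before the patch. */
static bool
cond_old (unsigned char c, unsigned char c2)
{
  return (c2 ^ 0x80) < 0x40
         && (c >= 0xf1 || c2 >= 0x90)
         && (c < 0xf4 || (c == 0xf4 && c2 < 0x90));
}

/* The condition with the 'c == 0xf4' test dropped (commented out in the
   actual patch).  */
static bool
cond_new (unsigned char c, unsigned char c2)
{
  return (c2 ^ 0x80) < 0x40
         && (c >= 0xf1 || c2 >= 0x90)
         && (c < 0xf4 || c2 < 0x90);
}

/* Hypothetical helper: count the (c, c2) pairs with c <= c_max on which
   the two conditions disagree.  */
static int
count_mismatches (int c_max)
{
  int mismatches = 0;
  for (int c = 0; c <= c_max; c++)
    for (int c2 = 0; c2 <= 0xff; c2++)
      if (cond_old ((unsigned char) c, (unsigned char) c2)
          != cond_new ((unsigned char) c, (unsigned char) c2))
        mismatches++;
  return mismatches;
}
```

With c_max = 0xf4 the two forms agree on every input. With c_max = 0xff they
do not (for example c = 0xf5, c2 = 0x80), so the simplification is only safe
because c <= 0xf4 has already been established at that point.
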
diff --git a/lib/mbrtowc-impl-utf8.h b/lib/mbrtowc-impl-utf8.h
index 4f3bbb3..58006d3 100644
--- a/lib/mbrtowc-impl-utf8.h
+++ b/lib/mbrtowc-impl-utf8.h
@@ -96,7 +96,7 @@
               if ((c2 ^ 0x80) < 0x40
                   && (c >= 0xf1 || c2 >= 0x90)
-                  && (c < 0xf4 || (c == 0xf4 && c2 < 0x90)))
+                  && (c < 0xf4 || (/* c == 0xf4 && */ c2 < 0x90)))
                 {
                   if (m == 2)
                     goto incomplete;


2021-03-22  Bruno Haible  <[email protected]>

	doc: Mention an open glibc bug.
	* doc/posix-functions/mbrtowc.texi: Mention the possible out-of-range
	wchar_t values returned by this function on glibc.
	* doc/posix-functions/mbtowc.texi: Likewise.

diff --git a/doc/posix-functions/mbrtowc.texi b/doc/posix-functions/mbrtowc.texi
index 897e4da..291207e 100644
--- a/doc/posix-functions/mbrtowc.texi
+++ b/doc/posix-functions/mbrtowc.texi
@@ -44,6 +44,12 @@ Solaris 9.
 Portability problems not fixed by Gnulib:
 @itemize
 @item
+In UTF-8 locales, this function may return wide characters up to 0x7FFFFFFF
+(that is, beyond 0x0010FFFF) on some platforms:
+@c https://sourceware.org/bugzilla/show_bug.cgi?id=2373
+@c https://sourceware.org/bugzilla/show_bug.cgi?id=26034
+glibc 2.33.
+@item
 On Windows and 32-bit AIX platforms, @code{wchar_t} is a 16-bit type and therefore cannot
 accommodate all Unicode characters.
 However, the ISO C11 function @code{mbrtoc32}, provided by Gnulib module
diff --git a/doc/posix-functions/mbtowc.texi b/doc/posix-functions/mbtowc.texi
index e7e5376..855f825 100644
--- a/doc/posix-functions/mbtowc.texi
+++ b/doc/posix-functions/mbtowc.texi
@@ -16,6 +16,12 @@ Android 4.4.
 Portability problems not fixed by Gnulib:
 @itemize
 @item
+In UTF-8 locales, this function may return wide characters up to 0x7FFFFFFF
+(that is, beyond 0x0010FFFF) on some platforms:
+@c https://sourceware.org/bugzilla/show_bug.cgi?id=2373
+@c https://sourceware.org/bugzilla/show_bug.cgi?id=26034
+glibc 2.33.
+@item
 This function accumulates hidden state on some platforms:
 glibc 2.8 (see @url{https://sourceware.org/bugzilla/show_bug.cgi?id=9674}).
@item
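
As the documentation change notes, callers on affected platforms may need
their own range check after the conversion. A minimal sketch of such a check
follows, assuming a non-NULL input string. The names in_unicode_range and
checked_mbrtowc are hypothetical and not part of Gnulib, glibc or nano. The
wrapper only rejects values above U+10FFFF and does not look at surrogates
or any other property.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: true if CODE is at most U+10FFFF.  */
static bool
in_unicode_range (unsigned long code)
{
  return code <= 0x10FFFFUL;
}

/* Hypothetical wrapper around mbrtowc: behaves like mbrtowc, except that
   a successfully converted wide character above U+10FFFF is reported as
   an encoding error, i.e. (size_t) -1 with errno set to EILSEQ.
   Assumes S is not NULL.  */
static size_t
checked_mbrtowc (wchar_t *pwc, const char *s, size_t n, mbstate_t *ps)
{
  wchar_t wc = L'\0';
  size_t ret = mbrtowc (&wc, s, n, ps);

  /* Pass invalid and incomplete sequences through unchanged.  */
  if (ret == (size_t) -1 || ret == (size_t) -2)
    return ret;

  /* A negative wchar_t becomes a huge unsigned value and is rejected.  */
  if (!in_unicode_range ((unsigned long) wc))
    {
      errno = EILSEQ;
      return (size_t) -1;
    }

  if (pwc != NULL)
    *pwc = wc;
  return ret;
}
```

Doing the check directly after the conversion turns an out-of-range value
into an ordinary conversion error at the call site, instead of letting it
reach a later validity check such as the one that aborted in GNU gettext.
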
