Yesterday I wrote:
> Now I got interested in
>   - whether the mb*iter* modules actually implement MEE,
>   - what's the behavioural difference between MEE and SEE, function by
>     function.

It turns out that the mb*iter* modules, so far, implement MEE (multi-byte
per encoding error) only for incomplete multibyte characters at the end of
the string. (That is the case where mbrtoc32 returns (size_t)(-2).) For
incomplete multibyte characters inside a string (that is the case where
mbrtoc32 returns (size_t)(-1)), these modules still implement SEE
(single-byte per encoding error). Ouch.

The effect is visible in several mbs* functions:
  mbslen, mbsnlen
  mbschr, mbsrchr
  mbscspn, mbspbrk, mbsspn
  mbsstr, mbscasestr
  mbs_startswith, mbs_endswith.

This series of patches implements MEE also for the case of incomplete
multibyte characters inside a string, as shown in
https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf page 128 table 3-11.


2026-05-25 Bruno Haible <[email protected]>

	mbuiterf: Implement multi-byte per encoding error (MEE) consistently.
	* lib/mbuiterf.h: Include mbiter-aux.h.
	(struct mbuif_state): Add field is_utf8.
	(mbuiterf_next): Invoke mbiter_is_utf8, mbiter_utf8_maximal_subpart.
	(mbuif_init): Initialize the field is_utf8.
	* modules/mbuiterf (Depends-on): Add mbiter-aux.
	* tests/test-mbslen.c (OR): New macro, copied from tests/test-mbsnlen.c.
	(main): Add more test cases with incomplete characters.
	* tests/test-mbschr2.sh: Renamed from tests/test-mbschr.sh.
	* tests/test-mbschr2.c: Renamed from tests/test-mbschr.c.
	* tests/test-mbschr1.sh: New file, based on tests/test-mbmemcasecmp-3.sh.
	* tests/test-mbschr1.c: New file.
	* modules/mbschr-tests (Files): Update accordingly. Add locale-en.m4, locale-fr.m4.
	(configure.ac): Invoke gt_LOCALE_EN_UTF8, gt_LOCALE_FR_UTF8.
	(Makefile.am): Arrange to compile test-mbschr1 and test-mbschr2 and to run test-mbschr1.sh, test-mbschr2.sh.
	* tests/test-mbsrchr2.sh: Renamed from tests/test-mbsrchr.sh.
	* tests/test-mbsrchr2.c: Renamed from tests/test-mbsrchr.c.
	* tests/test-mbsrchr1.sh: New file, based on tests/test-mbmemcasecmp-3.sh.
	* tests/test-mbsrchr1.c: New file.
	* modules/mbsrchr-tests (Files): Update accordingly. Add locale-en.m4, locale-fr.m4.
	(configure.ac): Invoke gt_LOCALE_EN_UTF8, gt_LOCALE_FR_UTF8.
	(Makefile.am): Arrange to compile test-mbsrchr1 and test-mbsrchr2 and to run test-mbsrchr1.sh, test-mbsrchr2.sh.
	* tests/test-mbscspn.c (OR): New macro, copied from tests/test-mbsnlen.c.
	(main): Add test cases with incomplete characters.
	* tests/test-mbspbrk.c (OR): New macro, copied from tests/test-mbsnlen.c.
	(main): Add test cases with incomplete characters.
	* tests/test-mbsspn.c (OR): New macro, copied from tests/test-mbsnlen.c.
	(main): Add test cases with incomplete characters.

2026-05-25 Bruno Haible <[email protected]>

	mbuiter: Implement multi-byte per encoding error (MEE) consistently.
	* lib/mbuiter.h: Include mbiter-aux.h.
	(struct mbuiter_multi): Add field is_utf8.
	(mbuiter_multi_next): Invoke mbiter_is_utf8, mbiter_utf8_maximal_subpart.
	(mbui_init): Initialize the field is_utf8.
	* modules/mbuiter (Depends-on): Add mbiter-aux.
	* tests/test-mbsstr2.c (OR): New macro, copied from tests/test-mbsnlen.c.
	(main): Add test cases with incomplete characters.
	* tests/test-mbsstr1.c: Update comments.
	* tests/test-mbsstr3.c: Likewise.
	* tests/test-mbscasestr2.c (OR): New macro, copied from tests/test-mbsnlen.c.
	(main): Add test cases with incomplete characters.
	* tests/test-mbscasestr1.c: Update comments.
	* tests/test-mbscasestr3.c: Likewise.
	* tests/test-mbscasestr4.c: Likewise.

2026-05-25 Bruno Haible <[email protected]>

	mbiterf: Implement multi-byte per encoding error (MEE) consistently.
	* lib/mbiterf.h: Include mbiter-aux.h.
	(struct mbif_state): Add field is_utf8.
	(mbiterf_next): Invoke mbiter_is_utf8, mbiter_utf8_maximal_subpart.
	(mbif_init): Initialize the field is_utf8.
	* modules/mbiterf (Depends-on): Add mbiter-aux.
	* tests/test-mbsnlen.c (main): Add test cases with incomplete characters not at the end of the string.

2026-05-25 Bruno Haible <[email protected]>

	mbiter: Implement multi-byte per encoding error (MEE) consistently.
	* lib/mbiter.h: Include mbiter-aux.h.
	(struct mbiter_multi): Add field is_utf8.
	(mbiter_multi_next): Invoke mbiter_is_utf8, mbiter_utf8_maximal_subpart.
	(mbi_init): Initialize the field is_utf8.
	* modules/mbiter (Depends-on): Add mbiter-aux.
	* tests/test-mbs_startswith2.c (main): Add test cases with incomplete characters not at the end of the string.
	* tests/test-mbs_endswith2.c (OR): New macro, copied from tests/test-mbsnlen.c.
	(main): Add test cases with incomplete characters not at the end of the string.

2026-05-25 Bruno Haible <[email protected]>

	mbiter-aux: New module.
	* lib/mbiter-aux.h: New file.
	* lib/mbiter-aux.c: New file. mbiter_is_utf8 is based on lib/localeinfo.c.
	* modules/mbiter-aux: New file.

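
To make the SEE/MEE distinction from the cover letter concrete, here is a
minimal standalone sketch in C. It is an illustration only, not code from
the patches: it hard-codes UTF-8 instead of going through mbrtoc32 and the
locale, and the names utf8_prefix_len and count_units are hypothetical. It
counts "units" (valid characters or encoding errors) in a byte string under
either policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Is B a UTF-8 continuation byte (0x80..0xBF)?  */
static bool
is_cont (unsigned char b)
{
  return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: returns the length of the longest prefix of
   S[0..N-1] that is a prefix of a well-formed UTF-8 sequence (at least 1),
   and sets *COMPLETE to true if that prefix is a whole character.  */
static size_t
utf8_prefix_len (const unsigned char *s, size_t n, bool *complete)
{
  assert (n >= 1);
  unsigned char c = s[0];
  unsigned char lo = 0x80, hi = 0xBF; /* allowed range of the 2nd byte */
  size_t need;                        /* length of the full sequence */

  *complete = false;
  if (c < 0x80)
    {
      *complete = true;
      return 1;
    }
  else if (c >= 0xC2 && c <= 0xDF)
    need = 2;
  else if (c >= 0xE0 && c <= 0xEF)
    {
      need = 3;
      if (c == 0xE0) lo = 0xA0;       /* no overlong encodings */
      if (c == 0xED) hi = 0x9F;       /* no surrogates */
    }
  else if (c >= 0xF0 && c <= 0xF4)
    {
      need = 4;
      if (c == 0xF0) lo = 0x90;       /* no overlong encodings */
      if (c == 0xF4) hi = 0x8F;       /* nothing above U+10FFFF */
    }
  else
    return 1;                         /* byte can never start a character */

  size_t i = 1;
  if (i < n && s[i] >= lo && s[i] <= hi)
    for (i = 2; i < need && i < n && is_cont (s[i]); i++)
      continue;
  *complete = (i == need);
  return i;
}

/* Counts the units in STR[0..N-1].  With MEE, an encoding error consumes
   the whole maximal subpart; with SEE (MEE == false), it consumes 1 byte.  */
size_t
count_units (const char *str, size_t n, bool mee)
{
  const unsigned char *s = (const unsigned char *) str;
  size_t count = 0;
  while (n > 0)
    {
      bool complete;
      size_t len = utf8_prefix_len (s, n, &complete);
      if (!complete && !mee)
        len = 1;
      s += len;
      n -= len;
      count++;
    }
  return count;
}
```

For the inputs used in the tests/test-mbsnlen.c additions of patch 3/5,
count_units with mee = true gives the first alternative of each OR(a,b)
and with mee = false the second one; for example "\342\202\342\202" counts
as 2 units under MEE and 4 under SEE.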
>From 86e2abad044fb89d3a848a9117b91e5f25946c1a Mon Sep 17 00:00:00 2001 From: Bruno Haible <[email protected]> Date: Mon, 25 May 2026 18:58:12 +0200 Subject: [PATCH 1/5] mbiter-aux: New module. * lib/mbiter-aux.h: New file. * lib/mbiter-aux.c: New file. mbiter_is_utf8 is based on lib/localeinfo.c. * modules/mbiter-aux: New file. --- ChangeLog | 8 +++++ lib/mbiter-aux.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++ lib/mbiter-aux.h | 44 +++++++++++++++++++++++++ modules/mbiter-aux | 31 ++++++++++++++++++ 4 files changed, 164 insertions(+) create mode 100644 lib/mbiter-aux.c create mode 100644 lib/mbiter-aux.h create mode 100644 modules/mbiter-aux diff --git a/ChangeLog b/ChangeLog index 7c1b291c75..7fdeebc8dd 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,11 @@ +2026-05-25 Bruno Haible <[email protected]> + + mbiter-aux: New module. + * lib/mbiter-aux.h: New file. + * lib/mbiter-aux.c: New file. mbiter_is_utf8 is based on + lib/localeinfo.c. + * modules/mbiter-aux: New file. + 2026-05-25 Waldemar Brodkorb <[email protected]> mbrtoc32: do not optimze for uClibc-ng diff --git a/lib/mbiter-aux.c b/lib/mbiter-aux.c new file mode 100644 index 0000000000..125e02ebe3 --- /dev/null +++ b/lib/mbiter-aux.c @@ -0,0 +1,81 @@ +/* Auxiliary functions for iterating through multibyte strings. + Copyright (C) 2026 Free Software Foundation, Inc. + + This file is free software: you can redistribute it and/or modify + it under the terms of the GNU Lesser General Public License as + published by the Free Software Foundation; either version 2.1 of the + License, or (at your option) any later version. + + This file is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public License + along with this program. 
If not, see <https://www.gnu.org/licenses/>. */ + +/* Written by Bruno Haible <[email protected]>. */ + +#include <config.h> + +/* Specification. */ +#include "mbiter-aux.h" + +#include <uchar.h> + +bool +mbiter_is_utf8 (int *cache) +{ + if (*cache < 0) + { + /* UTF-8 is the only encoding in use which maps the bytes 0xC4 0x80 + to U+0100. (See libiconv/tests/*.TXT for all the mapping tables.) + We can assume that in this case, the char32_t encoding is Unicode + (not platform-dependent like for other locale encodings). */ + mbstate_t state; mbszero (&state); + char32_t wc; + *cache = (mbrtoc32 (&wc, "\xc4\x80", 2, &state) == 2 && wc == 0x100); + } + return *cache; +} + +/* If the current locale encoding is UTF-8 and a preceding + mbrtoc32 (&uc, S, N, &state) + invocation returned (size_t) -1, this function returns the number of + initial bytes that form a maximal subpart in the sense of + https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf page 127..129. + The result is >= 1, <= N. */ +size_t +mbiter_utf8_maximal_subpart (const char *s, size_t n) +{ + /* Based on lib/unistr/u8-mbtouc.c. */ + if (n >= 2) + { + unsigned char c = (unsigned char) *s; + if (c >= 0xe0) + { + if (c < 0xf0) + { + unsigned char c2 = (unsigned char) s[1]; + if ((c2 ^ 0x80) < 0x40 + && (c >= 0xe1 || c2 >= 0xa0) + && (c != 0xed || c2 < 0xa0)) + return 2; + } + else if (c <= 0xf4) + { + unsigned char c2 = (unsigned char) s[1]; + if ((c2 ^ 0x80) < 0x40 + && (c >= 0xf1 || c2 >= 0x90) + && (c < 0xf4 || (/* c == 0xf4 && */ c2 < 0x90))) + { + if (n >= 3 && ((unsigned char) s[2] ^ 0x80) < 0x40) + return 3; + else + return 2; + } + } + } + } + return 1; +} diff --git a/lib/mbiter-aux.h b/lib/mbiter-aux.h new file mode 100644 index 0000000000..972b4c0264 --- /dev/null +++ b/lib/mbiter-aux.h @@ -0,0 +1,44 @@ +/* Auxiliary functions for iterating through multibyte strings. + Copyright (C) 2026 Free Software Foundation, Inc. 
+ + This file is free software: you can redistribute it and/or modify + it under the terms of the GNU Lesser General Public License as + published by the Free Software Foundation; either version 2.1 of the + License, or (at your option) any later version. + + This file is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public License + along with this program. If not, see <https://www.gnu.org/licenses/>. */ + +/* Written by Bruno Haible <[email protected]>. */ + +#ifndef _MBITER_AUX_H +#define _MBITER_AUX_H 1 + +#include <stddef.h> + +#ifdef __cplusplus +extern "C" { +#endif + +/* Determines whether the current locale encoding is UTF-8. + Stores the value in *CACHE, that should be pre-initialized with -1. */ +extern bool mbiter_is_utf8 (int *cache); + +/* If the current locale encoding is UTF-8 and a preceding + mbrtoc32 (&uc, S, N, &state) + invocation returned (size_t) -1, this function returns the number of + initial bytes that form a maximal subpart in the sense of + https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf page 127..129. + The result is >= 1, <= N. */ +extern size_t mbiter_utf8_maximal_subpart (const char *s, size_t n); + +#ifdef __cplusplus +} +#endif + +#endif /* _MBITER_AUX_H */ diff --git a/modules/mbiter-aux b/modules/mbiter-aux new file mode 100644 index 0000000000..55086bdc24 --- /dev/null +++ b/modules/mbiter-aux @@ -0,0 +1,31 @@ +Description: +Auxiliary functions for iterating through multibyte strings. 
+ +Files: +lib/mbiter-aux.h +lib/mbiter-aux.c + +Depends-on: +mbrtoc32 +mbsinit +mbszero +bool + +configure.ac: + +Makefile.am: +lib_SOURCES += mbiter-aux.h mbiter-aux.c + +Include: +"mbiter-aux.h" + +Link: +$(LTLIBUNISTRING) when linking with libtool, $(LIBUNISTRING) otherwise +$(MBRTOWC_LIB) +$(LTLIBC32CONV) when linking with libtool, $(LIBC32CONV) otherwise + +License: +LGPLv2+ + +Maintainer: +all -- 2.54.0
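
As a worked example of what mbiter_utf8_maximal_subpart returns, the sketch
below reproduces the function body from lib/mbiter-aux.c in the patch above
(so that it compiles without gnulib) and checks it on a few ill-formed
inputs. The function demo_maximal_subpart is hypothetical and not part of
the patch; the inputs are all sequences for which mbrtoc32 would return
(size_t) -1 in a UTF-8 locale, which is the documented precondition.

```c
#include <assert.h>
#include <stddef.h>

/* Copy of mbiter_utf8_maximal_subpart from the patch, for illustration.  */
size_t
mbiter_utf8_maximal_subpart (const char *s, size_t n)
{
  if (n >= 2)
    {
      unsigned char c = (unsigned char) *s;
      if (c >= 0xe0)
        {
          if (c < 0xf0)
            {
              unsigned char c2 = (unsigned char) s[1];
              if ((c2 ^ 0x80) < 0x40
                  && (c >= 0xe1 || c2 >= 0xa0)
                  && (c != 0xed || c2 < 0xa0))
                return 2;
            }
          else if (c <= 0xf4)
            {
              unsigned char c2 = (unsigned char) s[1];
              if ((c2 ^ 0x80) < 0x40
                  && (c >= 0xf1 || c2 >= 0x90)
                  && (c < 0xf4 || (/* c == 0xf4 && */ c2 < 0x90)))
                {
                  if (n >= 3 && ((unsigned char) s[2] ^ 0x80) < 0x40)
                    return 3;
                  else
                    return 2;
                }
            }
        }
    }
  return 1;
}

/* Hypothetical demo, not part of the patch.  */
void
demo_maximal_subpart (void)
{
  /* E1 80 'X': E1 80 starts a 3-byte character, 'X' breaks it.  */
  assert (mbiter_utf8_maximal_subpart ("\341\200X", 3) == 2);
  /* F0 91 92 'X': three bytes of a 4-byte character, then 'X'.  */
  assert (mbiter_utf8_maximal_subpart ("\360\221\222X", 4) == 3);
  /* F0 91 'X': only two bytes of a 4-byte character.  */
  assert (mbiter_utf8_maximal_subpart ("\360\221X", 3) == 2);
  /* ED A0 80: a surrogate; ED A0 is already ill-formed.  */
  assert (mbiter_utf8_maximal_subpart ("\355\240\200", 3) == 1);
  /* F4 90 80 80: beyond U+10FFFF; F4 90 is already ill-formed.  */
  assert (mbiter_utf8_maximal_subpart ("\364\220\200\200", 4) == 1);
  /* C3 'X': a 2-byte lead byte; the subpart is the lead byte alone.  */
  assert (mbiter_utf8_maximal_subpart ("\303X", 2) == 1);
}
```

The result 2 for E1 80 'X' is what makes the iterator consume E1 80 as a
single encoding error, so that a needle starting at the 0x80 byte no longer
lines up with a unit boundary.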
>From 1429a919428e4dd697903e685476025034315541 Mon Sep 17 00:00:00 2001 From: Bruno Haible <[email protected]> Date: Mon, 25 May 2026 19:06:55 +0200 Subject: [PATCH 2/5] mbiter: Implement multi-byte per encoding error consistently. * lib/mbiter.h: Include mbiter-aux.h. (struct mbiter_multi): Add field is_utf8. (mbiter_multi_next): Invoke mbiter_is_utf8, mbiter_utf8_maximal_subpart. (mbi_init): Initialize the field is_utf8. * modules/mbiter (Depends-on): Add mbiter-aux. * tests/test-mbs_startswith2.c (main): Add test cases with incomplete characters not at the end of the string. * tests/test-mbs_endswith2.c (OR): New macro, copied from tests/test-mbsnlen.c. (main): Add test cases with incomplete characters not at the end of the string. --- ChangeLog | 15 +++++++++++++++ lib/mbiter.h | 22 +++++++++++++++------- modules/mbiter | 1 + tests/test-mbs_endswith2.c | 17 +++++++++++++++++ tests/test-mbs_startswith2.c | 3 +++ 5 files changed, 51 insertions(+), 7 deletions(-) diff --git a/ChangeLog b/ChangeLog index 7fdeebc8dd..891109da37 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,18 @@ +2026-05-25 Bruno Haible <[email protected]> + + mbiter: Implement multi-byte per encoding error (MEE) consistently. + * lib/mbiter.h: Include mbiter-aux.h. + (struct mbiter_multi): Add field is_utf8. + (mbiter_multi_next): Invoke mbiter_is_utf8, mbiter_utf8_maximal_subpart. + (mbi_init): Initialize the field is_utf8. + * modules/mbiter (Depends-on): Add mbiter-aux. + * tests/test-mbs_startswith2.c (main): Add test cases with incomplete + characters not at the end of the string. + * tests/test-mbs_endswith2.c (OR): New macro, copied from + tests/test-mbsnlen.c. + (main): Add test cases with incomplete characters not at the end of the + string. + 2026-05-25 Bruno Haible <[email protected]> mbiter-aux: New module. 
diff --git a/lib/mbiter.h b/lib/mbiter.h index 78bcd39d17..d069f2773d 100644 --- a/lib/mbiter.h +++ b/lib/mbiter.h @@ -95,6 +95,7 @@ #include <wchar.h> #include "mbchar.h" +#include "mbiter-aux.h" _GL_INLINE_HEADER_BEGIN #ifndef MBITER_INLINE @@ -119,6 +120,7 @@ struct mbiter_multi before and after every mbiter_multi_next invocation. */ bool next_done; /* true if mbi_avail has already filled the following */ + int is_utf8; /* A cache of mbiter_is_utf8. */ struct mbchar cur; /* the current character: const char *cur.ptr pointer to current character The following are only valid after mbi_avail. @@ -155,14 +157,18 @@ mbiter_multi_next (struct mbiter_multi *iter) assert (mbsinit (&iter->state)); #if !GNULIB_MBRTOC32_REGULAR iter->in_shift = true; - with_shift: + with_shift:; #endif - iter->cur.bytes = mbrtoc32 (&iter->cur.wc, iter->cur.ptr, - iter->limit - iter->cur.ptr, &iter->state); + size_t avail_bytes = iter->limit - iter->cur.ptr; + iter->cur.bytes = mbrtoc32 (&iter->cur.wc, iter->cur.ptr, avail_bytes, + &iter->state); if (iter->cur.bytes == (size_t) -1) { /* An invalid multibyte sequence was encountered. */ - iter->cur.bytes = 1; + iter->cur.bytes = + (mbiter_is_utf8 (&iter->is_utf8) + ? mbiter_utf8_maximal_subpart (iter->cur.ptr, avail_bytes) + : 1); iter->cur.wc_valid = false; /* Allow the next invocation to continue from a sane state. */ #if !GNULIB_MBRTOC32_REGULAR @@ -173,7 +179,7 @@ mbiter_multi_next (struct mbiter_multi *iter) else if (iter->cur.bytes == (size_t) -2) { /* An incomplete multibyte character at the end. */ - iter->cur.bytes = iter->limit - iter->cur.ptr; + iter->cur.bytes = avail_bytes; iter->cur.wc_valid = false; #if !GNULIB_MBRTOC32_REGULAR /* Cause the next mbi_avail invocation to return false. 
*/ @@ -237,13 +243,15 @@ typedef struct mbiter_multi mbi_iterator_t; #define mbi_init(iter, startptr, length) \ ((iter).cur.ptr = (startptr), (iter).limit = (iter).cur.ptr + (length), \ (iter).in_shift = false, mbszero (&(iter).state), \ - (iter).next_done = false) + (iter).next_done = false, \ + (iter).is_utf8 = -1) #else /* Optimized: no in_shift. */ #define mbi_init(iter, startptr, length) \ ((iter).cur.ptr = (startptr), (iter).limit = (iter).cur.ptr + (length), \ mbszero (&(iter).state), \ - (iter).next_done = false) + (iter).next_done = false, \ + (iter).is_utf8 = -1) #endif #if !GNULIB_MBRTOC32_REGULAR #define mbi_avail(iter) \ diff --git a/modules/mbiter b/modules/mbiter index 19c75b4ae1..c16d36e213 100644 --- a/modules/mbiter +++ b/modules/mbiter @@ -13,6 +13,7 @@ mbchar mbrtoc32 mbsinit mbszero +mbiter-aux uchar-h bool diff --git a/tests/test-mbs_endswith2.c b/tests/test-mbs_endswith2.c index 17ccc1f6e0..d9bcc9f881 100644 --- a/tests/test-mbs_endswith2.c +++ b/tests/test-mbs_endswith2.c @@ -25,6 +25,20 @@ #include "macros.h" +/* The mcel-based implementation of mbsnlen behaves differently than the + original one. Namely, for invalid/incomplete byte sequences: + Where we ideally should have multi-byte-per-encoding-error (MEE) behaviour + everywhere, mcel implements single-byte-per-encoding-error (SEE) behaviour. + See <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00131.html>, + <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00145.html>. + Therefore, here we have different expected results, depending on the + implementation. */ +#if GNULIB_MCEL_PREFER +# define OR(a,b) b +#else +# define OR(a,b) a +#endif + int main () { @@ -92,10 +106,13 @@ main () /* "\341\200\240" = 0xE1 0x80 0xA0 = U+1020. */ ASSERT (!mbs_endswith ("\341\200\240", "\200\240")); ASSERT (!mbs_endswith ("\341\200\240", "\240")); + ASSERT (mbs_endswith ("\341\200X", "\200X") == OR(false,true)); /* "\360\221\222\240" = 0xF0 0x91 0x92 0xA0 = U+114A0. 
*/ ASSERT (!mbs_endswith ("\360\221\222\240", "\221\222\240")); ASSERT (!mbs_endswith ("\360\221\222\240", "\222\240")); ASSERT (!mbs_endswith ("\360\221\222\240", "\240")); + ASSERT (mbs_endswith ("\360\221\222X", "\222X") == OR(false,true)); + ASSERT (mbs_endswith ("\360\221X", "\221X") == OR(false,true)); /* Two invalid characters should match only if they are identical. */ /* "\301\246" = 0xC1 0xA6 is invalid. diff --git a/tests/test-mbs_startswith2.c b/tests/test-mbs_startswith2.c index 0ab6a9eeaf..069503af3e 100644 --- a/tests/test-mbs_startswith2.c +++ b/tests/test-mbs_startswith2.c @@ -114,6 +114,7 @@ main () ASSERT (!mbs_startswith ("\341\200\240", "\341\200")); ASSERT (!mbs_startswith ("\341\200\240", "\341")); ASSERT (mbs_startswith ("\341\200", "\341") == OR(false,true)); + ASSERT (mbs_startswith ("\341\200\341\200", "\341\200")); /* "\360\221\222\240" = 0xF0 0x91 0x92 0xA0 = U+114A0. */ ASSERT (!mbs_startswith ("\360\221\222\240", "\360\221\222")); ASSERT (!mbs_startswith ("\360\221\222\240", "\360\221")); @@ -121,6 +122,8 @@ main () ASSERT (mbs_startswith ("\360\221\222", "\360\221") == OR(false,true)); ASSERT (mbs_startswith ("\360\221\222", "\360") == OR(false,true)); ASSERT (mbs_startswith ("\360\221", "\360") == OR(false,true)); + ASSERT (mbs_startswith ("\360\221\222\360\221\222", "\360\221\222")); + ASSERT (mbs_startswith ("\360\221\360\221", "\360\221")); /* "\355\240\200" = 0xED 0xA0 0x80 = U+D800 is invalid. In fact, "\355\240" = 0xED 0xA0 is already invalid, see -- 2.54.0
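
The is_utf8 field that mbi_init sets to -1 is a tri-state cache: -1 means
"not yet determined", 0 and 1 are the cached answer. The following
standalone sketch shows that pattern under some assumptions: it uses the
C11 mbrtoc32 from <uchar.h> directly, memset stands in for gnulib's
mbszero, and the name is_utf8_cached is hypothetical. Which answer it
computes depends on the locale in effect when it is first called.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Returns true if the current locale encoding appears to be UTF-8.
   *CACHE must be -1 before the first call; afterwards it holds 0 or 1
   and the locale is not probed again.  */
bool
is_utf8_cached (int *cache)
{
  if (*cache < 0)
    {
      mbstate_t state;
      memset (&state, 0, sizeof state); /* stand-in for mbszero */
      char32_t wc;
      /* Probe: only UTF-8 maps the bytes 0xC4 0x80 to U+0100.  */
      *cache = (mbrtoc32 (&wc, "\xc4\x80", 2, &state) == 2 && wc == 0x100);
    }
  return *cache;
}
```

Keeping the cache in the iterator state rather than in a global variable
means that the probe runs at most once per iteration, and only when an
invalid sequence is actually encountered.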
>From 151d374a800befdcd7f5d186b11a80c372408317 Mon Sep 17 00:00:00 2001 From: Bruno Haible <[email protected]> Date: Tue, 26 May 2026 00:42:39 +0200 Subject: [PATCH 3/5] mbiterf: Implement multi-byte per encoding error (MEE) consistently. * lib/mbiterf.h: Include mbiter-aux.h. (struct mbif_state): Add field is_utf8. (mbiterf_next): Invoke mbiter_is_utf8, mbiter_utf8_maximal_subpart. (mbif_init): Initialize the field is_utf8. * modules/mbiterf (Depends-on): Add mbiter-aux. * tests/test-mbsnlen.c (main): Add test cases with incomplete characters not at the end of the string. --- ChangeLog | 11 +++++++++++ lib/mbiterf.h | 19 ++++++++++++++----- modules/mbiterf | 1 + tests/test-mbsnlen.c | 9 +++++++++ 4 files changed, 35 insertions(+), 5 deletions(-) diff --git a/ChangeLog b/ChangeLog index 891109da37..1e7d2e1e94 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,14 @@ +2026-05-25 Bruno Haible <[email protected]> + + mbiterf: Implement multi-byte per encoding error (MEE) consistently. + * lib/mbiterf.h: Include mbiter-aux.h. + (struct mbif_state): Add field is_utf8. + (mbiterf_next): Invoke mbiter_is_utf8, mbiter_utf8_maximal_subpart. + (mbif_init): Initialize the field is_utf8. + * modules/mbiterf (Depends-on): Add mbiter-aux. + * tests/test-mbsnlen.c (main): Add test cases with incomplete characters + not at the end of the string. + 2026-05-25 Bruno Haible <[email protected]> mbiter: Implement multi-byte per encoding error (MEE) consistently. diff --git a/lib/mbiterf.h b/lib/mbiterf.h index 50a39c0f92..655c4d49c4 100644 --- a/lib/mbiterf.h +++ b/lib/mbiterf.h @@ -86,6 +86,7 @@ #include <wchar.h> #include "mbchar.h" +#include "mbiter-aux.h" _GL_INLINE_HEADER_BEGIN #ifndef MBITERF_INLINE @@ -108,6 +109,7 @@ struct mbif_state /* If GNULIB_MBRTOC32_REGULAR, it is in an initial state before and after every mbiterf_next invocation. */ + int is_utf8; /* A cache of mbiter_is_utf8. 
*/ }; MBITERF_INLINE mbchar_t @@ -135,18 +137,23 @@ mbiterf_next (struct mbif_state *ps, const char *iter, const char *endptr) ps->in_shift = true; with_shift:; #endif + size_t avail_bytes = endptr - iter; size_t bytes; char32_t wc; - bytes = mbrtoc32 (&wc, iter, endptr - iter, &ps->state); + bytes = mbrtoc32 (&wc, iter, avail_bytes, &ps->state); if (bytes == (size_t) -1) { /* An invalid multibyte sequence was encountered. */ + size_t ebytes = + (mbiter_is_utf8 (&ps->is_utf8) + ? mbiter_utf8_maximal_subpart (iter, avail_bytes) + : 1); /* Allow the next invocation to continue from a sane state. */ #if !GNULIB_MBRTOC32_REGULAR ps->in_shift = false; #endif mbszero (&ps->state); - return (mbchar_t) { .ptr = iter, .bytes = 1, .wc_valid = false }; + return (mbchar_t) { .ptr = iter, .bytes = ebytes, .wc_valid = false }; } else if (bytes == (size_t) -2) { @@ -156,7 +163,7 @@ mbiterf_next (struct mbif_state *ps, const char *iter, const char *endptr) #endif /* Whether to reset ps->state or not is not important; the string end is reached anyway. */ - return (mbchar_t) { .ptr = iter, .bytes = endptr - iter, .wc_valid = false }; + return (mbchar_t) { .ptr = iter, .bytes = avail_bytes, .wc_valid = false }; } else { @@ -189,11 +196,13 @@ mbiterf_next (struct mbif_state *ps, const char *iter, const char *endptr) typedef struct mbif_state mbif_state_t; #if !GNULIB_MBRTOC32_REGULAR #define mbif_init(st) \ - ((st).in_shift = false, mbszero (&(st).state)) + ((st).in_shift = false, mbszero (&(st).state), \ + (st).is_utf8 = -1) #else /* Optimized: no in_shift. 
*/ #define mbif_init(st) \ - (mbszero (&(st).state)) + (mbszero (&(st).state), \ + (st).is_utf8 = -1) #endif #if !GNULIB_MBRTOC32_REGULAR #define mbif_avail(st, iter, endptr) ((st).in_shift || ((iter) < (endptr))) diff --git a/modules/mbiterf b/modules/mbiterf index 9826b77c73..7825271c3b 100644 --- a/modules/mbiterf +++ b/modules/mbiterf @@ -13,6 +13,7 @@ mbchar mbrtoc32 mbsinit mbszero +mbiter-aux uchar-h bool diff --git a/tests/test-mbsnlen.c b/tests/test-mbsnlen.c index 66a4d70a88..4d4bfff475 100644 --- a/tests/test-mbsnlen.c +++ b/tests/test-mbsnlen.c @@ -82,9 +82,18 @@ main () ASSERT (mbsnlen ("\360\237\220\203", 4) == 1); ASSERT (mbsnlen ("\360\237\220\203", 5) == 2); + /* Incomplete characters. See + https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf + page 128 table 3-11. */ + ASSERT (mbsnlen ("\303", 1) == 1); /* invalid multibyte sequence */ + ASSERT (mbsnlen ("\303\303", 2) == 2); /* 2x invalid multibyte sequence */ + ASSERT (mbsnlen ("\342\202", 2) == OR(1,2)); /* invalid multibyte sequence */ + ASSERT (mbsnlen ("\342\202\342\202", 4) == 2 * OR(1,2)); /* 2x invalid multibyte sequence */ + ASSERT (mbsnlen ("\360\237\220", 3) == OR(1,3)); /* invalid multibyte sequence */ + ASSERT (mbsnlen ("\360\237\220\360\237\220", 6) == 2 * OR(1,3)); /* 2x invalid multibyte sequence */ return test_exit_status; } -- 2.54.0
From 8bfd5e31981e99480993c44de6f77e572ecc0a2a Mon Sep 17 00:00:00 2001 From: Bruno Haible <[email protected]> Date: Tue, 26 May 2026 00:53:31 +0200 Subject: [PATCH 4/5] mbuiter: Implement multi-byte per encoding error (MEE) consistently. * lib/mbuiter.h: Include mbiter-aux.h. (struct mbuiter_multi): Add field is_utf8. (mbuiter_multi_next): Invoke mbiter_is_utf8, mbiter_utf8_maximal_subpart. (mbui_init): Initialize the field is_utf8. * modules/mbuiter (Depends-on): Add mbiter-aux. * tests/test-mbsstr2.c (OR): New macro, copied from tests/test-mbsnlen.c. (main): Add test cases with incomplete characters. * tests/test-mbsstr1.c: Update comments. * tests/test-mbsstr3.c: Likewise. * tests/test-mbscasestr2.c (OR): New macro, copied from tests/test-mbsnlen.c. (main): Add test cases with incomplete characters. * tests/test-mbscasestr1.c: Update comments. * tests/test-mbscasestr3.c: Likewise. * tests/test-mbscasestr4.c: Likewise. --- ChangeLog | 21 ++++++++++ lib/mbuiter.h | 17 +++++--- modules/mbuiter | 1 + tests/test-mbscasestr1.c | 2 +- tests/test-mbscasestr2.c | 84 +++++++++++++++++++++++++++++++++++++++- tests/test-mbscasestr3.c | 2 +- tests/test-mbscasestr4.c | 2 +- tests/test-mbsstr1.c | 2 +- tests/test-mbsstr2.c | 84 +++++++++++++++++++++++++++++++++++++++- tests/test-mbsstr3.c | 2 +- 10 files changed, 204 insertions(+), 13 deletions(-) diff --git a/ChangeLog b/ChangeLog index 1e7d2e1e94..7c6b25c471 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,24 @@ +2026-05-25 Bruno Haible <[email protected]> + + mbuiter: Implement multi-byte per encoding error (MEE) consistently. + * lib/mbuiter.h: Include mbiter-aux.h. + (struct mbuiter_multi): Add field is_utf8. + (mbuiter_multi_next): Invoke mbiter_is_utf8, + mbiter_utf8_maximal_subpart. + (mbui_init): Initialize the field is_utf8. + * modules/mbuiter (Depends-on): Add mbiter-aux. + * tests/test-mbsstr2.c (OR): New macro, copied from + tests/test-mbsnlen.c. + (main): Add test cases with incomplete characters. 
+ * tests/test-mbsstr1.c: Update comments. + * tests/test-mbsstr3.c: Likewise. + * tests/test-mbscasestr2.c (OR): New macro, copied from + tests/test-mbsnlen.c. + (main): Add test cases with incomplete characters. + * tests/test-mbscasestr1.c: Update comments. + * tests/test-mbscasestr3.c: Likewise. + * tests/test-mbscasestr4.c: Likewise. + 2026-05-25 Bruno Haible <[email protected]> mbiterf: Implement multi-byte per encoding error (MEE) consistently. diff --git a/lib/mbuiter.h b/lib/mbuiter.h index 0f13e732e1..b5bb18305f 100644 --- a/lib/mbuiter.h +++ b/lib/mbuiter.h @@ -103,6 +103,7 @@ #include <wchar.h> #include "mbchar.h" +#include "mbiter-aux.h" #include "strnlen1.h" _GL_INLINE_HEADER_BEGIN @@ -128,6 +129,7 @@ struct mbuiter_multi */ bool next_done; /* true if mbui_avail has already filled the following */ unsigned int cur_max; /* A cache of MB_CUR_MAX. */ + int is_utf8; /* A cache of mbiter_is_utf8. */ struct mbchar cur; /* the current character: const char *cur.ptr pointer to current character The following are only valid after mbui_avail. @@ -164,15 +166,18 @@ mbuiter_multi_next (struct mbuiter_multi *iter) assert (mbsinit (&iter->state)); #if !GNULIB_MBRTOC32_REGULAR iter->in_shift = true; - with_shift: + with_shift:; #endif - iter->cur.bytes = mbrtoc32 (&iter->cur.wc, iter->cur.ptr, - strnlen1 (iter->cur.ptr, iter->cur_max), + size_t avail_bytes = strnlen1 (iter->cur.ptr, iter->cur_max); + iter->cur.bytes = mbrtoc32 (&iter->cur.wc, iter->cur.ptr, avail_bytes, &iter->state); if (iter->cur.bytes == (size_t) -1) { /* An invalid multibyte sequence was encountered. */ - iter->cur.bytes = 1; + iter->cur.bytes = + (mbiter_is_utf8 (&iter->is_utf8) + ? mbiter_utf8_maximal_subpart (iter->cur.ptr, avail_bytes) + : 1); iter->cur.wc_valid = false; /* Allow the next invocation to continue from a sane state. 
*/ #if !GNULIB_MBRTOC32_REGULAR @@ -243,14 +248,14 @@ typedef struct mbuiter_multi mbui_iterator_t; ((iter).cur.ptr = (startptr), \ (iter).in_shift = false, mbszero (&(iter).state), \ (iter).next_done = false, \ - (iter).cur_max = MB_CUR_MAX) + (iter).cur_max = MB_CUR_MAX, (iter).is_utf8 = -1) #else /* Optimized: no in_shift. */ #define mbui_init(iter, startptr) \ ((iter).cur.ptr = (startptr), \ mbszero (&(iter).state), \ (iter).next_done = false, \ - (iter).cur_max = MB_CUR_MAX) + (iter).cur_max = MB_CUR_MAX, (iter).is_utf8 = -1) #endif #define mbui_avail(iter) \ (mbuiter_multi_next (&(iter)), !mb_isnul ((iter).cur)) diff --git a/modules/mbuiter b/modules/mbuiter index d9deba2d4b..f9daba54e2 100644 --- a/modules/mbuiter +++ b/modules/mbuiter @@ -13,6 +13,7 @@ mbchar mbrtoc32 mbsinit mbszero +mbiter-aux uchar-h bool strnlen1 diff --git a/tests/test-mbscasestr1.c b/tests/test-mbscasestr1.c index 1d98bef956..ddc3ab84dd 100644 --- a/tests/test-mbscasestr1.c +++ b/tests/test-mbscasestr1.c @@ -1,4 +1,4 @@ -/* Test of case-insensitive searching in a string. +/* Test of case-insensitive searching in a string in the "C" locale. Copyright (C) 2007-2026 Free Software Foundation, Inc. This program is free software: you can redistribute it and/or modify diff --git a/tests/test-mbscasestr2.c b/tests/test-mbscasestr2.c index 7f05ebf91f..66582eba40 100644 --- a/tests/test-mbscasestr2.c +++ b/tests/test-mbscasestr2.c @@ -1,4 +1,4 @@ -/* Test of searching in a string. +/* Test of searching in a string in a UTF-8 locale. Copyright (C) 2007-2026 Free Software Foundation, Inc. This program is free software: you can redistribute it and/or modify @@ -25,6 +25,20 @@ #include "macros.h" +/* The mcel-based implementation of mbsnlen behaves differently than the + original one. Namely, for invalid/incomplete byte sequences: + Where we ideally should have multi-byte-per-encoding-error (MEE) behaviour + everywhere, mcel implements single-byte-per-encoding-error (SEE) behaviour. 
+ See <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00131.html>, + <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00145.html>. + Therefore, here we have different expected results, depending on the + implementation. */ +#if GNULIB_MCEL_PREFER +# define OR(a,b) b +#else +# define OR(a,b) a +#endif + int main () { @@ -50,6 +64,74 @@ main () ASSERT (result == NULL); } + /* Incomplete characters. See + https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf + page 128 table 3-11. */ + + /* "\341\200\240" = 0xE1 0x80 0xA0 = U+1020. */ + { + const char input[] = "f\341\200\341\200"; + const char *result = mbscasestr (input, ""); + ASSERT (result == input); + } + { + const char input[] = "f\341\200\341\200"; + const char *result = mbscasestr (input, "\341\200"); + ASSERT (result == input + 1); + } + { + const char input[] = "f\341\200\341\200"; + const char *result = mbscasestr (input, "\200\341"); + ASSERT (result == OR (NULL, input + 2)); + } + + /* "\360\221\222\240" = 0xF0 0x91 0x92 0xA0 = U+114A0. 
*/ + { + const char input[] = "f\360\221\222\360\221\222"; + const char *result = mbscasestr (input, ""); + ASSERT (result == input); + } + { + const char input[] = "f\360\221\222\360\221\222"; + const char *result = mbscasestr (input, "\360\221\222"); + ASSERT (result == input + 1); + } + { + const char input[] = "f\360\221\222\360\221\222"; + const char *result = mbscasestr (input, "\221\222\360\221"); + ASSERT (result == OR (NULL, input + 2)); + } + { + const char input[] = "f\360\221\222\360\221\222"; + const char *result = mbscasestr (input, "\221\222\360"); + ASSERT (result == OR (NULL, input + 2)); + } + { + const char input[] = "f\360\221\222\360\221\222"; + const char *result = mbscasestr (input, "\222\360\221"); + ASSERT (result == OR (NULL, input + 3)); + } + { + const char input[] = "f\360\221\222\360\221\222"; + const char *result = mbscasestr (input, "\222\360"); + ASSERT (result == OR (NULL, input + 3)); + } + { + const char input[] = "f\360\221\360\221"; + const char *result = mbscasestr (input, ""); + ASSERT (result == input); + } + { + const char input[] = "f\360\221\360\221"; + const char *result = mbscasestr (input, "\360\221"); + ASSERT (result == input + 1); + } + { + const char input[] = "f\360\221\360\221"; + const char *result = mbscasestr (input, "\221\360"); + ASSERT (result == OR (NULL, input + 2)); + } + { const char input[] = "\303\204BC \303\204BCD\303\204B \303\204BCD\303\204BCD\303\204BDE"; /* "??BC ??BCD??B ??BCD??BCD??BDE" */ const char *result = mbscasestr (input, "\303\244BCD\303\204BD"); /* "??BCD??BD" */ diff --git a/tests/test-mbscasestr3.c b/tests/test-mbscasestr3.c index bccd34deaa..496d2293d6 100644 --- a/tests/test-mbscasestr3.c +++ b/tests/test-mbscasestr3.c @@ -1,4 +1,4 @@ -/* Test of case-insensitive searching in a string. +/* Test of case-insensitive searching in a string in a GB18030 locale. Copyright (C) 2007-2026 Free Software Foundation, Inc. 
This program is free software: you can redistribute it and/or modify diff --git a/tests/test-mbscasestr4.c b/tests/test-mbscasestr4.c index 41ce6a91a1..7e8a026a05 100644 --- a/tests/test-mbscasestr4.c +++ b/tests/test-mbscasestr4.c @@ -1,4 +1,4 @@ -/* Test of case-insensitive searching in a string. +/* Test of case-insensitive searching in a string in a Turkish locale. Copyright (C) 2007-2026 Free Software Foundation, Inc. This program is free software: you can redistribute it and/or modify diff --git a/tests/test-mbsstr1.c b/tests/test-mbsstr1.c index d7a73486a3..bcdbb5c83c 100644 --- a/tests/test-mbsstr1.c +++ b/tests/test-mbsstr1.c @@ -1,4 +1,4 @@ -/* Test of searching in a string. +/* Test of searching in a string in the "C" locale. Copyright (C) 2007-2026 Free Software Foundation, Inc. This program is free software: you can redistribute it and/or modify diff --git a/tests/test-mbsstr2.c b/tests/test-mbsstr2.c index 93db31ef94..184abbe8f4 100644 --- a/tests/test-mbsstr2.c +++ b/tests/test-mbsstr2.c @@ -1,4 +1,4 @@ -/* Test of searching in a string. +/* Test of searching in a string in a UTF-8 locale. Copyright (C) 2007-2026 Free Software Foundation, Inc. This program is free software: you can redistribute it and/or modify @@ -25,6 +25,20 @@ #include "macros.h" +/* The mcel-based implementation of mbsnlen behaves differently than the + original one. Namely, for invalid/incomplete byte sequences: + Where we ideally should have multi-byte-per-encoding-error (MEE) behaviour + everywhere, mcel implements single-byte-per-encoding-error (SEE) behaviour. + See <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00131.html>, + <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00145.html>. + Therefore, here we have different expected results, depending on the + implementation. */ +#if GNULIB_MCEL_PREFER +# define OR(a,b) b +#else +# define OR(a,b) a +#endif + int main () { @@ -50,6 +64,74 @@ main () ASSERT (result == NULL); } + /* Incomplete characters. 
See + https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf + page 128 table 3-11. */ + + /* "\341\200\240" = 0xE1 0x80 0xA0 = U+1020. */ + { + const char input[] = "f\341\200\341\200"; + const char *result = mbsstr (input, ""); + ASSERT (result == input); + } + { + const char input[] = "f\341\200\341\200"; + const char *result = mbsstr (input, "\341\200"); + ASSERT (result == input + 1); + } + { + const char input[] = "f\341\200\341\200"; + const char *result = mbsstr (input, "\200\341"); + ASSERT (result == OR (NULL, input + 2)); + } + + /* "\360\221\222\240" = 0xF0 0x91 0x92 0xA0 = U+114A0. */ + { + const char input[] = "f\360\221\222\360\221\222"; + const char *result = mbsstr (input, ""); + ASSERT (result == input); + } + { + const char input[] = "f\360\221\222\360\221\222"; + const char *result = mbsstr (input, "\360\221\222"); + ASSERT (result == input + 1); + } + { + const char input[] = "f\360\221\222\360\221\222"; + const char *result = mbsstr (input, "\221\222\360\221"); + ASSERT (result == OR (NULL, input + 2)); + } + { + const char input[] = "f\360\221\222\360\221\222"; + const char *result = mbsstr (input, "\221\222\360"); + ASSERT (result == OR (NULL, input + 2)); + } + { + const char input[] = "f\360\221\222\360\221\222"; + const char *result = mbsstr (input, "\222\360\221"); + ASSERT (result == OR (NULL, input + 3)); + } + { + const char input[] = "f\360\221\222\360\221\222"; + const char *result = mbsstr (input, "\222\360"); + ASSERT (result == OR (NULL, input + 3)); + } + { + const char input[] = "f\360\221\360\221"; + const char *result = mbsstr (input, ""); + ASSERT (result == input); + } + { + const char input[] = "f\360\221\360\221"; + const char *result = mbsstr (input, "\360\221"); + ASSERT (result == input + 1); + } + { + const char input[] = "f\360\221\360\221"; + const char *result = mbsstr (input, "\221\360"); + ASSERT (result == OR (NULL, input + 2)); + } + { const char input[] = "\303\204BC \303\204BCD\303\204B 
\303\204BCD\303\204BCD\303\204BDE"; /* "??BC ??BCD??B ??BCD??BCD??BDE" */ const char *result = mbsstr (input, "\303\204BCD\303\204BD"); /* "??BCD??BD" */ diff --git a/tests/test-mbsstr3.c b/tests/test-mbsstr3.c index 71196150b6..66dc276a55 100644 --- a/tests/test-mbsstr3.c +++ b/tests/test-mbsstr3.c @@ -1,4 +1,4 @@ -/* Test of searching in a string. +/* Test of searching in a string in a GB18030 locale. Copyright (C) 2007-2026 Free Software Foundation, Inc. This program is free software: you can redistribute it and/or modify -- 2.54.0
From 1e7cbc30fd9fa8790583e7c24ac3ca2f46542bdf Mon Sep 17 00:00:00 2001
From: Bruno Haible <[email protected]>
Date: Tue, 26 May 2026 01:29:29 +0200
Subject: [PATCH 5/5] mbuiterf: Implement multi-byte per encoding error (MEE) consistently.

* lib/mbuiterf.h: Include mbiter-aux.h.
(struct mbuif_state): Add field is_utf8.
(mbuiterf_next): Invoke mbiter_is_utf8, mbiter_utf8_maximal_subpart.
(mbuif_init): Initialize the field is_utf8.
* modules/mbuiterf (Depends-on): Add mbiter-aux.
* tests/test-mbslen.c (OR): New macro, copied from
tests/test-mbsnlen.c.
(main): Add more test cases with incomplete characters.
* tests/test-mbschr2.sh: Renamed from tests/test-mbschr.sh.
* tests/test-mbschr2.c: Renamed from tests/test-mbschr.c.
* tests/test-mbschr1.sh: New file, based on
tests/test-mbmemcasecmp-3.sh.
* tests/test-mbschr1.c: New file.
* modules/mbschr-tests (Files): Update accordingly. Add locale-en.m4,
locale-fr.m4.
(configure.ac): Invoke gt_LOCALE_EN_UTF8, gt_LOCALE_FR_UTF8.
(Makefile.am): Arrange to compile test-mbschr1 and test-mbschr2 and to
run test-mbschr1.sh, test-mbschr2.sh.
* tests/test-mbsrchr2.sh: Renamed from tests/test-mbsrchr.sh.
* tests/test-mbsrchr2.c: Renamed from tests/test-mbsrchr.c.
* tests/test-mbsrchr1.sh: New file, based on
tests/test-mbmemcasecmp-3.sh.
* tests/test-mbsrchr1.c: New file.
* modules/mbsrchr-tests (Files): Update accordingly. Add locale-en.m4,
locale-fr.m4.
(configure.ac): Invoke gt_LOCALE_EN_UTF8, gt_LOCALE_FR_UTF8.
(Makefile.am): Arrange to compile test-mbsrchr1 and test-mbsrchr2 and to
run test-mbsrchr1.sh, test-mbsrchr2.sh.
* tests/test-mbscspn.c (OR): New macro, copied from
tests/test-mbsnlen.c.
(main): Add test cases with incomplete characters.
* tests/test-mbspbrk.c (OR): New macro, copied from
tests/test-mbsnlen.c.
(main): Add test cases with incomplete characters.
* tests/test-mbsspn.c (OR): New macro, copied from
tests/test-mbsnlen.c.
(main): Add test cases with incomplete characters.
--- ChangeLog | 41 ++++++++ lib/mbuiterf.h | 15 ++- modules/mbschr-tests | 22 ++-- modules/mbsrchr-tests | 22 ++-- modules/mbuiterf | 1 + tests/test-mbschr1.c | 107 ++++++++++++++++++++ tests/test-mbschr1.sh | 23 +++++ tests/{test-mbschr.c => test-mbschr2.c} | 2 +- tests/{test-mbschr.sh => test-mbschr2.sh} | 2 +- tests/test-mbscspn.c | 78 ++++++++++++++ tests/test-mbslen.c | 28 ++++- tests/test-mbspbrk.c | 78 ++++++++++++++ tests/test-mbsrchr1.c | 107 ++++++++++++++++++++ tests/test-mbsrchr1.sh | 23 +++++ tests/{test-mbsrchr.c => test-mbsrchr2.c} | 0 tests/{test-mbsrchr.sh => test-mbsrchr2.sh} | 2 +- tests/test-mbsspn.c | 58 +++++++++++ 17 files changed, 587 insertions(+), 22 deletions(-) create mode 100644 tests/test-mbschr1.c create mode 100755 tests/test-mbschr1.sh rename tests/{test-mbschr.c => test-mbschr2.c} (96%) rename tests/{test-mbschr.sh => test-mbschr2.sh} (90%) create mode 100644 tests/test-mbsrchr1.c create mode 100755 tests/test-mbsrchr1.sh rename tests/{test-mbsrchr.c => test-mbsrchr2.c} (100%) rename tests/{test-mbsrchr.sh => test-mbsrchr2.sh} (90%) diff --git a/ChangeLog b/ChangeLog index 7c6b25c471..3b9f26165d 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,44 @@ +2026-05-25 Bruno Haible <[email protected]> + + mbuiterf: Implement multi-byte per encoding error (MEE) consistently. + * lib/mbuiterf.h: Include mbiter-aux.h. + (struct mbuif_state): Add field is_utf8. + (mbuiterf_next): Invoke mbiter_is_utf8, mbiter_utf8_maximal_subpart. + (mbuif_init): Initialize the field is_utf8. + * modules/mbuiterf (Depends-on): Add mbiter-aux. + * tests/test-mbslen.c (OR): New macro, copied from + tests/test-mbsnlen.c. + (main): Add more test cases with incomplete characters. + * tests/test-mbschr2.sh: Renamed from tests/test-mbschr.sh. + * tests/test-mbschr2.c: Renamed from tests/test-mbschr.c. + * tests/test-mbschr1.sh: New file, based on + tests/test-mbmemcasecmp-3.sh. + * tests/test-mbschr1.c: New file. + * modules/mbschr-tests (Files): Update accordingly. 
Add locale-en.m4, + locale-fr.m4. + (configure.ac): Invoke gt_LOCALE_EN_UTF8, gt_LOCALE_FR_UTF8. + (Makefile.am): Arrange to compile test-mbschr1 and test-mbschr2 and to + run test-mbschr1.sh, test-mbschr2.sh. + * tests/test-mbsrchr2.sh: Renamed from tests/test-mbsrchr.sh. + * tests/test-mbsrchr2.c: Renamed from tests/test-mbsrchr.c. + * tests/test-mbsrchr1.sh: New file, based on + tests/test-mbmemcasecmp-3.sh. + * tests/test-mbsrchr1.c: New file. + * modules/mbsrchr-tests (Files): Update accordingly. Add locale-en.m4, + locale-fr.m4. + (configure.ac): Invoke gt_LOCALE_EN_UTF8, gt_LOCALE_FR_UTF8. + (Makefile.am): Arrange to compile test-mbsrchr1 and test-mbsrchr2 and to + run test-mbsrchr1.sh, test-mbsrchr2.sh. + * tests/test-mbscspn.c (OR): New macro, copied from + tests/test-mbsnlen.c. + (main): Add test cases with incomplete characters. + * tests/test-mbspbrk.c (OR): New macro, copied from + tests/test-mbsnlen.c. + (main): Add test cases with incomplete characters. + * tests/test-mbsspn.c (OR): New macro, copied from + tests/test-mbsnlen.c. + (main): Add test cases with incomplete characters. + 2026-05-25 Bruno Haible <[email protected]> mbuiter: Implement multi-byte per encoding error (MEE) consistently. diff --git a/lib/mbuiterf.h b/lib/mbuiterf.h index f8cb0f9595..19761a88a4 100644 --- a/lib/mbuiterf.h +++ b/lib/mbuiterf.h @@ -94,6 +94,7 @@ #include <wchar.h> #include "mbchar.h" +#include "mbiter-aux.h" #include "strnlen1.h" _GL_INLINE_HEADER_BEGIN @@ -118,6 +119,7 @@ struct mbuif_state before and after every mbuiterf_next invocation. */ unsigned int cur_max; /* A cache of MB_CUR_MAX. */ + int is_utf8; /* A cache of mbiter_is_utf8. 
*/ }; MBUITERF_INLINE mbchar_t @@ -145,18 +147,23 @@ mbuiterf_next (struct mbuif_state *ps, const char *iter) ps->in_shift = true; with_shift:; #endif + size_t avail_bytes = strnlen1 (iter, ps->cur_max); size_t bytes; char32_t wc; - bytes = mbrtoc32 (&wc, iter, strnlen1 (iter, ps->cur_max), &ps->state); + bytes = mbrtoc32 (&wc, iter, avail_bytes, &ps->state); if (bytes == (size_t) -1) { /* An invalid multibyte sequence was encountered. */ + size_t ebytes = + (mbiter_is_utf8 (&ps->is_utf8) + ? mbiter_utf8_maximal_subpart (iter, avail_bytes) + : 1); /* Allow the next invocation to continue from a sane state. */ #if !GNULIB_MBRTOC32_REGULAR ps->in_shift = false; #endif mbszero (&ps->state); - return (mbchar_t) { .ptr = iter, .bytes = 1, .wc_valid = false }; + return (mbchar_t) { .ptr = iter, .bytes = ebytes, .wc_valid = false }; } else if (bytes == (size_t) -2) { @@ -197,12 +204,12 @@ typedef struct mbuif_state mbuif_state_t; #if !GNULIB_MBRTOC32_REGULAR #define mbuif_init(st) \ ((st).in_shift = false, mbszero (&(st).state), \ - (st).cur_max = MB_CUR_MAX) + (st).cur_max = MB_CUR_MAX, (st).is_utf8 = -1) #else /* Optimized: no in_shift. 
*/ #define mbuif_init(st) \ (mbszero (&(st).state), \ - (st).cur_max = MB_CUR_MAX) + (st).cur_max = MB_CUR_MAX, (st).is_utf8 = -1) #endif #if !GNULIB_MBRTOC32_REGULAR #define mbuif_avail(st, iter) ((st).in_shift || (*(iter) != '\0')) diff --git a/modules/mbschr-tests b/modules/mbschr-tests index ef26e73363..fb879f2baa 100644 --- a/modules/mbschr-tests +++ b/modules/mbschr-tests @@ -1,7 +1,11 @@ Files: -tests/test-mbschr.sh -tests/test-mbschr.c +tests/test-mbschr1.sh +tests/test-mbschr1.c +tests/test-mbschr2.sh +tests/test-mbschr2.c tests/macros.h +m4/locale-en.m4 +m4/locale-fr.m4 m4/locale-zh.m4 m4/codeset.m4 @@ -9,10 +13,16 @@ Depends-on: setlocale configure.ac: +gt_LOCALE_EN_UTF8 +gt_LOCALE_FR_UTF8 gt_LOCALE_ZH_CN Makefile.am: -TESTS += test-mbschr.sh -TESTS_ENVIRONMENT += LOCALE_ZH_CN='@LOCALE_ZH_CN@' -check_PROGRAMS += test-mbschr -test_mbschr_LDADD = $(LDADD) $(LIBUNISTRING) $(SETLOCALE_LIB) $(MBRTOWC_LIB) $(LIBC32CONV) +TESTS += test-mbschr1.sh test-mbschr2.sh +TESTS_ENVIRONMENT += \ + LOCALE_EN_UTF8='@LOCALE_EN_UTF8@' \ + LOCALE_FR_UTF8='@LOCALE_FR_UTF8@' \ + LOCALE_ZH_CN='@LOCALE_ZH_CN@' +check_PROGRAMS += test-mbschr1 test-mbschr2 +test_mbschr1_LDADD = $(LDADD) $(LIBUNISTRING) $(SETLOCALE_LIB) $(MBRTOWC_LIB) $(LIBC32CONV) +test_mbschr2_LDADD = $(LDADD) $(LIBUNISTRING) $(SETLOCALE_LIB) $(MBRTOWC_LIB) $(LIBC32CONV) diff --git a/modules/mbsrchr-tests b/modules/mbsrchr-tests index dba1470789..07243ca86f 100644 --- a/modules/mbsrchr-tests +++ b/modules/mbsrchr-tests @@ -1,7 +1,11 @@ Files: -tests/test-mbsrchr.sh -tests/test-mbsrchr.c +tests/test-mbsrchr1.sh +tests/test-mbsrchr1.c +tests/test-mbsrchr2.sh +tests/test-mbsrchr2.c tests/macros.h +m4/locale-en.m4 +m4/locale-fr.m4 m4/locale-zh.m4 m4/codeset.m4 @@ -9,10 +13,16 @@ Depends-on: setlocale configure.ac: +gt_LOCALE_EN_UTF8 +gt_LOCALE_FR_UTF8 gt_LOCALE_ZH_CN Makefile.am: -TESTS += test-mbsrchr.sh -TESTS_ENVIRONMENT += LOCALE_ZH_CN='@LOCALE_ZH_CN@' -check_PROGRAMS += test-mbsrchr -test_mbsrchr_LDADD = $(LDADD) 
$(LIBUNISTRING) $(SETLOCALE_LIB) $(MBRTOWC_LIB) $(LIBC32CONV) +TESTS += test-mbsrchr1.sh test-mbsrchr2.sh +TESTS_ENVIRONMENT += \ + LOCALE_EN_UTF8='@LOCALE_EN_UTF8@' \ + LOCALE_FR_UTF8='@LOCALE_FR_UTF8@' \ + LOCALE_ZH_CN='@LOCALE_ZH_CN@' +check_PROGRAMS += test-mbsrchr1 test-mbsrchr2 +test_mbsrchr1_LDADD = $(LDADD) $(LIBUNISTRING) $(SETLOCALE_LIB) $(MBRTOWC_LIB) $(LIBC32CONV) +test_mbsrchr2_LDADD = $(LDADD) $(LIBUNISTRING) $(SETLOCALE_LIB) $(MBRTOWC_LIB) $(LIBC32CONV) diff --git a/modules/mbuiterf b/modules/mbuiterf index e5e22f9d09..d93cc8fa73 100644 --- a/modules/mbuiterf +++ b/modules/mbuiterf @@ -13,6 +13,7 @@ mbchar mbrtoc32 mbsinit mbszero +mbiter-aux uchar-h bool strnlen1 diff --git a/tests/test-mbschr1.c b/tests/test-mbschr1.c new file mode 100644 index 0000000000..1000491e7c --- /dev/null +++ b/tests/test-mbschr1.c @@ -0,0 +1,107 @@ +/* Test of searching a string for a character in a UTF-8 locale. + Copyright (C) 2026 Free Software Foundation, Inc. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <https://www.gnu.org/licenses/>. */ + +/* Written by Bruno Haible <[email protected]>, 2026. */ + +#include <config.h> + +#include <string.h> + +#include <locale.h> + +#include "macros.h" + +/* The mcel-based implementation of mbsnlen behaves differently than the + original one. 
Namely, for invalid/incomplete byte sequences: + Where we ideally should have multi-byte-per-encoding-error (MEE) behaviour + everywhere, mcel implements single-byte-per-encoding-error (SEE) behaviour. + See <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00131.html>, + <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00145.html>. + Therefore, here we have different expected results, depending on the + implementation. */ +#if GNULIB_MCEL_PREFER +# define OR(a,b) b +#else +# define OR(a,b) a +#endif + +int +main () +{ + /* configure should already have checked that the locale is supported. */ + if (setlocale (LC_ALL, "") == NULL) + return 1; + + /* Incomplete characters. See + https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf + page 128 table 3-11. */ + + /* "\341\200\240" = 0xE1 0x80 0xA0 = U+1020. */ + { + const char input[] = "\341\200"; + const char *result = mbschr (input, '\341'); + ASSERT (result == OR (NULL, input + 0)); + } + { + const char input[] = "\341\200"; + const char *result = mbschr (input, '\200'); + ASSERT (result == OR (NULL, input + 1)); + } + { + const char input[] = "\341\200\341"; + const char *result = mbschr (input, '\341'); + ASSERT (result == input + OR(2,0)); + } + + /* "\360\221\222\240" = 0xF0 0x91 0x92 0xA0 = U+114A0. 
*/ + { + const char input[] = "\360\221\222"; + const char *result = mbschr (input, '\360'); + ASSERT (result == OR (NULL, input + 0)); + } + { + const char input[] = "\360\221\222"; + const char *result = mbschr (input, '\221'); + ASSERT (result == OR (NULL, input + 1)); + } + { + const char input[] = "\360\221\222"; + const char *result = mbschr (input, '\222'); + ASSERT (result == OR (NULL, input + 2)); + } + { + const char input[] = "\360\221\222\360"; + const char *result = mbschr (input, '\360'); + ASSERT (result == input + OR(3,0)); + } + { + const char input[] = "\360\221"; + const char *result = mbschr (input, '\360'); + ASSERT (result == OR (NULL, input + 0)); + } + { + const char input[] = "\360\221"; + const char *result = mbschr (input, '\221'); + ASSERT (result == OR (NULL, input + 1)); + } + { + const char input[] = "\360\221\360"; + const char *result = mbschr (input, '\360'); + ASSERT (result == input + OR(2,0)); + } + + return test_exit_status; +} diff --git a/tests/test-mbschr1.sh b/tests/test-mbschr1.sh new file mode 100755 index 0000000000..48e258d63e --- /dev/null +++ b/tests/test-mbschr1.sh @@ -0,0 +1,23 @@ +#!/bin/sh + +# Test whether a specific UTF-8 locale is installed. +: "${LOCALE_EN_UTF8=en_US.UTF-8}" +: "${LOCALE_FR_UTF8=fr_FR.UTF-8}" +if test "$LOCALE_EN_UTF8" = none && test $LOCALE_FR_UTF8 = none; then + if test -f /usr/bin/localedef; then + echo "Skipping test: no english or french Unicode locale is installed" + else + echo "Skipping test: no english or french Unicode locale is supported" + fi + exit 77 +fi + +# It's sufficient to test in one of the two locales. 
+if test $LOCALE_FR_UTF8 != none; then + testlocale=$LOCALE_FR_UTF8 +else + testlocale="$LOCALE_EN_UTF8" +fi + +LC_ALL="$testlocale" \ +${CHECKER} ./test-mbschr1${EXEEXT} diff --git a/tests/test-mbschr.c b/tests/test-mbschr2.c similarity index 96% rename from tests/test-mbschr.c rename to tests/test-mbschr2.c index f7678eb41b..5eae208c97 100644 --- a/tests/test-mbschr.c +++ b/tests/test-mbschr2.c @@ -1,4 +1,4 @@ -/* Test of searching a string for a character. +/* Test of searching a string for a character in a GB18030 locale. Copyright (C) 2007-2026 Free Software Foundation, Inc. This program is free software: you can redistribute it and/or modify diff --git a/tests/test-mbschr.sh b/tests/test-mbschr2.sh similarity index 90% rename from tests/test-mbschr.sh rename to tests/test-mbschr2.sh index 7e62b3f08a..c75973c362 100755 --- a/tests/test-mbschr.sh +++ b/tests/test-mbschr2.sh @@ -12,4 +12,4 @@ if test $LOCALE_ZH_CN = none; then fi LC_ALL=$LOCALE_ZH_CN \ -${CHECKER} ./test-mbschr${EXEEXT} +${CHECKER} ./test-mbschr2${EXEEXT} diff --git a/tests/test-mbscspn.c b/tests/test-mbscspn.c index 0fa513748f..5cc248711d 100644 --- a/tests/test-mbscspn.c +++ b/tests/test-mbscspn.c @@ -24,6 +24,20 @@ #include "macros.h" +/* The mcel-based implementation of mbsnlen behaves differently than the + original one. Namely, for invalid/incomplete byte sequences: + Where we ideally should have multi-byte-per-encoding-error (MEE) behaviour + everywhere, mcel implements single-byte-per-encoding-error (SEE) behaviour. + See <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00131.html>, + <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00145.html>. + Therefore, here we have different expected results, depending on the + implementation. */ +#if GNULIB_MCEL_PREFER +# define OR(a,b) b +#else +# define OR(a,b) a +#endif + int main () { @@ -57,5 +71,69 @@ main () ASSERT (mbscspn (input, "\303") == 14); /* invalid multibyte sequence */ } + /* Incomplete characters. 
See + https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf + page 128 table 3-11. */ + + /* "\341\200\240" = 0xE1 0x80 0xA0 = U+1020. */ + { + const char input[] = "\341\200\240x\341\200y"; + ASSERT (mbscspn (input, "\341\200") == 4); + } + { + const char input[] = "\341\200\240x\341\200"; + ASSERT (mbscspn (input, "\341\200") == 4); + } + { + const char input[] = "\341\200\240x\341\200"; + ASSERT (mbscspn (input, "\341") == OR(6,4)); + } + { + const char input[] = "\341\200\240x\341y"; + ASSERT (mbscspn (input, "\341") == 4); + } + { + const char input[] = "\341\200\240x\341"; + ASSERT (mbscspn (input, "\341") == 4); + } + + /* "\360\221\222\240" = 0xF0 0x91 0x92 0xA0 = U+114A0. */ + { + const char input[] = "\360\221\222\240x\360\221\222y"; + ASSERT (mbscspn (input, "\360\221\222") == 5); + } + { + const char input[] = "\360\221\222\240x\360\221\222"; + ASSERT (mbscspn (input, "\360\221\222") == 5); + } + { + const char input[] = "\360\221\222\240x\360\221\222"; + ASSERT (mbscspn (input, "\360\221") == OR(8,5)); + } + { + const char input[] = "\360\221\222\240x\360\221y"; + ASSERT (mbscspn (input, "\360\221") == 5); + } + { + const char input[] = "\360\221\222\240x\360\221"; + ASSERT (mbscspn (input, "\360\221") == 5); + } + { + const char input[] = "\360\221\222\240x\360\221\222"; + ASSERT (mbscspn (input, "\360") == OR(8,5)); + } + { + const char input[] = "\360\221\222\240x\360\221"; + ASSERT (mbscspn (input, "\360") == OR(7,5)); + } + { + const char input[] = "\360\221\222\240x\360y"; + ASSERT (mbscspn (input, "\360") == 5); + } + { + const char input[] = "\360\221\222\240x\360"; + ASSERT (mbscspn (input, "\360") == 5); + } + return test_exit_status; } diff --git a/tests/test-mbslen.c b/tests/test-mbslen.c index b32a74a296..9cf8673579 100644 --- a/tests/test-mbslen.c +++ b/tests/test-mbslen.c @@ -24,6 +24,20 @@ #include "macros.h" +/* The mcel-based implementation of mbsnlen behaves differently than the + original one. 
Namely, for invalid/incomplete byte sequences: + Where we ideally should have multi-byte-per-encoding-error (MEE) behaviour + everywhere, mcel implements single-byte-per-encoding-error (SEE) behaviour. + See <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00131.html>, + <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00145.html>. + Therefore, here we have different expected results, depending on the + implementation. */ +#if GNULIB_MCEL_PREFER +# define OR(a,b) b +#else +# define OR(a,b) a +#endif + int main () { @@ -39,9 +53,17 @@ main () ASSERT (mbslen ("7\342\202\254") == 2); /* "7???" */ ASSERT (mbslen ("\360\237\220\203") == 1); /* "????" */ - ASSERT (mbslen ("\303") == 1); /* invalid multibyte sequence */ - ASSERT (mbslen ("\342\202") == 2); /* 2x invalid multibyte sequence */ - ASSERT (mbslen ("\360\237\220") == 3); /* 3x invalid multibyte sequence */ + /* Incomplete characters. See + https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf + page 128 table 3-11. */ + ASSERT (mbslen ("\303") == 1); + /* "\341\200\240" = 0xE1 0x80 0xA0 = U+1020. */ + ASSERT (mbslen ("\341\200") == OR(1,2)); + ASSERT (mbslen ("\341") == 1); + /* "\360\221\222\240" = 0xF0 0x91 0x92 0xA0 = U+114A0. */ + ASSERT (mbslen ("\360\221\222") == OR(1,3)); + ASSERT (mbslen ("\360\221") == OR(1,2)); + ASSERT (mbslen ("\360") == 1); return test_exit_status; } diff --git a/tests/test-mbspbrk.c b/tests/test-mbspbrk.c index ce396eba18..a0f86d3652 100644 --- a/tests/test-mbspbrk.c +++ b/tests/test-mbspbrk.c @@ -24,6 +24,20 @@ #include "macros.h" +/* The mcel-based implementation of mbsnlen behaves differently than the + original one. Namely, for invalid/incomplete byte sequences: + Where we ideally should have multi-byte-per-encoding-error (MEE) behaviour + everywhere, mcel implements single-byte-per-encoding-error (SEE) behaviour. 
+ See <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00131.html>, + <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00145.html>. + Therefore, here we have different expected results, depending on the + implementation. */ +#if GNULIB_MCEL_PREFER +# define OR(a,b) b +#else +# define OR(a,b) a +#endif + int main () { @@ -51,5 +65,69 @@ main () ASSERT (mbspbrk (input, "\303") == NULL); /* invalid multibyte sequence */ } + /* Incomplete characters. See + https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf + page 128 table 3-11. */ + + /* "\341\200\240" = 0xE1 0x80 0xA0 = U+1020. */ + { + const char input[] = "\341\200\240x\341\200y"; + ASSERT (mbspbrk (input, "\341\200") == input + 4); + } + { + const char input[] = "\341\200\240x\341\200"; + ASSERT (mbspbrk (input, "\341\200") == input + 4); + } + { + const char input[] = "\341\200\240x\341\200"; + ASSERT (mbspbrk (input, "\341") == OR (NULL, input + 4)); + } + { + const char input[] = "\341\200\240x\341y"; + ASSERT (mbspbrk (input, "\341") == input + 4); + } + { + const char input[] = "\341\200\240x\341"; + ASSERT (mbspbrk (input, "\341") == input + 4); + } + + /* "\360\221\222\240" = 0xF0 0x91 0x92 0xA0 = U+114A0. 
*/ + { + const char input[] = "\360\221\222\240x\360\221\222y"; + ASSERT (mbspbrk (input, "\360\221\222") == input + 5); + } + { + const char input[] = "\360\221\222\240x\360\221\222"; + ASSERT (mbspbrk (input, "\360\221\222") == input + 5); + } + { + const char input[] = "\360\221\222\240x\360\221\222"; + ASSERT (mbspbrk (input, "\360\221") == OR (NULL, input + 5)); + } + { + const char input[] = "\360\221\222\240x\360\221y"; + ASSERT (mbspbrk (input, "\360\221") == input + 5); + } + { + const char input[] = "\360\221\222\240x\360\221"; + ASSERT (mbspbrk (input, "\360\221") == input + 5); + } + { + const char input[] = "\360\221\222\240x\360\221\222"; + ASSERT (mbspbrk (input, "\360") == OR (NULL, input + 5)); + } + { + const char input[] = "\360\221\222\240x\360\221"; + ASSERT (mbspbrk (input, "\360") == OR (NULL, input + 5)); + } + { + const char input[] = "\360\221\222\240x\360y"; + ASSERT (mbspbrk (input, "\360") == input + 5); + } + { + const char input[] = "\360\221\222\240x\360"; + ASSERT (mbspbrk (input, "\360") == input + 5); + } + return test_exit_status; } diff --git a/tests/test-mbsrchr1.c b/tests/test-mbsrchr1.c new file mode 100644 index 0000000000..91c5e734f6 --- /dev/null +++ b/tests/test-mbsrchr1.c @@ -0,0 +1,107 @@ +/* Test of searching a string for the last occurrence of a character. + Copyright (C) 2026 Free Software Foundation, Inc. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. 
If not, see <https://www.gnu.org/licenses/>. */ + +/* Written by Bruno Haible <[email protected]>, 2026. */ + +#include <config.h> + +#include <string.h> + +#include <locale.h> + +#include "macros.h" + +/* The mcel-based implementation of mbsnlen behaves differently than the + original one. Namely, for invalid/incomplete byte sequences: + Where we ideally should have multi-byte-per-encoding-error (MEE) behaviour + everywhere, mcel implements single-byte-per-encoding-error (SEE) behaviour. + See <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00131.html>, + <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00145.html>. + Therefore, here we have different expected results, depending on the + implementation. */ +#if GNULIB_MCEL_PREFER +# define OR(a,b) b +#else +# define OR(a,b) a +#endif + +int +main () +{ + /* configure should already have checked that the locale is supported. */ + if (setlocale (LC_ALL, "") == NULL) + return 1; + + /* Incomplete characters. See + https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf + page 128 table 3-11. */ + + /* "\341\200\240" = 0xE1 0x80 0xA0 = U+1020. */ + { + const char input[] = "\341\200"; + const char *result = mbsrchr (input, '\341'); + ASSERT (result == OR (NULL, input + 0)); + } + { + const char input[] = "\341\200"; + const char *result = mbsrchr (input, '\200'); + ASSERT (result == OR (NULL, input + 1)); + } + { + const char input[] = "\341\200\341"; + const char *result = mbsrchr (input, '\341'); + ASSERT (result == input + 2); + } + + /* "\360\221\222\240" = 0xF0 0x91 0x92 0xA0 = U+114A0. 
*/ + { + const char input[] = "\360\221\222"; + const char *result = mbsrchr (input, '\360'); + ASSERT (result == OR (NULL, input + 0)); + } + { + const char input[] = "\360\221\222"; + const char *result = mbsrchr (input, '\221'); + ASSERT (result == OR (NULL, input + 1)); + } + { + const char input[] = "\360\221\222"; + const char *result = mbsrchr (input, '\222'); + ASSERT (result == OR (NULL, input + 2)); + } + { + const char input[] = "\360\221\222\360"; + const char *result = mbsrchr (input, '\360'); + ASSERT (result == input + 3); + } + { + const char input[] = "\360\221"; + const char *result = mbsrchr (input, '\360'); + ASSERT (result == OR (NULL, input + 0)); + } + { + const char input[] = "\360\221"; + const char *result = mbsrchr (input, '\221'); + ASSERT (result == OR (NULL, input + 1)); + } + { + const char input[] = "\360\221\360"; + const char *result = mbsrchr (input, '\360'); + ASSERT (result == input + 2); + } + + return test_exit_status; +} diff --git a/tests/test-mbsrchr1.sh b/tests/test-mbsrchr1.sh new file mode 100755 index 0000000000..ce0d000437 --- /dev/null +++ b/tests/test-mbsrchr1.sh @@ -0,0 +1,23 @@ +#!/bin/sh + +# Test whether a specific UTF-8 locale is installed. +: "${LOCALE_EN_UTF8=en_US.UTF-8}" +: "${LOCALE_FR_UTF8=fr_FR.UTF-8}" +if test "$LOCALE_EN_UTF8" = none && test $LOCALE_FR_UTF8 = none; then + if test -f /usr/bin/localedef; then + echo "Skipping test: no english or french Unicode locale is installed" + else + echo "Skipping test: no english or french Unicode locale is supported" + fi + exit 77 +fi + +# It's sufficient to test in one of the two locales. 
+if test $LOCALE_FR_UTF8 != none; then + testlocale=$LOCALE_FR_UTF8 +else + testlocale="$LOCALE_EN_UTF8" +fi + +LC_ALL="$testlocale" \ +${CHECKER} ./test-mbsrchr1${EXEEXT} diff --git a/tests/test-mbsrchr.c b/tests/test-mbsrchr2.c similarity index 100% rename from tests/test-mbsrchr.c rename to tests/test-mbsrchr2.c diff --git a/tests/test-mbsrchr.sh b/tests/test-mbsrchr2.sh similarity index 90% rename from tests/test-mbsrchr.sh rename to tests/test-mbsrchr2.sh index 84c40b7bf8..cce61decc6 100755 --- a/tests/test-mbsrchr.sh +++ b/tests/test-mbsrchr2.sh @@ -12,4 +12,4 @@ if test $LOCALE_ZH_CN = none; then fi LC_ALL=$LOCALE_ZH_CN \ -${CHECKER} ./test-mbsrchr${EXEEXT} +${CHECKER} ./test-mbsrchr2${EXEEXT} diff --git a/tests/test-mbsspn.c b/tests/test-mbsspn.c index cce1d08dce..d2edeaa89a 100644 --- a/tests/test-mbsspn.c +++ b/tests/test-mbsspn.c @@ -24,6 +24,20 @@ #include "macros.h" +/* The mcel-based implementation of mbsnlen behaves differently than the + original one. Namely, for invalid/incomplete byte sequences: + Where we ideally should have multi-byte-per-encoding-error (MEE) behaviour + everywhere, mcel implements single-byte-per-encoding-error (SEE) behaviour. + See <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00131.html>, + <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00145.html>. + Therefore, here we have different expected results, depending on the + implementation. */ +#if GNULIB_MCEL_PREFER +# define OR(a,b) b +#else +# define OR(a,b) a +#endif + int main () { @@ -53,5 +67,49 @@ main () ASSERT (mbsspn (input, "\303") == 0); /* invalid multibyte sequence */ } + /* Incomplete characters. See + https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf + page 128 table 3-11. */ + + /* "\341\200\240" = 0xE1 0x80 0xA0 = U+1020. 
*/ + { + const char input[] = "\341\200\341\200\240"; + ASSERT (mbsspn (input, "\341\200") == 2); + } + { + const char input[] = "\341\200\341\200\240"; + ASSERT (mbsspn (input, "\341") == OR(0,1)); + } + { + const char input[] = "\341\341\200\240"; + ASSERT (mbsspn (input, "\341") == 1); + } + + /* "\360\221\222\240" = 0xF0 0x91 0x92 0xA0 = U+114A0. */ + { + const char input[] = "\360\221\222\360\221\222\240"; + ASSERT (mbsspn (input, "\360\221\222") == 3); + } + { + const char input[] = "\360\221\222\360\221\222\240"; + ASSERT (mbsspn (input, "\360\221") == OR(0,2)); + } + { + const char input[] = "\360\221\360\221\222\240"; + ASSERT (mbsspn (input, "\360\221") == 2); + } + { + const char input[] = "\360\221\222\360\221\222\240"; + ASSERT (mbsspn (input, "\360") == OR(0,1)); + } + { + const char input[] = "\360\221\360\221\222\240"; + ASSERT (mbsspn (input, "\360") == OR(0,1)); + } + { + const char input[] = "\360\360\221\222\240"; + ASSERT (mbsspn (input, "\360") == 1); + } + return test_exit_status; } -- 2.54.0
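For readers who want to see what "maximal subpart" means in terms of bytes, here is a minimal standalone sketch. It is not the code in lib/mbiter-aux.h; the names utf8_unit_len and utf8_count_units_mee are hypothetical and exist only for this illustration. It assumes the byte ranges of the Unicode "Well-Formed UTF-8 Byte Sequences" table, and it treats an incomplete sequence at the end of the buffer the same way as one in the middle, which is a simplification compared to the separate (size_t)(-2) path in the iterators.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only: return the number of bytes
   that form one unit at P, given N available bytes (N > 0).  A unit is
   either a well-formed UTF-8 sequence or, for ill-formed input, the
   maximal subpart: the longest prefix of a well-formed sequence, or a
   single byte if there is no such prefix.  */
static size_t
utf8_unit_len (const unsigned char *p, size_t n)
{
  unsigned char c = p[0];
  unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of the 2nd byte */
  size_t need;                         /* number of trailing bytes */

  if (c >= 0xC2 && c <= 0xDF)
    need = 1;
  else if (c >= 0xE0 && c <= 0xEF)
    {
      need = 2;
      if (c == 0xE0)
        lo = 0xA0;   /* exclude overlong encodings */
      else if (c == 0xED)
        hi = 0x9F;   /* exclude surrogates */
    }
  else if (c >= 0xF0 && c <= 0xF4)
    {
      need = 3;
      if (c == 0xF0)
        lo = 0x90;   /* exclude overlong encodings */
      else if (c == 0xF4)
        hi = 0x8F;   /* exclude values above U+10FFFF */
    }
  else
    /* ASCII, a stray trailing byte, or a byte that never starts a
       well-formed sequence: one byte in each case.  */
    return 1;

  size_t len = 1;
  while (len <= need && len < n)
    {
      if (p[len] < lo || p[len] > hi)
        break;
      len++;
      /* Only the second byte has a restricted range.  */
      lo = 0x80;
      hi = 0xBF;
    }
  return len;
}

/* Hypothetical helper: count units in the N bytes at S, with one unit
   per character and one unit per encoding error (MEE).  */
static size_t
utf8_count_units_mee (const char *s, size_t n)
{
  const unsigned char *p = (const unsigned char *) s;
  size_t count = 0;
  while (n > 0)
    {
      size_t len = utf8_unit_len (p, n);
      p += len;
      n -= len;
      count++;
    }
  return count;
}
```

With this sketch, the two bytes "\341\200" count as one unit and the three bytes "\360\221\222" count as one unit, which corresponds to the first alternative of the OR(...) expectations in tests/test-mbslen.c. The 13-byte example of table 3-11 yields 10 units.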
