mbiter: Implement multi-byte per encoding error (MEE) consistently

Bruno Haible via Gnulib discussion list Mon, 25 May 2026 16:43:45 -0700

Yesterday I wrote:
> Now I got interested in
>   - whether the mb*iter* modules actually implement MEE,
>   - what's the behavioural difference between MEE and SEE, function by
>     function.


It turns out that the mb*iter* modules, so far, implement MEE
for incomplete multibyte characters at the end of the string only.
(That is the case where mbrtoc32 returns (size_t)(-2).)

For incomplete multibyte characters inside a string (that is the case
where mbrtoc32 returns (size_t)(-1)), these modules still implement SEE.
Ouch.
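
To make the two cases concrete, here is a minimal sketch of how many bytes
the iterators consumed per step before this patch series. The helper name
old_error_unit_bytes is hypothetical and not part of Gnulib; it only mirrors
the two error branches visible in the old mbiter_multi_next code, and it
ignores special return values such as 0 and (size_t)(-3).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Gnulib.
   RET is the value returned by mbrtoc32 (&wc, s, avail, &state);
   AVAIL is the number of bytes that were available at S.
   Returns the number of bytes the pre-patch iterators consumed.  */
size_t
old_error_unit_bytes (size_t ret, size_t avail)
{
  assert (avail >= 1);
  if (ret == (size_t) -1)
    /* Invalid sequence inside the string: one byte per error (SEE).  */
    return 1;
  if (ret == (size_t) -2)
    /* Incomplete character at the end: all remaining bytes (MEE).  */
    return avail;
  /* A valid character: mbrtoc32 already reported its length.  */
  return ret;
}
```

With this series, the (size_t)(-1) branch no longer always yields 1 in a
UTF-8 locale; it yields the length of the maximal subpart instead.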

The effect is visible in several mbs* functions:
  mbslen, mbsnlen
  mbschr, mbsrchr
  mbscspn, mbspbrk, mbsspn
  mbsstr, mbscasestr
  mbs_startswith, mbs_endswith.

This series of patches implements MEE also for the case of
incomplete multibyte characters inside a string, as shown
in https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf
page 128 table 3-11.
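
As an illustration of what "one error per maximal subpart" means, the
following standalone sketch counts characters plus encoding errors in a
UTF-8 byte string, either with MEE or with SEE. It does not use mbrtoc32 or
the mb*iter* modules; the function names utf8_valid_len, maximal_subpart and
count_units are made up for this example, and the byte-range checks are
modelled on mbiter_utf8_maximal_subpart from patch 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if B is a UTF-8 continuation byte (0x80..0xBF).  */
static bool
is_cont (unsigned char b)
{
  return (b ^ 0x80) < 0x40;
}

/* Length of the well-formed UTF-8 sequence at the start of S[0..N-1],
   or 0 if S does not start with one.  */
size_t
utf8_valid_len (const unsigned char *s, size_t n)
{
  assert (n >= 1);
  unsigned char c = s[0];
  if (c < 0x80)
    return 1;
  if (c >= 0xc2 && c < 0xe0)
    return (n >= 2 && is_cont (s[1])) ? 2 : 0;
  if (c >= 0xe0 && c < 0xf0)
    return (n >= 3 && is_cont (s[1]) && is_cont (s[2])
            && (c >= 0xe1 || s[1] >= 0xa0)
            && (c != 0xed || s[1] < 0xa0)) ? 3 : 0;
  if (c >= 0xf0 && c <= 0xf4)
    return (n >= 4 && is_cont (s[1]) && is_cont (s[2]) && is_cont (s[3])
            && (c >= 0xf1 || s[1] >= 0x90)
            && (c < 0xf4 || s[1] < 0x90)) ? 4 : 0;
  return 0;
}

/* Length of the maximal subpart of the ill-formed sequence at S.
   The result is >= 1, <= N.  */
size_t
maximal_subpart (const unsigned char *s, size_t n)
{
  unsigned char c = s[0];
  if (n >= 2 && c >= 0xe0 && c <= 0xf4 && is_cont (s[1]))
    {
      unsigned char c2 = s[1];
      if (c < 0xf0)
        {
          if ((c >= 0xe1 || c2 >= 0xa0) && (c != 0xed || c2 < 0xa0))
            return 2;
        }
      else if ((c >= 0xf1 || c2 >= 0x90) && (c < 0xf4 || c2 < 0x90))
        return (n >= 3 && is_cont (s[2])) ? 3 : 2;
    }
  return 1;
}

/* Counts the units (characters plus encoding errors) in STR[0..N-1].
   With MEE, each maximal subpart is one error; with SEE, each byte is.  */
size_t
count_units (const char *str, size_t n, bool mee)
{
  const unsigned char *s = (const unsigned char *) str;
  size_t count = 0;
  for (size_t i = 0; i < n; count++)
    {
      size_t len = utf8_valid_len (s + i, n - i);
      if (len == 0)
        len = mee ? maximal_subpart (s + i, n - i) : 1;
      i += len;
    }
  return count;
}
```

For the 13-byte sequence 61 F1 80 80 E1 80 C2 62 80 63 80 BF 64 from
table 3-11, count_units gives 10 with MEE and 13 with SEE. For
"\342\202\342\202" it gives 2 with MEE and 4 with SEE, matching the
OR(1,2) expectations in tests/test-mbsnlen.c.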


2026-05-25  Bruno Haible  <[email protected]>

        mbuiterf: Implement multi-byte per encoding error (MEE) consistently.
        * lib/mbuiterf.h: Include mbiter-aux.h.
        (struct mbuif_state): Add field is_utf8.
        (mbuiterf_next): Invoke mbiter_is_utf8, mbiter_utf8_maximal_subpart.
        (mbuif_init): Initialize the field is_utf8.
        * modules/mbuiterf (Depends-on): Add mbiter-aux.
        * tests/test-mbslen.c (OR): New macro, copied from
        tests/test-mbsnlen.c.
        (main): Add more test cases with incomplete characters.
        * tests/test-mbschr2.sh: Renamed from tests/test-mbschr.sh.
        * tests/test-mbschr2.c: Renamed from tests/test-mbschr.c.
        * tests/test-mbschr1.sh: New file, based on
        tests/test-mbmemcasecmp-3.sh.
        * tests/test-mbschr1.c: New file.
        * modules/mbschr-tests (Files): Update accordingly. Add locale-en.m4,
        locale-fr.m4.
        (configure.ac): Invoke gt_LOCALE_EN_UTF8, gt_LOCALE_FR_UTF8.
        (Makefile.am): Arrange to compile test-mbschr1 and test-mbschr2 and to
        run test-mbschr1.sh, test-mbschr2.sh.
        * tests/test-mbsrchr2.sh: Renamed from tests/test-mbsrchr.sh.
        * tests/test-mbsrchr2.c: Renamed from tests/test-mbsrchr.c.
        * tests/test-mbsrchr1.sh: New file, based on
        tests/test-mbmemcasecmp-3.sh.
        * tests/test-mbsrchr1.c: New file.
        * modules/mbsrchr-tests (Files): Update accordingly. Add locale-en.m4,
        locale-fr.m4.
        (configure.ac): Invoke gt_LOCALE_EN_UTF8, gt_LOCALE_FR_UTF8.
        (Makefile.am): Arrange to compile test-mbsrchr1 and test-mbsrchr2 and to
        run test-mbsrchr1.sh, test-mbsrchr2.sh.
        * tests/test-mbscspn.c (OR): New macro, copied from
        tests/test-mbsnlen.c.
        (main): Add test cases with incomplete characters.
        * tests/test-mbspbrk.c (OR): New macro, copied from
        tests/test-mbsnlen.c.
        (main): Add test cases with incomplete characters.
        * tests/test-mbsspn.c (OR): New macro, copied from
        tests/test-mbsnlen.c.
        (main): Add test cases with incomplete characters.

2026-05-25  Bruno Haible  <[email protected]>

        mbuiter: Implement multi-byte per encoding error (MEE) consistently.
        * lib/mbuiter.h: Include mbiter-aux.h.
        (struct mbuiter_multi): Add field is_utf8.
        (mbuiter_multi_next): Invoke mbiter_is_utf8,
        mbiter_utf8_maximal_subpart.
        (mbui_init): Initialize the field is_utf8.
        * modules/mbuiter (Depends-on): Add mbiter-aux.
        * tests/test-mbsstr2.c (OR): New macro, copied from
        tests/test-mbsnlen.c.
        (main): Add test cases with incomplete characters.
        * tests/test-mbsstr1.c: Update comments.
        * tests/test-mbsstr3.c: Likewise.
        * tests/test-mbscasestr2.c (OR): New macro, copied from
        tests/test-mbsnlen.c.
        (main): Add test cases with incomplete characters.
        * tests/test-mbscasestr1.c: Update comments.
        * tests/test-mbscasestr3.c: Likewise.
        * tests/test-mbscasestr4.c: Likewise.

2026-05-25  Bruno Haible  <[email protected]>

        mbiterf: Implement multi-byte per encoding error (MEE) consistently.
        * lib/mbiterf.h: Include mbiter-aux.h.
        (struct mbif_state): Add field is_utf8.
        (mbiterf_next): Invoke mbiter_is_utf8, mbiter_utf8_maximal_subpart.
        (mbif_init): Initialize the field is_utf8.
        * modules/mbiterf (Depends-on): Add mbiter-aux.
        * tests/test-mbsnlen.c (main): Add test cases with incomplete characters
        not at the end of the string.

2026-05-25  Bruno Haible  <[email protected]>

        mbiter: Implement multi-byte per encoding error (MEE) consistently.
        * lib/mbiter.h: Include mbiter-aux.h.
        (struct mbiter_multi): Add field is_utf8.
        (mbiter_multi_next): Invoke mbiter_is_utf8, mbiter_utf8_maximal_subpart.
        (mbi_init): Initialize the field is_utf8.
        * modules/mbiter (Depends-on): Add mbiter-aux.
        * tests/test-mbs_startswith2.c (main): Add test cases with incomplete
        characters not at the end of the string.
        * tests/test-mbs_endswith2.c (OR): New macro, copied from
        tests/test-mbsnlen.c.
        (main): Add test cases with incomplete characters not at the end of the
        string.

2026-05-25  Bruno Haible  <[email protected]>

        mbiter-aux: New module.
        * lib/mbiter-aux.h: New file.
        * lib/mbiter-aux.c: New file. mbiter_is_utf8 is based on
        lib/localeinfo.c.
        * modules/mbiter-aux: New file.
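
The is_utf8 fields mentioned in the entries above are initialised to -1 and
filled in on first use by mbiter_is_utf8, so the locale is probed at most
once per iterator. A minimal sketch of that lazy tri-state cache pattern
follows. The names cached_flag, sample_probe and probe_calls are
hypothetical; sample_probe merely stands in for the real mbrtoc32-based
check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an expensive locale query; counts its calls.  */
int probe_calls = 0;

bool
sample_probe (void)
{
  probe_calls++;
  return true;
}

/* Returns the cached answer, running PROBE only if *CACHE is still -1.
   *CACHE holds -1 (unknown), 0 (false) or 1 (true).  */
bool
cached_flag (int *cache, bool (*probe) (void))
{
  assert (cache != NULL);
  if (*cache < 0)
    *cache = probe ();
  return *cache;
}
```

Keeping the cache in the iterator state rather than in a global variable
means that a later change of locale affects newly initialised iterators
without any explicit invalidation.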

From 86e2abad044fb89d3a848a9117b91e5f25946c1a Mon Sep 17 00:00:00 2001
From: Bruno Haible <[email protected]>
Date: Mon, 25 May 2026 18:58:12 +0200
Subject: [PATCH 1/5] mbiter-aux: New module.

* lib/mbiter-aux.h: New file.
* lib/mbiter-aux.c: New file. mbiter_is_utf8 is based on
lib/localeinfo.c.
* modules/mbiter-aux: New file.
---
 ChangeLog          |  8 +++++
 lib/mbiter-aux.c   | 81 ++++++++++++++++++++++++++++++++++++++++++++++
 lib/mbiter-aux.h   | 44 +++++++++++++++++++++++++
 modules/mbiter-aux | 31 ++++++++++++++++++
 4 files changed, 164 insertions(+)
 create mode 100644 lib/mbiter-aux.c
 create mode 100644 lib/mbiter-aux.h
 create mode 100644 modules/mbiter-aux

diff --git a/ChangeLog b/ChangeLog
index 7c1b291c75..7fdeebc8dd 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,11 @@
+2026-05-25  Bruno Haible  <[email protected]>
+
+	mbiter-aux: New module.
+	* lib/mbiter-aux.h: New file.
+	* lib/mbiter-aux.c: New file. mbiter_is_utf8 is based on
+	lib/localeinfo.c.
+	* modules/mbiter-aux: New file.
+
 2026-05-25  Waldemar Brodkorb  <[email protected]>
 
 	mbrtoc32: do not optimze for uClibc-ng
diff --git a/lib/mbiter-aux.c b/lib/mbiter-aux.c
new file mode 100644
index 0000000000..125e02ebe3
--- /dev/null
+++ b/lib/mbiter-aux.c
@@ -0,0 +1,81 @@
+/* Auxiliary functions for iterating through multibyte strings.
+   Copyright (C) 2026 Free Software Foundation, Inc.
+
+   This file is free software: you can redistribute it and/or modify
+   it under the terms of the GNU Lesser General Public License as
+   published by the Free Software Foundation; either version 2.1 of the
+   License, or (at your option) any later version.
+
+   This file is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+/* Written by Bruno Haible <[email protected]>.  */
+
+#include <config.h>
+
+/* Specification.  */
+#include "mbiter-aux.h"
+
+#include <uchar.h>
+
+bool
+mbiter_is_utf8 (int *cache)
+{
+  if (*cache < 0)
+    {
+      /* UTF-8 is the only encoding in use which maps the bytes 0xC4 0x80
+         to U+0100.  (See libiconv/tests/*.TXT for all the mapping tables.)
+         We can assume that in this case, the char32_t encoding is Unicode
+         (not platform-dependent like for other locale encodings).  */
+      mbstate_t state; mbszero (&state);
+      char32_t wc;
+      *cache = (mbrtoc32 (&wc, "\xc4\x80", 2, &state) == 2 && wc == 0x100);
+    }
+  return *cache;
+}
+
+/* If the current locale encoding is UTF-8 and a preceding
+     mbrtoc32 (&uc, S, N, &state)
+   invocation returned (size_t) -1, this function returns the number of
+   initial bytes that form a maximal subpart in the sense of
+   https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf page 127..129.
+   The result is >= 1, <= N.  */
+size_t
+mbiter_utf8_maximal_subpart (const char *s, size_t n)
+{
+  /* Based on lib/unistr/u8-mbtouc.c.  */
+  if (n >= 2)
+    {
+      unsigned char c = (unsigned char) *s;
+      if (c >= 0xe0)
+        {
+          if (c < 0xf0)
+            {
+              unsigned char c2 = (unsigned char) s[1];
+              if ((c2 ^ 0x80) < 0x40
+                  && (c >= 0xe1 || c2 >= 0xa0)
+                  && (c != 0xed || c2 < 0xa0))
+                return 2;
+            }
+          else if (c <= 0xf4)
+            {
+              unsigned char c2 = (unsigned char) s[1];
+              if ((c2 ^ 0x80) < 0x40
+                  && (c >= 0xf1 || c2 >= 0x90)
+                  && (c < 0xf4 || (/* c == 0xf4 && */ c2 < 0x90)))
+                {
+                  if (n >= 3 && ((unsigned char) s[2] ^ 0x80) < 0x40)
+                    return 3;
+                  else
+                    return 2;
+                }
+            }
+        }
+    }
+  return 1;
+}
diff --git a/lib/mbiter-aux.h b/lib/mbiter-aux.h
new file mode 100644
index 0000000000..972b4c0264
--- /dev/null
+++ b/lib/mbiter-aux.h
@@ -0,0 +1,44 @@
+/* Auxiliary functions for iterating through multibyte strings.
+   Copyright (C) 2026 Free Software Foundation, Inc.
+
+   This file is free software: you can redistribute it and/or modify
+   it under the terms of the GNU Lesser General Public License as
+   published by the Free Software Foundation; either version 2.1 of the
+   License, or (at your option) any later version.
+
+   This file is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+/* Written by Bruno Haible <[email protected]>.  */
+
+#ifndef _MBITER_AUX_H
+#define _MBITER_AUX_H 1
+
+#include <stddef.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* Determines whether the current locale encoding is UTF-8.
+   Stores the value in *CACHE, that should be pre-initialized with -1.  */
+extern bool mbiter_is_utf8 (int *cache);
+
+/* If the current locale encoding is UTF-8 and a preceding
+     mbrtoc32 (&uc, S, N, &state)
+   invocation returned (size_t) -1, this function returns the number of
+   initial bytes that form a maximal subpart in the sense of
+   https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf page 127..129.
+   The result is >= 1, <= N.  */
+extern size_t mbiter_utf8_maximal_subpart (const char *s, size_t n);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _MBITER_AUX_H */
diff --git a/modules/mbiter-aux b/modules/mbiter-aux
new file mode 100644
index 0000000000..55086bdc24
--- /dev/null
+++ b/modules/mbiter-aux
@@ -0,0 +1,31 @@
+Description:
+Auxiliary functions for iterating through multibyte strings.
+
+Files:
+lib/mbiter-aux.h
+lib/mbiter-aux.c
+
+Depends-on:
+mbrtoc32
+mbsinit
+mbszero
+bool
+
+configure.ac:
+
+Makefile.am:
+lib_SOURCES += mbiter-aux.h mbiter-aux.c
+
+Include:
+"mbiter-aux.h"
+
+Link:
+$(LTLIBUNISTRING) when linking with libtool, $(LIBUNISTRING) otherwise
+$(MBRTOWC_LIB)
+$(LTLIBC32CONV) when linking with libtool, $(LIBC32CONV) otherwise
+
+License:
+LGPLv2+
+
+Maintainer:
+all
-- 
2.54.0

From 1429a919428e4dd697903e685476025034315541 Mon Sep 17 00:00:00 2001
From: Bruno Haible <[email protected]>
Date: Mon, 25 May 2026 19:06:55 +0200
Subject: [PATCH 2/5] mbiter: Implement multi-byte per encoding error
 consistently.

* lib/mbiter.h: Include mbiter-aux.h.
(struct mbiter_multi): Add field is_utf8.
(mbiter_multi_next): Invoke mbiter_is_utf8, mbiter_utf8_maximal_subpart.
(mbi_init): Initialize the field is_utf8.
* modules/mbiter (Depends-on): Add mbiter-aux.
* tests/test-mbs_startswith2.c (main): Add test cases with incomplete
characters not at the end of the string.
* tests/test-mbs_endswith2.c (OR): New macro, copied from
tests/test-mbsnlen.c.
(main): Add test cases with incomplete characters not at the end of the
string.
---
 ChangeLog                    | 15 +++++++++++++++
 lib/mbiter.h                 | 22 +++++++++++++++-------
 modules/mbiter               |  1 +
 tests/test-mbs_endswith2.c   | 17 +++++++++++++++++
 tests/test-mbs_startswith2.c |  3 +++
 5 files changed, 51 insertions(+), 7 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 7fdeebc8dd..891109da37 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,18 @@
+2026-05-25  Bruno Haible  <[email protected]>
+
+	mbiter: Implement multi-byte per encoding error (MEE) consistently.
+	* lib/mbiter.h: Include mbiter-aux.h.
+	(struct mbiter_multi): Add field is_utf8.
+	(mbiter_multi_next): Invoke mbiter_is_utf8, mbiter_utf8_maximal_subpart.
+	(mbi_init): Initialize the field is_utf8.
+	* modules/mbiter (Depends-on): Add mbiter-aux.
+	* tests/test-mbs_startswith2.c (main): Add test cases with incomplete
+	characters not at the end of the string.
+	* tests/test-mbs_endswith2.c (OR): New macro, copied from
+	tests/test-mbsnlen.c.
+	(main): Add test cases with incomplete characters not at the end of the
+	string.
+
 2026-05-25  Bruno Haible  <[email protected]>
 
 	mbiter-aux: New module.
diff --git a/lib/mbiter.h b/lib/mbiter.h
index 78bcd39d17..d069f2773d 100644
--- a/lib/mbiter.h
+++ b/lib/mbiter.h
@@ -95,6 +95,7 @@
 #include <wchar.h>
 
 #include "mbchar.h"
+#include "mbiter-aux.h"
 
 _GL_INLINE_HEADER_BEGIN
 #ifndef MBITER_INLINE
@@ -119,6 +120,7 @@ struct mbiter_multi
                            before and after every mbiter_multi_next invocation.
                          */
   bool next_done;       /* true if mbi_avail has already filled the following */
+  int is_utf8;          /* A cache of mbiter_is_utf8.  */
   struct mbchar cur;    /* the current character:
         const char *cur.ptr          pointer to current character
         The following are only valid after mbi_avail.
@@ -155,14 +157,18 @@ mbiter_multi_next (struct mbiter_multi *iter)
       assert (mbsinit (&iter->state));
       #if !GNULIB_MBRTOC32_REGULAR
       iter->in_shift = true;
-    with_shift:
+    with_shift:;
       #endif
-      iter->cur.bytes = mbrtoc32 (&iter->cur.wc, iter->cur.ptr,
-                                  iter->limit - iter->cur.ptr, &iter->state);
+      size_t avail_bytes = iter->limit - iter->cur.ptr;
+      iter->cur.bytes = mbrtoc32 (&iter->cur.wc, iter->cur.ptr, avail_bytes,
+                                  &iter->state);
       if (iter->cur.bytes == (size_t) -1)
         {
           /* An invalid multibyte sequence was encountered.  */
-          iter->cur.bytes = 1;
+          iter->cur.bytes =
+            (mbiter_is_utf8 (&iter->is_utf8)
+             ? mbiter_utf8_maximal_subpart (iter->cur.ptr, avail_bytes)
+             : 1);
           iter->cur.wc_valid = false;
           /* Allow the next invocation to continue from a sane state.  */
           #if !GNULIB_MBRTOC32_REGULAR
@@ -173,7 +179,7 @@ mbiter_multi_next (struct mbiter_multi *iter)
       else if (iter->cur.bytes == (size_t) -2)
         {
           /* An incomplete multibyte character at the end.  */
-          iter->cur.bytes = iter->limit - iter->cur.ptr;
+          iter->cur.bytes = avail_bytes;
           iter->cur.wc_valid = false;
           #if !GNULIB_MBRTOC32_REGULAR
           /* Cause the next mbi_avail invocation to return false.  */
@@ -237,13 +243,15 @@ typedef struct mbiter_multi mbi_iterator_t;
 #define mbi_init(iter, startptr, length) \
   ((iter).cur.ptr = (startptr), (iter).limit = (iter).cur.ptr + (length), \
    (iter).in_shift = false, mbszero (&(iter).state), \
-   (iter).next_done = false)
+   (iter).next_done = false, \
+   (iter).is_utf8 = -1)
 #else
 /* Optimized: no in_shift.  */
 #define mbi_init(iter, startptr, length) \
   ((iter).cur.ptr = (startptr), (iter).limit = (iter).cur.ptr + (length), \
    mbszero (&(iter).state), \
-   (iter).next_done = false)
+   (iter).next_done = false, \
+   (iter).is_utf8 = -1)
 #endif
 #if !GNULIB_MBRTOC32_REGULAR
 #define mbi_avail(iter) \
diff --git a/modules/mbiter b/modules/mbiter
index 19c75b4ae1..c16d36e213 100644
--- a/modules/mbiter
+++ b/modules/mbiter
@@ -13,6 +13,7 @@ mbchar
 mbrtoc32
 mbsinit
 mbszero
+mbiter-aux
 uchar-h
 bool
 
diff --git a/tests/test-mbs_endswith2.c b/tests/test-mbs_endswith2.c
index 17ccc1f6e0..d9bcc9f881 100644
--- a/tests/test-mbs_endswith2.c
+++ b/tests/test-mbs_endswith2.c
@@ -25,6 +25,20 @@
 
 #include "macros.h"
 
+/* The mcel-based implementation of mbsnlen behaves differently than the
+   original one.  Namely, for invalid/incomplete byte sequences:
+   Where we ideally should have multi-byte-per-encoding-error (MEE) behaviour
+   everywhere, mcel implements single-byte-per-encoding-error (SEE) behaviour.
+   See <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00131.html>,
+       <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00145.html>.
+   Therefore, here we have different expected results, depending on the
+   implementation.  */
+#if GNULIB_MCEL_PREFER
+# define OR(a,b) b
+#else
+# define OR(a,b) a
+#endif
+
 int
 main ()
 {
@@ -92,10 +106,13 @@ main ()
   /* "\341\200\240" = 0xE1 0x80 0xA0 = U+1020.  */
   ASSERT (!mbs_endswith ("\341\200\240", "\200\240"));
   ASSERT (!mbs_endswith ("\341\200\240", "\240"));
+  ASSERT (mbs_endswith ("\341\200X", "\200X") == OR(false,true));
   /* "\360\221\222\240" = 0xF0 0x91 0x92 0xA0 = U+114A0.  */
   ASSERT (!mbs_endswith ("\360\221\222\240", "\221\222\240"));
   ASSERT (!mbs_endswith ("\360\221\222\240", "\222\240"));
   ASSERT (!mbs_endswith ("\360\221\222\240", "\240"));
+  ASSERT (mbs_endswith ("\360\221\222X", "\222X") == OR(false,true));
+  ASSERT (mbs_endswith ("\360\221X", "\221X") == OR(false,true));
 
   /* Two invalid characters should match only if they are identical.  */
   /* "\301\246" = 0xC1 0xA6 is invalid.
diff --git a/tests/test-mbs_startswith2.c b/tests/test-mbs_startswith2.c
index 0ab6a9eeaf..069503af3e 100644
--- a/tests/test-mbs_startswith2.c
+++ b/tests/test-mbs_startswith2.c
@@ -114,6 +114,7 @@ main ()
   ASSERT (!mbs_startswith ("\341\200\240", "\341\200"));
   ASSERT (!mbs_startswith ("\341\200\240", "\341"));
   ASSERT (mbs_startswith ("\341\200", "\341") == OR(false,true));
+  ASSERT (mbs_startswith ("\341\200\341\200", "\341\200"));
   /* "\360\221\222\240" = 0xF0 0x91 0x92 0xA0 = U+114A0.  */
   ASSERT (!mbs_startswith ("\360\221\222\240", "\360\221\222"));
   ASSERT (!mbs_startswith ("\360\221\222\240", "\360\221"));
@@ -121,6 +122,8 @@ main ()
   ASSERT (mbs_startswith ("\360\221\222", "\360\221") == OR(false,true));
   ASSERT (mbs_startswith ("\360\221\222", "\360") == OR(false,true));
   ASSERT (mbs_startswith ("\360\221", "\360") == OR(false,true));
+  ASSERT (mbs_startswith ("\360\221\222\360\221\222", "\360\221\222"));
+  ASSERT (mbs_startswith ("\360\221\360\221", "\360\221"));
 
   /* "\355\240\200" = 0xED 0xA0 0x80 = U+D800 is invalid.
      In fact, "\355\240" = 0xED 0xA0 is already invalid, see
-- 
2.54.0

From 151d374a800befdcd7f5d186b11a80c372408317 Mon Sep 17 00:00:00 2001
From: Bruno Haible <[email protected]>
Date: Tue, 26 May 2026 00:42:39 +0200
Subject: [PATCH 3/5] mbiterf: Implement multi-byte per encoding error (MEE)
 consistently.

* lib/mbiterf.h: Include mbiter-aux.h.
(struct mbif_state): Add field is_utf8.
(mbiterf_next): Invoke mbiter_is_utf8, mbiter_utf8_maximal_subpart.
(mbif_init): Initialize the field is_utf8.
* modules/mbiterf (Depends-on): Add mbiter-aux.
* tests/test-mbsnlen.c (main): Add test cases with incomplete characters
not at the end of the string.
---
 ChangeLog            | 11 +++++++++++
 lib/mbiterf.h        | 19 ++++++++++++++-----
 modules/mbiterf      |  1 +
 tests/test-mbsnlen.c |  9 +++++++++
 4 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 891109da37..1e7d2e1e94 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,14 @@
+2026-05-25  Bruno Haible  <[email protected]>
+
+	mbiterf: Implement multi-byte per encoding error (MEE) consistently.
+	* lib/mbiterf.h: Include mbiter-aux.h.
+	(struct mbif_state): Add field is_utf8.
+	(mbiterf_next): Invoke mbiter_is_utf8, mbiter_utf8_maximal_subpart.
+	(mbif_init): Initialize the field is_utf8.
+	* modules/mbiterf (Depends-on): Add mbiter-aux.
+	* tests/test-mbsnlen.c (main): Add test cases with incomplete characters
+	not at the end of the string.
+
 2026-05-25  Bruno Haible  <[email protected]>
 
 	mbiter: Implement multi-byte per encoding error (MEE) consistently.
diff --git a/lib/mbiterf.h b/lib/mbiterf.h
index 50a39c0f92..655c4d49c4 100644
--- a/lib/mbiterf.h
+++ b/lib/mbiterf.h
@@ -86,6 +86,7 @@
 #include <wchar.h>
 
 #include "mbchar.h"
+#include "mbiter-aux.h"
 
 _GL_INLINE_HEADER_BEGIN
 #ifndef MBITERF_INLINE
@@ -108,6 +109,7 @@ struct mbif_state
                         /* If GNULIB_MBRTOC32_REGULAR, it is in an initial state
                            before and after every mbiterf_next invocation.
                          */
+  int is_utf8;          /* A cache of mbiter_is_utf8.  */
 };
 
 MBITERF_INLINE mbchar_t
@@ -135,18 +137,23 @@ mbiterf_next (struct mbif_state *ps, const char *iter, const char *endptr)
       ps->in_shift = true;
     with_shift:;
       #endif
+      size_t avail_bytes = endptr - iter;
       size_t bytes;
       char32_t wc;
-      bytes = mbrtoc32 (&wc, iter, endptr - iter, &ps->state);
+      bytes = mbrtoc32 (&wc, iter, avail_bytes, &ps->state);
       if (bytes == (size_t) -1)
         {
           /* An invalid multibyte sequence was encountered.  */
+          size_t ebytes =
+            (mbiter_is_utf8 (&ps->is_utf8)
+             ? mbiter_utf8_maximal_subpart (iter, avail_bytes)
+             : 1);
           /* Allow the next invocation to continue from a sane state.  */
           #if !GNULIB_MBRTOC32_REGULAR
           ps->in_shift = false;
           #endif
           mbszero (&ps->state);
-          return (mbchar_t) { .ptr = iter, .bytes = 1, .wc_valid = false };
+          return (mbchar_t) { .ptr = iter, .bytes = ebytes, .wc_valid = false };
         }
       else if (bytes == (size_t) -2)
         {
@@ -156,7 +163,7 @@ mbiterf_next (struct mbif_state *ps, const char *iter, const char *endptr)
           #endif
           /* Whether to reset ps->state or not is not important; the string end
              is reached anyway.  */
-          return (mbchar_t) { .ptr = iter, .bytes = endptr - iter, .wc_valid = false };
+          return (mbchar_t) { .ptr = iter, .bytes = avail_bytes, .wc_valid = false };
         }
       else
         {
@@ -189,11 +196,13 @@ mbiterf_next (struct mbif_state *ps, const char *iter, const char *endptr)
 typedef struct mbif_state mbif_state_t;
 #if !GNULIB_MBRTOC32_REGULAR
 #define mbif_init(st) \
-  ((st).in_shift = false, mbszero (&(st).state))
+  ((st).in_shift = false, mbszero (&(st).state), \
+   (st).is_utf8 = -1)
 #else
 /* Optimized: no in_shift.  */
 #define mbif_init(st) \
-  (mbszero (&(st).state))
+  (mbszero (&(st).state), \
+   (st).is_utf8 = -1)
 #endif
 #if !GNULIB_MBRTOC32_REGULAR
 #define mbif_avail(st, iter, endptr) ((st).in_shift || ((iter) < (endptr)))
diff --git a/modules/mbiterf b/modules/mbiterf
index 9826b77c73..7825271c3b 100644
--- a/modules/mbiterf
+++ b/modules/mbiterf
@@ -13,6 +13,7 @@ mbchar
 mbrtoc32
 mbsinit
 mbszero
+mbiter-aux
 uchar-h
 bool
 
diff --git a/tests/test-mbsnlen.c b/tests/test-mbsnlen.c
index 66a4d70a88..4d4bfff475 100644
--- a/tests/test-mbsnlen.c
+++ b/tests/test-mbsnlen.c
@@ -82,9 +82,18 @@ main ()
   ASSERT (mbsnlen ("\360\237\220\203", 4) == 1);
   ASSERT (mbsnlen ("\360\237\220\203", 5) == 2);
 
+  /* Incomplete characters.  See
+     https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf
+     page 128 table 3-11.  */
+
   ASSERT (mbsnlen ("\303", 1) == 1); /* invalid multibyte sequence */
+  ASSERT (mbsnlen ("\303\303", 2) == 2); /* 2x invalid multibyte sequence */
+
   ASSERT (mbsnlen ("\342\202", 2) == OR(1,2)); /* invalid multibyte sequence */
+  ASSERT (mbsnlen ("\342\202\342\202", 4) == 2 * OR(1,2)); /* 2x invalid multibyte sequence */
+
   ASSERT (mbsnlen ("\360\237\220", 3) == OR(1,3)); /* invalid multibyte sequence */
+  ASSERT (mbsnlen ("\360\237\220\360\237\220", 6) == 2 * OR(1,3)); /* 2x invalid multibyte sequence */
 
   return test_exit_status;
 }
-- 
2.54.0

From 8bfd5e31981e99480993c44de6f77e572ecc0a2a Mon Sep 17 00:00:00 2001
From: Bruno Haible <[email protected]>
Date: Tue, 26 May 2026 00:53:31 +0200
Subject: [PATCH 4/5] mbuiter: Implement multi-byte per encoding error (MEE)
 consistently.

* lib/mbuiter.h: Include mbiter-aux.h.
(struct mbuiter_multi): Add field is_utf8.
(mbuiter_multi_next): Invoke mbiter_is_utf8,
mbiter_utf8_maximal_subpart.
(mbui_init): Initialize the field is_utf8.
* modules/mbuiter (Depends-on): Add mbiter-aux.
* tests/test-mbsstr2.c (OR): New macro, copied from
tests/test-mbsnlen.c.
(main): Add test cases with incomplete characters.
* tests/test-mbsstr1.c: Update comments.
* tests/test-mbsstr3.c: Likewise.
* tests/test-mbscasestr2.c (OR): New macro, copied from
tests/test-mbsnlen.c.
(main): Add test cases with incomplete characters.
* tests/test-mbscasestr1.c: Update comments.
* tests/test-mbscasestr3.c: Likewise.
* tests/test-mbscasestr4.c: Likewise.
---
 ChangeLog                | 21 ++++++++++
 lib/mbuiter.h            | 17 +++++---
 modules/mbuiter          |  1 +
 tests/test-mbscasestr1.c |  2 +-
 tests/test-mbscasestr2.c | 84 +++++++++++++++++++++++++++++++++++++++-
 tests/test-mbscasestr3.c |  2 +-
 tests/test-mbscasestr4.c |  2 +-
 tests/test-mbsstr1.c     |  2 +-
 tests/test-mbsstr2.c     | 84 +++++++++++++++++++++++++++++++++++++++-
 tests/test-mbsstr3.c     |  2 +-
 10 files changed, 204 insertions(+), 13 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 1e7d2e1e94..7c6b25c471 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,24 @@
+2026-05-25  Bruno Haible  <[email protected]>
+
+	mbuiter: Implement multi-byte per encoding error (MEE) consistently.
+	* lib/mbuiter.h: Include mbiter-aux.h.
+	(struct mbuiter_multi): Add field is_utf8.
+	(mbuiter_multi_next): Invoke mbiter_is_utf8,
+	mbiter_utf8_maximal_subpart.
+	(mbui_init): Initialize the field is_utf8.
+	* modules/mbuiter (Depends-on): Add mbiter-aux.
+	* tests/test-mbsstr2.c (OR): New macro, copied from
+	tests/test-mbsnlen.c.
+	(main): Add test cases with incomplete characters.
+	* tests/test-mbsstr1.c: Update comments.
+	* tests/test-mbsstr3.c: Likewise.
+	* tests/test-mbscasestr2.c (OR): New macro, copied from
+	tests/test-mbsnlen.c.
+	(main): Add test cases with incomplete characters.
+	* tests/test-mbscasestr1.c: Update comments.
+	* tests/test-mbscasestr3.c: Likewise.
+	* tests/test-mbscasestr4.c: Likewise.
+
 2026-05-25  Bruno Haible  <[email protected]>
 
 	mbiterf: Implement multi-byte per encoding error (MEE) consistently.
diff --git a/lib/mbuiter.h b/lib/mbuiter.h
index 0f13e732e1..b5bb18305f 100644
--- a/lib/mbuiter.h
+++ b/lib/mbuiter.h
@@ -103,6 +103,7 @@
 #include <wchar.h>
 
 #include "mbchar.h"
+#include "mbiter-aux.h"
 #include "strnlen1.h"
 
 _GL_INLINE_HEADER_BEGIN
@@ -128,6 +129,7 @@ struct mbuiter_multi
                          */
   bool next_done;       /* true if mbui_avail has already filled the following */
   unsigned int cur_max; /* A cache of MB_CUR_MAX.  */
+  int is_utf8;          /* A cache of mbiter_is_utf8.  */
   struct mbchar cur;    /* the current character:
         const char *cur.ptr          pointer to current character
         The following are only valid after mbui_avail.
@@ -164,15 +166,18 @@ mbuiter_multi_next (struct mbuiter_multi *iter)
       assert (mbsinit (&iter->state));
       #if !GNULIB_MBRTOC32_REGULAR
       iter->in_shift = true;
-    with_shift:
+    with_shift:;
       #endif
-      iter->cur.bytes = mbrtoc32 (&iter->cur.wc, iter->cur.ptr,
-                                  strnlen1 (iter->cur.ptr, iter->cur_max),
+      size_t avail_bytes = strnlen1 (iter->cur.ptr, iter->cur_max);
+      iter->cur.bytes = mbrtoc32 (&iter->cur.wc, iter->cur.ptr, avail_bytes,
                                   &iter->state);
       if (iter->cur.bytes == (size_t) -1)
         {
           /* An invalid multibyte sequence was encountered.  */
-          iter->cur.bytes = 1;
+          iter->cur.bytes =
+            (mbiter_is_utf8 (&iter->is_utf8)
+             ? mbiter_utf8_maximal_subpart (iter->cur.ptr, avail_bytes)
+             : 1);
           iter->cur.wc_valid = false;
           /* Allow the next invocation to continue from a sane state.  */
           #if !GNULIB_MBRTOC32_REGULAR
@@ -243,14 +248,14 @@ typedef struct mbuiter_multi mbui_iterator_t;
   ((iter).cur.ptr = (startptr), \
    (iter).in_shift = false, mbszero (&(iter).state), \
    (iter).next_done = false, \
-   (iter).cur_max = MB_CUR_MAX)
+   (iter).cur_max = MB_CUR_MAX, (iter).is_utf8 = -1)
 #else
 /* Optimized: no in_shift.  */
 #define mbui_init(iter, startptr) \
   ((iter).cur.ptr = (startptr), \
    mbszero (&(iter).state), \
    (iter).next_done = false, \
-   (iter).cur_max = MB_CUR_MAX)
+   (iter).cur_max = MB_CUR_MAX, (iter).is_utf8 = -1)
 #endif
 #define mbui_avail(iter) \
   (mbuiter_multi_next (&(iter)), !mb_isnul ((iter).cur))
diff --git a/modules/mbuiter b/modules/mbuiter
index d9deba2d4b..f9daba54e2 100644
--- a/modules/mbuiter
+++ b/modules/mbuiter
@@ -13,6 +13,7 @@ mbchar
 mbrtoc32
 mbsinit
 mbszero
+mbiter-aux
 uchar-h
 bool
 strnlen1
diff --git a/tests/test-mbscasestr1.c b/tests/test-mbscasestr1.c
index 1d98bef956..ddc3ab84dd 100644
--- a/tests/test-mbscasestr1.c
+++ b/tests/test-mbscasestr1.c
@@ -1,4 +1,4 @@
-/* Test of case-insensitive searching in a string.
+/* Test of case-insensitive searching in a string in the "C" locale.
    Copyright (C) 2007-2026 Free Software Foundation, Inc.
 
    This program is free software: you can redistribute it and/or modify
diff --git a/tests/test-mbscasestr2.c b/tests/test-mbscasestr2.c
index 7f05ebf91f..66582eba40 100644
--- a/tests/test-mbscasestr2.c
+++ b/tests/test-mbscasestr2.c
@@ -1,4 +1,4 @@
-/* Test of searching in a string.
+/* Test of searching in a string in a UTF-8 locale.
    Copyright (C) 2007-2026 Free Software Foundation, Inc.
 
    This program is free software: you can redistribute it and/or modify
@@ -25,6 +25,20 @@
 
 #include "macros.h"
 
+/* The mcel-based implementation of mbsnlen behaves differently than the
+   original one.  Namely, for invalid/incomplete byte sequences:
+   Where we ideally should have multi-byte-per-encoding-error (MEE) behaviour
+   everywhere, mcel implements single-byte-per-encoding-error (SEE) behaviour.
+   See <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00131.html>,
+       <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00145.html>.
+   Therefore, here we have different expected results, depending on the
+   implementation.  */
+#if GNULIB_MCEL_PREFER
+# define OR(a,b) b
+#else
+# define OR(a,b) a
+#endif
+
 int
 main ()
 {
@@ -50,6 +64,74 @@ main ()
     ASSERT (result == NULL);
   }
 
+  /* Incomplete characters.  See
+     https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf
+     page 128 table 3-11.  */
+
+  /* "\341\200\240" = 0xE1 0x80 0xA0 = U+1020.  */
+  {
+    const char input[] = "f\341\200\341\200";
+    const char *result = mbscasestr (input, "");
+    ASSERT (result == input);
+  }
+  {
+    const char input[] = "f\341\200\341\200";
+    const char *result = mbscasestr (input, "\341\200");
+    ASSERT (result == input + 1);
+  }
+  {
+    const char input[] = "f\341\200\341\200";
+    const char *result = mbscasestr (input, "\200\341");
+    ASSERT (result == OR (NULL, input + 2));
+  }
+
+  /* "\360\221\222\240" = 0xF0 0x91 0x92 0xA0 = U+114A0.  */
+  {
+    const char input[] = "f\360\221\222\360\221\222";
+    const char *result = mbscasestr (input, "");
+    ASSERT (result == input);
+  }
+  {
+    const char input[] = "f\360\221\222\360\221\222";
+    const char *result = mbscasestr (input, "\360\221\222");
+    ASSERT (result == input + 1);
+  }
+  {
+    const char input[] = "f\360\221\222\360\221\222";
+    const char *result = mbscasestr (input, "\221\222\360\221");
+    ASSERT (result == OR (NULL, input + 2));
+  }
+  {
+    const char input[] = "f\360\221\222\360\221\222";
+    const char *result = mbscasestr (input, "\221\222\360");
+    ASSERT (result == OR (NULL, input + 2));
+  }
+  {
+    const char input[] = "f\360\221\222\360\221\222";
+    const char *result = mbscasestr (input, "\222\360\221");
+    ASSERT (result == OR (NULL, input + 3));
+  }
+  {
+    const char input[] = "f\360\221\222\360\221\222";
+    const char *result = mbscasestr (input, "\222\360");
+    ASSERT (result == OR (NULL, input + 3));
+  }
+  {
+    const char input[] = "f\360\221\360\221";
+    const char *result = mbscasestr (input, "");
+    ASSERT (result == input);
+  }
+  {
+    const char input[] = "f\360\221\360\221";
+    const char *result = mbscasestr (input, "\360\221");
+    ASSERT (result == input + 1);
+  }
+  {
+    const char input[] = "f\360\221\360\221";
+    const char *result = mbscasestr (input, "\221\360");
+    ASSERT (result == OR (NULL, input + 2));
+  }
+
   {
     const char input[] = "\303\204BC \303\204BCD\303\204B \303\204BCD\303\204BCD\303\204BDE"; /* "ÄBC ÄBCDÄB ÄBCDÄBCDÄBDE" */
     const char *result = mbscasestr (input, "\303\244BCD\303\204BD"); /* "äBCDÄBD" */
diff --git a/tests/test-mbscasestr3.c b/tests/test-mbscasestr3.c
index bccd34deaa..496d2293d6 100644
--- a/tests/test-mbscasestr3.c
+++ b/tests/test-mbscasestr3.c
@@ -1,4 +1,4 @@
-/* Test of case-insensitive searching in a string.
+/* Test of case-insensitive searching in a string in a GB18030 locale.
    Copyright (C) 2007-2026 Free Software Foundation, Inc.
 
    This program is free software: you can redistribute it and/or modify
diff --git a/tests/test-mbscasestr4.c b/tests/test-mbscasestr4.c
index 41ce6a91a1..7e8a026a05 100644
--- a/tests/test-mbscasestr4.c
+++ b/tests/test-mbscasestr4.c
@@ -1,4 +1,4 @@
-/* Test of case-insensitive searching in a string.
+/* Test of case-insensitive searching in a string in a Turkish locale.
    Copyright (C) 2007-2026 Free Software Foundation, Inc.
 
    This program is free software: you can redistribute it and/or modify
diff --git a/tests/test-mbsstr1.c b/tests/test-mbsstr1.c
index d7a73486a3..bcdbb5c83c 100644
--- a/tests/test-mbsstr1.c
+++ b/tests/test-mbsstr1.c
@@ -1,4 +1,4 @@
-/* Test of searching in a string.
+/* Test of searching in a string in the "C" locale.
    Copyright (C) 2007-2026 Free Software Foundation, Inc.
 
    This program is free software: you can redistribute it and/or modify
diff --git a/tests/test-mbsstr2.c b/tests/test-mbsstr2.c
index 93db31ef94..184abbe8f4 100644
--- a/tests/test-mbsstr2.c
+++ b/tests/test-mbsstr2.c
@@ -1,4 +1,4 @@
-/* Test of searching in a string.
+/* Test of searching in a string in a UTF-8 locale.
    Copyright (C) 2007-2026 Free Software Foundation, Inc.
 
    This program is free software: you can redistribute it and/or modify
@@ -25,6 +25,20 @@
 
 #include "macros.h"
 
+/* The mcel-based implementation of mbsnlen behaves differently than the
+   original one.  Namely, for invalid/incomplete byte sequences:
+   Where we ideally should have multi-byte-per-encoding-error (MEE) behaviour
+   everywhere, mcel implements single-byte-per-encoding-error (SEE) behaviour.
+   See <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00131.html>,
+       <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00145.html>.
+   Therefore, here we have different expected results, depending on the
+   implementation.  */
+#if GNULIB_MCEL_PREFER
+# define OR(a,b) b
+#else
+# define OR(a,b) a
+#endif
+
 int
 main ()
 {
@@ -50,6 +64,74 @@ main ()
     ASSERT (result == NULL);
   }
 
+  /* Incomplete characters.  See
+     https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf
+     page 128 table 3-11.  */
+
+  /* "\341\200\240" = 0xE1 0x80 0xA0 = U+1020.  */
+  {
+    const char input[] = "f\341\200\341\200";
+    const char *result = mbsstr (input, "");
+    ASSERT (result == input);
+  }
+  {
+    const char input[] = "f\341\200\341\200";
+    const char *result = mbsstr (input, "\341\200");
+    ASSERT (result == input + 1);
+  }
+  {
+    const char input[] = "f\341\200\341\200";
+    const char *result = mbsstr (input, "\200\341");
+    ASSERT (result == OR (NULL, input + 2));
+  }
+
+  /* "\360\221\222\240" = 0xF0 0x91 0x92 0xA0 = U+114A0.  */
+  {
+    const char input[] = "f\360\221\222\360\221\222";
+    const char *result = mbsstr (input, "");
+    ASSERT (result == input);
+  }
+  {
+    const char input[] = "f\360\221\222\360\221\222";
+    const char *result = mbsstr (input, "\360\221\222");
+    ASSERT (result == input + 1);
+  }
+  {
+    const char input[] = "f\360\221\222\360\221\222";
+    const char *result = mbsstr (input, "\221\222\360\221");
+    ASSERT (result == OR (NULL, input + 2));
+  }
+  {
+    const char input[] = "f\360\221\222\360\221\222";
+    const char *result = mbsstr (input, "\221\222\360");
+    ASSERT (result == OR (NULL, input + 2));
+  }
+  {
+    const char input[] = "f\360\221\222\360\221\222";
+    const char *result = mbsstr (input, "\222\360\221");
+    ASSERT (result == OR (NULL, input + 3));
+  }
+  {
+    const char input[] = "f\360\221\222\360\221\222";
+    const char *result = mbsstr (input, "\222\360");
+    ASSERT (result == OR (NULL, input + 3));
+  }
+  {
+    const char input[] = "f\360\221\360\221";
+    const char *result = mbsstr (input, "");
+    ASSERT (result == input);
+  }
+  {
+    const char input[] = "f\360\221\360\221";
+    const char *result = mbsstr (input, "\360\221");
+    ASSERT (result == input + 1);
+  }
+  {
+    const char input[] = "f\360\221\360\221";
+    const char *result = mbsstr (input, "\221\360");
+    ASSERT (result == OR (NULL, input + 2));
+  }
+
   {
     const char input[] = "\303\204BC \303\204BCD\303\204B \303\204BCD\303\204BCD\303\204BDE"; /* "ÄBC ÄBCDÄB ÄBCDÄBCDÄBDE" */
     const char *result = mbsstr (input, "\303\204BCD\303\204BD"); /* "ÄBCDÄBD" */
diff --git a/tests/test-mbsstr3.c b/tests/test-mbsstr3.c
index 71196150b6..66dc276a55 100644
--- a/tests/test-mbsstr3.c
+++ b/tests/test-mbsstr3.c
@@ -1,4 +1,4 @@
-/* Test of searching in a string.
+/* Test of searching in a string in a GB18030 locale.
    Copyright (C) 2007-2026 Free Software Foundation, Inc.
 
    This program is free software: you can redistribute it and/or modify
-- 
2.54.0

From 1e7cbc30fd9fa8790583e7c24ac3ca2f46542bdf Mon Sep 17 00:00:00 2001
From: Bruno Haible <[email protected]>
Date: Tue, 26 May 2026 01:29:29 +0200
Subject: [PATCH 5/5] mbuiterf: Implement multi-byte per encoding error (MEE)
 consistently.

* lib/mbuiterf.h: Include mbiter-aux.h.
(struct mbuif_state): Add field is_utf8.
(mbuiterf_next): Invoke mbiter_is_utf8, mbiter_utf8_maximal_subpart.
(mbuif_init): Initialize the field is_utf8.
* modules/mbuiterf (Depends-on): Add mbiter-aux.
* tests/test-mbslen.c (OR): New macro, copied from
tests/test-mbsnlen.c.
(main): Add more test cases with incomplete characters.
* tests/test-mbschr2.sh: Renamed from tests/test-mbschr.sh.
* tests/test-mbschr2.c: Renamed from tests/test-mbschr.c.
* tests/test-mbschr1.sh: New file, based on
tests/test-mbmemcasecmp-3.sh.
* tests/test-mbschr1.c: New file.
* modules/mbschr-tests (Files): Update accordingly. Add locale-en.m4,
locale-fr.m4.
(configure.ac): Invoke gt_LOCALE_EN_UTF8, gt_LOCALE_FR_UTF8.
(Makefile.am): Arrange to compile test-mbschr1 and test-mbschr2 and to
run test-mbschr1.sh, test-mbschr2.sh.
* tests/test-mbsrchr2.sh: Renamed from tests/test-mbsrchr.sh.
* tests/test-mbsrchr2.c: Renamed from tests/test-mbsrchr.c.
* tests/test-mbsrchr1.sh: New file, based on
tests/test-mbmemcasecmp-3.sh.
* tests/test-mbsrchr1.c: New file.
* modules/mbsrchr-tests (Files): Update accordingly. Add locale-en.m4,
locale-fr.m4.
(configure.ac): Invoke gt_LOCALE_EN_UTF8, gt_LOCALE_FR_UTF8.
(Makefile.am): Arrange to compile test-mbsrchr1 and test-mbsrchr2 and to
run test-mbsrchr1.sh, test-mbsrchr2.sh.
* tests/test-mbscspn.c (OR): New macro, copied from
tests/test-mbsnlen.c.
(main): Add test cases with incomplete characters.
* tests/test-mbspbrk.c (OR): New macro, copied from
tests/test-mbsnlen.c.
(main): Add test cases with incomplete characters.
* tests/test-mbsspn.c (OR): New macro, copied from
tests/test-mbsnlen.c.
(main): Add test cases with incomplete characters.
---
 ChangeLog                                   |  41 ++++++++
 lib/mbuiterf.h                              |  15 ++-
 modules/mbschr-tests                        |  22 ++--
 modules/mbsrchr-tests                       |  22 ++--
 modules/mbuiterf                            |   1 +
 tests/test-mbschr1.c                        | 107 ++++++++++++++++++++
 tests/test-mbschr1.sh                       |  23 +++++
 tests/{test-mbschr.c => test-mbschr2.c}     |   2 +-
 tests/{test-mbschr.sh => test-mbschr2.sh}   |   2 +-
 tests/test-mbscspn.c                        |  78 ++++++++++++++
 tests/test-mbslen.c                         |  28 ++++-
 tests/test-mbspbrk.c                        |  78 ++++++++++++++
 tests/test-mbsrchr1.c                       | 107 ++++++++++++++++++++
 tests/test-mbsrchr1.sh                      |  23 +++++
 tests/{test-mbsrchr.c => test-mbsrchr2.c}   |   0
 tests/{test-mbsrchr.sh => test-mbsrchr2.sh} |   2 +-
 tests/test-mbsspn.c                         |  58 +++++++++++
 17 files changed, 587 insertions(+), 22 deletions(-)
 create mode 100644 tests/test-mbschr1.c
 create mode 100755 tests/test-mbschr1.sh
 rename tests/{test-mbschr.c => test-mbschr2.c} (96%)
 rename tests/{test-mbschr.sh => test-mbschr2.sh} (90%)
 create mode 100644 tests/test-mbsrchr1.c
 create mode 100755 tests/test-mbsrchr1.sh
 rename tests/{test-mbsrchr.c => test-mbsrchr2.c} (100%)
 rename tests/{test-mbsrchr.sh => test-mbsrchr2.sh} (90%)
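
Note for readers: the maximal-subpart rule of table 3-11 that this patch
implements can be sketched in standalone C. This is an illustrative sketch
only, not the code in lib/mbiter-aux.h. The names utf8_unit_length and
count_units are hypothetical, and the byte ranges are assumed from table 3-7
of the same Unicode chapter. With mee = 1 an ill-formed run is consumed as one
unit (MEE); with mee = 0 each ill-formed byte is its own unit (SEE).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Gnulib.
   Returns the length of the unit starting at S, where N > 0 bytes are
   available: either a complete well-formed UTF-8 sequence (*VALID = 1)
   or the maximal subpart of an ill-formed sequence (*VALID = 0).  */
static size_t
utf8_unit_length (const unsigned char *s, size_t n, int *valid)
{
  unsigned char c;
  unsigned char lo = 0x80, hi = 0xBF; /* allowed range of the 2nd byte */
  size_t need;                        /* length of a complete sequence */
  size_t len = 1;

  assert (n > 0);
  c = s[0];
  *valid = 0;
  if (c < 0x80)
    {
      *valid = 1;
      return 1;
    }
  else if (c >= 0xC2 && c <= 0xDF)
    need = 2;
  else if (c >= 0xE0 && c <= 0xEF)
    {
      need = 3;
      if (c == 0xE0)
        lo = 0xA0;              /* excludes overlong encodings */
      else if (c == 0xED)
        hi = 0x9F;              /* excludes surrogates */
    }
  else if (c >= 0xF0 && c <= 0xF4)
    {
      need = 4;
      if (c == 0xF0)
        lo = 0x90;              /* excludes overlong encodings */
      else if (c == 0xF4)
        hi = 0x8F;              /* excludes values above U+10FFFF */
    }
  else
    return 1;                   /* 0x80..0xC1, 0xF5..0xFF never start a sequence */

  /* Extend as long as the bytes still form a prefix of a well-formed
     sequence.  Only the 2nd byte has a restricted range.  */
  while (len < need && len < n && s[len] >= lo && s[len] <= hi)
    {
      len++;
      lo = 0x80;
      hi = 0xBF;
    }
  *valid = (len == need);
  return len;
}

/* Hypothetical helper: counts the units of STR, MEE-style if MEE is
   nonzero, SEE-style otherwise.  */
static size_t
count_units (const char *str, int mee)
{
  const unsigned char *s = (const unsigned char *) str;
  size_t n = strlen (str);
  size_t count = 0;

  while (n > 0)
    {
      int valid;
      size_t len = utf8_unit_length (s, n, &valid);
      if (!valid && !mee)
        len = 1;                /* SEE: one byte per encoding error */
      s += len;
      n -= len;
      count++;
    }
  return count;
}
```

Under these assumptions, count_units ("\360\221\222", 1) yields 1 and
count_units ("\360\221\222", 0) yields 3, which corresponds to the OR(1,3)
expectation in tests/test-mbslen.c.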

diff --git a/ChangeLog b/ChangeLog
index 7c6b25c471..3b9f26165d 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,44 @@
+2026-05-25  Bruno Haible  <[email protected]>
+
+	mbuiterf: Implement multi-byte per encoding error (MEE) consistently.
+	* lib/mbuiterf.h: Include mbiter-aux.h.
+	(struct mbuif_state): Add field is_utf8.
+	(mbuiterf_next): Invoke mbiter_is_utf8, mbiter_utf8_maximal_subpart.
+	(mbuif_init): Initialize the field is_utf8.
+	* modules/mbuiterf (Depends-on): Add mbiter-aux.
+	* tests/test-mbslen.c (OR): New macro, copied from
+	tests/test-mbsnlen.c.
+	(main): Add more test cases with incomplete characters.
+	* tests/test-mbschr2.sh: Renamed from tests/test-mbschr.sh.
+	* tests/test-mbschr2.c: Renamed from tests/test-mbschr.c.
+	* tests/test-mbschr1.sh: New file, based on
+	tests/test-mbmemcasecmp-3.sh.
+	* tests/test-mbschr1.c: New file.
+	* modules/mbschr-tests (Files): Update accordingly. Add locale-en.m4,
+	locale-fr.m4.
+	(configure.ac): Invoke gt_LOCALE_EN_UTF8, gt_LOCALE_FR_UTF8.
+	(Makefile.am): Arrange to compile test-mbschr1 and test-mbschr2 and to
+	run test-mbschr1.sh, test-mbschr2.sh.
+	* tests/test-mbsrchr2.sh: Renamed from tests/test-mbsrchr.sh.
+	* tests/test-mbsrchr2.c: Renamed from tests/test-mbsrchr.c.
+	* tests/test-mbsrchr1.sh: New file, based on
+	tests/test-mbmemcasecmp-3.sh.
+	* tests/test-mbsrchr1.c: New file.
+	* modules/mbsrchr-tests (Files): Update accordingly. Add locale-en.m4,
+	locale-fr.m4.
+	(configure.ac): Invoke gt_LOCALE_EN_UTF8, gt_LOCALE_FR_UTF8.
+	(Makefile.am): Arrange to compile test-mbsrchr1 and test-mbsrchr2 and to
+	run test-mbsrchr1.sh, test-mbsrchr2.sh.
+	* tests/test-mbscspn.c (OR): New macro, copied from
+	tests/test-mbsnlen.c.
+	(main): Add test cases with incomplete characters.
+	* tests/test-mbspbrk.c (OR): New macro, copied from
+	tests/test-mbsnlen.c.
+	(main): Add test cases with incomplete characters.
+	* tests/test-mbsspn.c (OR): New macro, copied from
+	tests/test-mbsnlen.c.
+	(main): Add test cases with incomplete characters.
+
 2026-05-25  Bruno Haible  <[email protected]>
 
 	mbuiter: Implement multi-byte per encoding error (MEE) consistently.
diff --git a/lib/mbuiterf.h b/lib/mbuiterf.h
index f8cb0f9595..19761a88a4 100644
--- a/lib/mbuiterf.h
+++ b/lib/mbuiterf.h
@@ -94,6 +94,7 @@
 #include <wchar.h>
 
 #include "mbchar.h"
+#include "mbiter-aux.h"
 #include "strnlen1.h"
 
 _GL_INLINE_HEADER_BEGIN
@@ -118,6 +119,7 @@ struct mbuif_state
                            before and after every mbuiterf_next invocation.
                          */
   unsigned int cur_max; /* A cache of MB_CUR_MAX.  */
+  int is_utf8;          /* A cache of mbiter_is_utf8.  */
 };
 
 MBUITERF_INLINE mbchar_t
@@ -145,18 +147,23 @@ mbuiterf_next (struct mbuif_state *ps, const char *iter)
       ps->in_shift = true;
     with_shift:;
       #endif
+      size_t avail_bytes = strnlen1 (iter, ps->cur_max);
       size_t bytes;
       char32_t wc;
-      bytes = mbrtoc32 (&wc, iter, strnlen1 (iter, ps->cur_max), &ps->state);
+      bytes = mbrtoc32 (&wc, iter, avail_bytes, &ps->state);
       if (bytes == (size_t) -1)
         {
           /* An invalid multibyte sequence was encountered.  */
+          size_t ebytes =
+            (mbiter_is_utf8 (&ps->is_utf8)
+             ? mbiter_utf8_maximal_subpart (iter, avail_bytes)
+             : 1);
           /* Allow the next invocation to continue from a sane state.  */
           #if !GNULIB_MBRTOC32_REGULAR
           ps->in_shift = false;
           #endif
           mbszero (&ps->state);
-          return (mbchar_t) { .ptr = iter, .bytes = 1, .wc_valid = false };
+          return (mbchar_t) { .ptr = iter, .bytes = ebytes, .wc_valid = false };
         }
       else if (bytes == (size_t) -2)
         {
@@ -197,12 +204,12 @@ typedef struct mbuif_state mbuif_state_t;
 #if !GNULIB_MBRTOC32_REGULAR
 #define mbuif_init(st) \
   ((st).in_shift = false, mbszero (&(st).state), \
-   (st).cur_max = MB_CUR_MAX)
+   (st).cur_max = MB_CUR_MAX, (st).is_utf8 = -1)
 #else
 /* Optimized: no in_shift.  */
 #define mbuif_init(st) \
   (mbszero (&(st).state), \
-   (st).cur_max = MB_CUR_MAX)
+   (st).cur_max = MB_CUR_MAX, (st).is_utf8 = -1)
 #endif
 #if !GNULIB_MBRTOC32_REGULAR
 #define mbuif_avail(st, iter) ((st).in_shift || (*(iter) != '\0'))
diff --git a/modules/mbschr-tests b/modules/mbschr-tests
index ef26e73363..fb879f2baa 100644
--- a/modules/mbschr-tests
+++ b/modules/mbschr-tests
@@ -1,7 +1,11 @@
 Files:
-tests/test-mbschr.sh
-tests/test-mbschr.c
+tests/test-mbschr1.sh
+tests/test-mbschr1.c
+tests/test-mbschr2.sh
+tests/test-mbschr2.c
 tests/macros.h
+m4/locale-en.m4
+m4/locale-fr.m4
 m4/locale-zh.m4
 m4/codeset.m4
 
@@ -9,10 +13,16 @@ Depends-on:
 setlocale
 
 configure.ac:
+gt_LOCALE_EN_UTF8
+gt_LOCALE_FR_UTF8
 gt_LOCALE_ZH_CN
 
 Makefile.am:
-TESTS += test-mbschr.sh
-TESTS_ENVIRONMENT += LOCALE_ZH_CN='@LOCALE_ZH_CN@'
-check_PROGRAMS += test-mbschr
-test_mbschr_LDADD = $(LDADD) $(LIBUNISTRING) $(SETLOCALE_LIB) $(MBRTOWC_LIB) $(LIBC32CONV)
+TESTS += test-mbschr1.sh test-mbschr2.sh
+TESTS_ENVIRONMENT += \
+  LOCALE_EN_UTF8='@LOCALE_EN_UTF8@' \
+  LOCALE_FR_UTF8='@LOCALE_FR_UTF8@' \
+  LOCALE_ZH_CN='@LOCALE_ZH_CN@'
+check_PROGRAMS += test-mbschr1 test-mbschr2
+test_mbschr1_LDADD = $(LDADD) $(LIBUNISTRING) $(SETLOCALE_LIB) $(MBRTOWC_LIB) $(LIBC32CONV)
+test_mbschr2_LDADD = $(LDADD) $(LIBUNISTRING) $(SETLOCALE_LIB) $(MBRTOWC_LIB) $(LIBC32CONV)
diff --git a/modules/mbsrchr-tests b/modules/mbsrchr-tests
index dba1470789..07243ca86f 100644
--- a/modules/mbsrchr-tests
+++ b/modules/mbsrchr-tests
@@ -1,7 +1,11 @@
 Files:
-tests/test-mbsrchr.sh
-tests/test-mbsrchr.c
+tests/test-mbsrchr1.sh
+tests/test-mbsrchr1.c
+tests/test-mbsrchr2.sh
+tests/test-mbsrchr2.c
 tests/macros.h
+m4/locale-en.m4
+m4/locale-fr.m4
 m4/locale-zh.m4
 m4/codeset.m4
 
@@ -9,10 +13,16 @@ Depends-on:
 setlocale
 
 configure.ac:
+gt_LOCALE_EN_UTF8
+gt_LOCALE_FR_UTF8
 gt_LOCALE_ZH_CN
 
 Makefile.am:
-TESTS += test-mbsrchr.sh
-TESTS_ENVIRONMENT += LOCALE_ZH_CN='@LOCALE_ZH_CN@'
-check_PROGRAMS += test-mbsrchr
-test_mbsrchr_LDADD = $(LDADD) $(LIBUNISTRING) $(SETLOCALE_LIB) $(MBRTOWC_LIB) $(LIBC32CONV)
+TESTS += test-mbsrchr1.sh test-mbsrchr2.sh
+TESTS_ENVIRONMENT += \
+  LOCALE_EN_UTF8='@LOCALE_EN_UTF8@' \
+  LOCALE_FR_UTF8='@LOCALE_FR_UTF8@' \
+  LOCALE_ZH_CN='@LOCALE_ZH_CN@'
+check_PROGRAMS += test-mbsrchr1 test-mbsrchr2
+test_mbsrchr1_LDADD = $(LDADD) $(LIBUNISTRING) $(SETLOCALE_LIB) $(MBRTOWC_LIB) $(LIBC32CONV)
+test_mbsrchr2_LDADD = $(LDADD) $(LIBUNISTRING) $(SETLOCALE_LIB) $(MBRTOWC_LIB) $(LIBC32CONV)
diff --git a/modules/mbuiterf b/modules/mbuiterf
index e5e22f9d09..d93cc8fa73 100644
--- a/modules/mbuiterf
+++ b/modules/mbuiterf
@@ -13,6 +13,7 @@ mbchar
 mbrtoc32
 mbsinit
 mbszero
+mbiter-aux
 uchar-h
 bool
 strnlen1
diff --git a/tests/test-mbschr1.c b/tests/test-mbschr1.c
new file mode 100644
index 0000000000..1000491e7c
--- /dev/null
+++ b/tests/test-mbschr1.c
@@ -0,0 +1,107 @@
+/* Test of searching a string for a character in a UTF-8 locale.
+   Copyright (C) 2026 Free Software Foundation, Inc.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation, either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+/* Written by Bruno Haible <[email protected]>, 2026.  */
+
+#include <config.h>
+
+#include <string.h>
+
+#include <locale.h>
+
+#include "macros.h"
+
+/* The mcel-based implementation of mbsnlen behaves differently than the
+   original one.  Namely, for invalid/incomplete byte sequences:
+   Where we ideally should have multi-byte-per-encoding-error (MEE) behaviour
+   everywhere, mcel implements single-byte-per-encoding-error (SEE) behaviour.
+   See <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00131.html>,
+       <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00145.html>.
+   Therefore, here we have different expected results, depending on the
+   implementation.  */
+#if GNULIB_MCEL_PREFER
+# define OR(a,b) b
+#else
+# define OR(a,b) a
+#endif
+
+int
+main ()
+{
+  /* configure should already have checked that the locale is supported.  */
+  if (setlocale (LC_ALL, "") == NULL)
+    return 1;
+
+  /* Incomplete characters.  See
+     https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf
+     page 128 table 3-11.  */
+
+  /* "\341\200\240" = 0xE1 0x80 0xA0 = U+1020.  */
+  {
+    const char input[] = "\341\200";
+    const char *result = mbschr (input, '\341');
+    ASSERT (result == OR (NULL, input + 0));
+  }
+  {
+    const char input[] = "\341\200";
+    const char *result = mbschr (input, '\200');
+    ASSERT (result == OR (NULL, input + 1));
+  }
+  {
+    const char input[] = "\341\200\341";
+    const char *result = mbschr (input, '\341');
+    ASSERT (result == input + OR(2,0));
+  }
+
+  /* "\360\221\222\240" = 0xF0 0x91 0x92 0xA0 = U+114A0.  */
+  {
+    const char input[] = "\360\221\222";
+    const char *result = mbschr (input, '\360');
+    ASSERT (result == OR (NULL, input + 0));
+  }
+  {
+    const char input[] = "\360\221\222";
+    const char *result = mbschr (input, '\221');
+    ASSERT (result == OR (NULL, input + 1));
+  }
+  {
+    const char input[] = "\360\221\222";
+    const char *result = mbschr (input, '\222');
+    ASSERT (result == OR (NULL, input + 2));
+  }
+  {
+    const char input[] = "\360\221\222\360";
+    const char *result = mbschr (input, '\360');
+    ASSERT (result == input + OR(3,0));
+  }
+  {
+    const char input[] = "\360\221";
+    const char *result = mbschr (input, '\360');
+    ASSERT (result == OR (NULL, input + 0));
+  }
+  {
+    const char input[] = "\360\221";
+    const char *result = mbschr (input, '\221');
+    ASSERT (result == OR (NULL, input + 1));
+  }
+  {
+    const char input[] = "\360\221\360";
+    const char *result = mbschr (input, '\360');
+    ASSERT (result == input + OR(2,0));
+  }
+
+  return test_exit_status;
+}
diff --git a/tests/test-mbschr1.sh b/tests/test-mbschr1.sh
new file mode 100755
index 0000000000..48e258d63e
--- /dev/null
+++ b/tests/test-mbschr1.sh
@@ -0,0 +1,23 @@
+#!/bin/sh
+
+# Test whether a specific UTF-8 locale is installed.
+: "${LOCALE_EN_UTF8=en_US.UTF-8}"
+: "${LOCALE_FR_UTF8=fr_FR.UTF-8}"
+if test "$LOCALE_EN_UTF8" = none && test "$LOCALE_FR_UTF8" = none; then
+  if test -f /usr/bin/localedef; then
+    echo "Skipping test: no English or French Unicode locale is installed"
+  else
+    echo "Skipping test: no English or French Unicode locale is supported"
+  fi
+  exit 77
+fi
+
+# It's sufficient to test in one of the two locales.
+if test "$LOCALE_FR_UTF8" != none; then
+  testlocale="$LOCALE_FR_UTF8"
+else
+  testlocale="$LOCALE_EN_UTF8"
+fi
+
+LC_ALL="$testlocale" \
+${CHECKER} ./test-mbschr1${EXEEXT}
diff --git a/tests/test-mbschr.c b/tests/test-mbschr2.c
similarity index 96%
rename from tests/test-mbschr.c
rename to tests/test-mbschr2.c
index f7678eb41b..5eae208c97 100644
--- a/tests/test-mbschr.c
+++ b/tests/test-mbschr2.c
@@ -1,4 +1,4 @@
-/* Test of searching a string for a character.
+/* Test of searching a string for a character in a GB18030 locale.
    Copyright (C) 2007-2026 Free Software Foundation, Inc.
 
    This program is free software: you can redistribute it and/or modify
diff --git a/tests/test-mbschr.sh b/tests/test-mbschr2.sh
similarity index 90%
rename from tests/test-mbschr.sh
rename to tests/test-mbschr2.sh
index 7e62b3f08a..c75973c362 100755
--- a/tests/test-mbschr.sh
+++ b/tests/test-mbschr2.sh
@@ -12,4 +12,4 @@ if test $LOCALE_ZH_CN = none; then
 fi
 
 LC_ALL=$LOCALE_ZH_CN \
-${CHECKER} ./test-mbschr${EXEEXT}
+${CHECKER} ./test-mbschr2${EXEEXT}
diff --git a/tests/test-mbscspn.c b/tests/test-mbscspn.c
index 0fa513748f..5cc248711d 100644
--- a/tests/test-mbscspn.c
+++ b/tests/test-mbscspn.c
@@ -24,6 +24,20 @@
 
 #include "macros.h"
 
+/* The mcel-based implementation of mbsnlen behaves differently than the
+   original one.  Namely, for invalid/incomplete byte sequences:
+   Where we ideally should have multi-byte-per-encoding-error (MEE) behaviour
+   everywhere, mcel implements single-byte-per-encoding-error (SEE) behaviour.
+   See <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00131.html>,
+       <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00145.html>.
+   Therefore, here we have different expected results, depending on the
+   implementation.  */
+#if GNULIB_MCEL_PREFER
+# define OR(a,b) b
+#else
+# define OR(a,b) a
+#endif
+
 int
 main ()
 {
@@ -57,5 +71,69 @@ main ()
     ASSERT (mbscspn (input, "\303") == 14); /* invalid multibyte sequence */
   }
 
+  /* Incomplete characters.  See
+     https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf
+     page 128 table 3-11.  */
+
+  /* "\341\200\240" = 0xE1 0x80 0xA0 = U+1020.  */
+  {
+    const char input[] = "\341\200\240x\341\200y";
+    ASSERT (mbscspn (input, "\341\200") == 4);
+  }
+  {
+    const char input[] = "\341\200\240x\341\200";
+    ASSERT (mbscspn (input, "\341\200") == 4);
+  }
+  {
+    const char input[] = "\341\200\240x\341\200";
+    ASSERT (mbscspn (input, "\341") == OR(6,4));
+  }
+  {
+    const char input[] = "\341\200\240x\341y";
+    ASSERT (mbscspn (input, "\341") == 4);
+  }
+  {
+    const char input[] = "\341\200\240x\341";
+    ASSERT (mbscspn (input, "\341") == 4);
+  }
+
+  /* "\360\221\222\240" = 0xF0 0x91 0x92 0xA0 = U+114A0.  */
+  {
+    const char input[] = "\360\221\222\240x\360\221\222y";
+    ASSERT (mbscspn (input, "\360\221\222") == 5);
+  }
+  {
+    const char input[] = "\360\221\222\240x\360\221\222";
+    ASSERT (mbscspn (input, "\360\221\222") == 5);
+  }
+  {
+    const char input[] = "\360\221\222\240x\360\221\222";
+    ASSERT (mbscspn (input, "\360\221") == OR(8,5));
+  }
+  {
+    const char input[] = "\360\221\222\240x\360\221y";
+    ASSERT (mbscspn (input, "\360\221") == 5);
+  }
+  {
+    const char input[] = "\360\221\222\240x\360\221";
+    ASSERT (mbscspn (input, "\360\221") == 5);
+  }
+  {
+    const char input[] = "\360\221\222\240x\360\221\222";
+    ASSERT (mbscspn (input, "\360") == OR(8,5));
+  }
+  {
+    const char input[] = "\360\221\222\240x\360\221";
+    ASSERT (mbscspn (input, "\360") == OR(7,5));
+  }
+  {
+    const char input[] = "\360\221\222\240x\360y";
+    ASSERT (mbscspn (input, "\360") == 5);
+  }
+  {
+    const char input[] = "\360\221\222\240x\360";
+    ASSERT (mbscspn (input, "\360") == 5);
+  }
+
   return test_exit_status;
 }
diff --git a/tests/test-mbslen.c b/tests/test-mbslen.c
index b32a74a296..9cf8673579 100644
--- a/tests/test-mbslen.c
+++ b/tests/test-mbslen.c
@@ -24,6 +24,20 @@
 
 #include "macros.h"
 
+/* The mcel-based implementation of mbsnlen behaves differently than the
+   original one.  Namely, for invalid/incomplete byte sequences:
+   Where we ideally should have multi-byte-per-encoding-error (MEE) behaviour
+   everywhere, mcel implements single-byte-per-encoding-error (SEE) behaviour.
+   See <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00131.html>,
+       <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00145.html>.
+   Therefore, here we have different expected results, depending on the
+   implementation.  */
+#if GNULIB_MCEL_PREFER
+# define OR(a,b) b
+#else
+# define OR(a,b) a
+#endif
+
 int
 main ()
 {
@@ -39,9 +53,17 @@ main ()
   ASSERT (mbslen ("7\342\202\254") == 2); /* "7€" */
   ASSERT (mbslen ("\360\237\220\203") == 1); /* "🐃" */
 
-  ASSERT (mbslen ("\303") == 1); /* invalid multibyte sequence */
-  ASSERT (mbslen ("\342\202") == 2); /* 2x invalid multibyte sequence */
-  ASSERT (mbslen ("\360\237\220") == 3); /* 3x invalid multibyte sequence */
+  /* Incomplete characters.  See
+     https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf
+     page 128 table 3-11.  */
+  ASSERT (mbslen ("\303") == 1);
+  /* "\341\200\240" = 0xE1 0x80 0xA0 = U+1020.  */
+  ASSERT (mbslen ("\341\200") == OR(1,2));
+  ASSERT (mbslen ("\341") == 1);
+  /* "\360\221\222\240" = 0xF0 0x91 0x92 0xA0 = U+114A0.  */
+  ASSERT (mbslen ("\360\221\222") == OR(1,3));
+  ASSERT (mbslen ("\360\221") == OR(1,2));
+  ASSERT (mbslen ("\360") == 1);
 
   return test_exit_status;
 }
diff --git a/tests/test-mbspbrk.c b/tests/test-mbspbrk.c
index ce396eba18..a0f86d3652 100644
--- a/tests/test-mbspbrk.c
+++ b/tests/test-mbspbrk.c
@@ -24,6 +24,20 @@
 
 #include "macros.h"
 
+/* The mcel-based implementation of mbsnlen behaves differently than the
+   original one.  Namely, for invalid/incomplete byte sequences:
+   Where we ideally should have multi-byte-per-encoding-error (MEE) behaviour
+   everywhere, mcel implements single-byte-per-encoding-error (SEE) behaviour.
+   See <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00131.html>,
+       <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00145.html>.
+   Therefore, here we have different expected results, depending on the
+   implementation.  */
+#if GNULIB_MCEL_PREFER
+# define OR(a,b) b
+#else
+# define OR(a,b) a
+#endif
+
 int
 main ()
 {
@@ -51,5 +65,69 @@ main ()
     ASSERT (mbspbrk (input, "\303") == NULL); /* invalid multibyte sequence */
   }
 
+  /* Incomplete characters.  See
+     https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf
+     page 128 table 3-11.  */
+
+  /* "\341\200\240" = 0xE1 0x80 0xA0 = U+1020.  */
+  {
+    const char input[] = "\341\200\240x\341\200y";
+    ASSERT (mbspbrk (input, "\341\200") == input + 4);
+  }
+  {
+    const char input[] = "\341\200\240x\341\200";
+    ASSERT (mbspbrk (input, "\341\200") == input + 4);
+  }
+  {
+    const char input[] = "\341\200\240x\341\200";
+    ASSERT (mbspbrk (input, "\341") == OR (NULL, input + 4));
+  }
+  {
+    const char input[] = "\341\200\240x\341y";
+    ASSERT (mbspbrk (input, "\341") == input + 4);
+  }
+  {
+    const char input[] = "\341\200\240x\341";
+    ASSERT (mbspbrk (input, "\341") == input + 4);
+  }
+
+  /* "\360\221\222\240" = 0xF0 0x91 0x92 0xA0 = U+114A0.  */
+  {
+    const char input[] = "\360\221\222\240x\360\221\222y";
+    ASSERT (mbspbrk (input, "\360\221\222") == input + 5);
+  }
+  {
+    const char input[] = "\360\221\222\240x\360\221\222";
+    ASSERT (mbspbrk (input, "\360\221\222") == input + 5);
+  }
+  {
+    const char input[] = "\360\221\222\240x\360\221\222";
+    ASSERT (mbspbrk (input, "\360\221") == OR (NULL, input + 5));
+  }
+  {
+    const char input[] = "\360\221\222\240x\360\221y";
+    ASSERT (mbspbrk (input, "\360\221") == input + 5);
+  }
+  {
+    const char input[] = "\360\221\222\240x\360\221";
+    ASSERT (mbspbrk (input, "\360\221") == input + 5);
+  }
+  {
+    const char input[] = "\360\221\222\240x\360\221\222";
+    ASSERT (mbspbrk (input, "\360") == OR (NULL, input + 5));
+  }
+  {
+    const char input[] = "\360\221\222\240x\360\221";
+    ASSERT (mbspbrk (input, "\360") == OR (NULL, input + 5));
+  }
+  {
+    const char input[] = "\360\221\222\240x\360y";
+    ASSERT (mbspbrk (input, "\360") == input + 5);
+  }
+  {
+    const char input[] = "\360\221\222\240x\360";
+    ASSERT (mbspbrk (input, "\360") == input + 5);
+  }
+
   return test_exit_status;
 }
diff --git a/tests/test-mbsrchr1.c b/tests/test-mbsrchr1.c
new file mode 100644
index 0000000000..91c5e734f6
--- /dev/null
+++ b/tests/test-mbsrchr1.c
@@ -0,0 +1,107 @@
+/* Test of searching a string for the last occurrence of a character.
+   Copyright (C) 2026 Free Software Foundation, Inc.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation, either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+/* Written by Bruno Haible <[email protected]>, 2026.  */
+
+#include <config.h>
+
+#include <string.h>
+
+#include <locale.h>
+
+#include "macros.h"
+
+/* The mcel-based implementation of mbsnlen behaves differently than the
+   original one.  Namely, for invalid/incomplete byte sequences:
+   Where we ideally should have multi-byte-per-encoding-error (MEE) behaviour
+   everywhere, mcel implements single-byte-per-encoding-error (SEE) behaviour.
+   See <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00131.html>,
+       <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00145.html>.
+   Therefore, here we have different expected results, depending on the
+   implementation.  */
+#if GNULIB_MCEL_PREFER
+# define OR(a,b) b
+#else
+# define OR(a,b) a
+#endif
+
+int
+main ()
+{
+  /* configure should already have checked that the locale is supported.  */
+  if (setlocale (LC_ALL, "") == NULL)
+    return 1;
+
+  /* Incomplete characters.  See
+     https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf
+     page 128 table 3-11.  */
+
+  /* "\341\200\240" = 0xE1 0x80 0xA0 = U+1020.  */
+  {
+    const char input[] = "\341\200";
+    const char *result = mbsrchr (input, '\341');
+    ASSERT (result == OR (NULL, input + 0));
+  }
+  {
+    const char input[] = "\341\200";
+    const char *result = mbsrchr (input, '\200');
+    ASSERT (result == OR (NULL, input + 1));
+  }
+  {
+    const char input[] = "\341\200\341";
+    const char *result = mbsrchr (input, '\341');
+    ASSERT (result == input + 2);
+  }
+
+  /* "\360\221\222\240" = 0xF0 0x91 0x92 0xA0 = U+114A0.  */
+  {
+    const char input[] = "\360\221\222";
+    const char *result = mbsrchr (input, '\360');
+    ASSERT (result == OR (NULL, input + 0));
+  }
+  {
+    const char input[] = "\360\221\222";
+    const char *result = mbsrchr (input, '\221');
+    ASSERT (result == OR (NULL, input + 1));
+  }
+  {
+    const char input[] = "\360\221\222";
+    const char *result = mbsrchr (input, '\222');
+    ASSERT (result == OR (NULL, input + 2));
+  }
+  {
+    const char input[] = "\360\221\222\360";
+    const char *result = mbsrchr (input, '\360');
+    ASSERT (result == input + 3);
+  }
+  {
+    const char input[] = "\360\221";
+    const char *result = mbsrchr (input, '\360');
+    ASSERT (result == OR (NULL, input + 0));
+  }
+  {
+    const char input[] = "\360\221";
+    const char *result = mbsrchr (input, '\221');
+    ASSERT (result == OR (NULL, input + 1));
+  }
+  {
+    const char input[] = "\360\221\360";
+    const char *result = mbsrchr (input, '\360');
+    ASSERT (result == input + 2);
+  }
+
+  return test_exit_status;
+}
diff --git a/tests/test-mbsrchr1.sh b/tests/test-mbsrchr1.sh
new file mode 100755
index 0000000000..ce0d000437
--- /dev/null
+++ b/tests/test-mbsrchr1.sh
@@ -0,0 +1,23 @@
+#!/bin/sh
+
+# Test whether a specific UTF-8 locale is installed.
+: "${LOCALE_EN_UTF8=en_US.UTF-8}"
+: "${LOCALE_FR_UTF8=fr_FR.UTF-8}"
+if test "$LOCALE_EN_UTF8" = none && test "$LOCALE_FR_UTF8" = none; then
+  if test -f /usr/bin/localedef; then
+    echo "Skipping test: no English or French Unicode locale is installed"
+  else
+    echo "Skipping test: no English or French Unicode locale is supported"
+  fi
+  exit 77
+fi
+
+# It's sufficient to test in one of the two locales.
+if test "$LOCALE_FR_UTF8" != none; then
+  testlocale="$LOCALE_FR_UTF8"
+else
+  testlocale="$LOCALE_EN_UTF8"
+fi
+
+LC_ALL="$testlocale" \
+${CHECKER} ./test-mbsrchr1${EXEEXT}
diff --git a/tests/test-mbsrchr.c b/tests/test-mbsrchr2.c
similarity index 100%
rename from tests/test-mbsrchr.c
rename to tests/test-mbsrchr2.c
diff --git a/tests/test-mbsrchr.sh b/tests/test-mbsrchr2.sh
similarity index 90%
rename from tests/test-mbsrchr.sh
rename to tests/test-mbsrchr2.sh
index 84c40b7bf8..cce61decc6 100755
--- a/tests/test-mbsrchr.sh
+++ b/tests/test-mbsrchr2.sh
@@ -12,4 +12,4 @@ if test $LOCALE_ZH_CN = none; then
 fi
 
 LC_ALL=$LOCALE_ZH_CN \
-${CHECKER} ./test-mbsrchr${EXEEXT}
+${CHECKER} ./test-mbsrchr2${EXEEXT}
diff --git a/tests/test-mbsspn.c b/tests/test-mbsspn.c
index cce1d08dce..d2edeaa89a 100644
--- a/tests/test-mbsspn.c
+++ b/tests/test-mbsspn.c
@@ -24,6 +24,20 @@
 
 #include "macros.h"
 
+/* The mcel-based implementation of mbsspn behaves differently than the
+   original one.  Namely, for invalid/incomplete byte sequences:
+   Where we ideally should have multi-byte-per-encoding-error (MEE) behaviour
+   everywhere, mcel implements single-byte-per-encoding-error (SEE) behaviour.
+   See <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00131.html>,
+       <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00145.html>.
+   Therefore, here we have different expected results, depending on the
+   implementation.  */
+#if GNULIB_MCEL_PREFER
+# define OR(a,b) b
+#else
+# define OR(a,b) a
+#endif
+
 int
 main ()
 {
@@ -53,5 +67,49 @@ main ()
     ASSERT (mbsspn (input, "\303") == 0); /* invalid multibyte sequence */
   }
 
+  /* Incomplete characters.  See
+     https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf
+     page 128 table 3-11.  */
+
+  /* "\341\200\240" = 0xE1 0x80 0xA0 = U+1020.  */
+  {
+    const char input[] = "\341\200\341\200\240";
+    ASSERT (mbsspn (input, "\341\200") == 2);
+  }
+  {
+    const char input[] = "\341\200\341\200\240";
+    ASSERT (mbsspn (input, "\341") == OR(0,1));
+  }
+  {
+    const char input[] = "\341\341\200\240";
+    ASSERT (mbsspn (input, "\341") == 1);
+  }
+
+  /* "\360\221\222\240" = 0xF0 0x91 0x92 0xA0 = U+114A0.  */
+  {
+    const char input[] = "\360\221\222\360\221\222\240";
+    ASSERT (mbsspn (input, "\360\221\222") == 3);
+  }
+  {
+    const char input[] = "\360\221\222\360\221\222\240";
+    ASSERT (mbsspn (input, "\360\221") == OR(0,2));
+  }
+  {
+    const char input[] = "\360\221\360\221\222\240";
+    ASSERT (mbsspn (input, "\360\221") == 2);
+  }
+  {
+    const char input[] = "\360\221\222\360\221\222\240";
+    ASSERT (mbsspn (input, "\360") == OR(0,1));
+  }
+  {
+    const char input[] = "\360\221\360\221\222\240";
+    ASSERT (mbsspn (input, "\360") == OR(0,1));
+  }
+  {
+    const char input[] = "\360\360\221\222\240";
+    ASSERT (mbsspn (input, "\360") == 1);
+  }
+
   return test_exit_status;
 }
-- 
2.54.0
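
For readers who want to see the MEE/SEE difference outside the test suite, here is a minimal standalone sketch. It is not the gnulib implementation: the names next_unit_len, count_units_mee and count_units_see are hypothetical, and the code assumes UTF-8 input, following the well-formedness ranges and the "maximal subpart" rule from chapter 3 of the Unicode standard. Under MEE, a maximal subpart of an ill-formed sequence counts as one unit; under SEE, every byte of it counts separately.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a gnulib function).  Returns the number of
   bytes in the next unit of the UTF-8 string S, of which N >= 1 bytes
   are available.  Sets *COMPLETE to 1 if the unit is a well-formed
   character, or to 0 if it is a maximal subpart of an ill-formed
   sequence (which is always at least 1 byte long).  */
size_t
next_unit_len (const unsigned char *s, size_t n, int *complete)
{
  assert (n > 0);
  unsigned char c = s[0];
  size_t need;                          /* length of a complete character */
  unsigned char lo = 0x80, hi = 0xBF;   /* allowed range of the 2nd byte */

  *complete = 0;
  if (c < 0x80)
    {
      *complete = 1;
      return 1;
    }
  else if (c >= 0xC2 && c <= 0xDF)
    need = 2;
  else if (c >= 0xE0 && c <= 0xEF)
    {
      need = 3;
      if (c == 0xE0) lo = 0xA0;         /* reject overlong forms */
      if (c == 0xED) hi = 0x9F;         /* reject surrogates */
    }
  else if (c >= 0xF0 && c <= 0xF4)
    {
      need = 4;
      if (c == 0xF0) lo = 0x90;         /* reject overlong forms */
      if (c == 0xF4) hi = 0x8F;         /* reject values > U+10FFFF */
    }
  else
    return 1;                           /* 0x80..0xC1, 0xF5..0xFF */

  size_t i = 1;
  while (i < need && i < n && s[i] >= lo && s[i] <= hi)
    {
      lo = 0x80;                        /* later bytes: plain 0x80..0xBF */
      hi = 0xBF;
      i++;
    }
  *complete = (i == need);
  return i;
}

/* Counts units with multi-byte-per-encoding-error (MEE) behaviour.  */
size_t
count_units_mee (const char *str, size_t n)
{
  const unsigned char *s = (const unsigned char *) str;
  size_t count = 0;
  while (n > 0)
    {
      int complete;
      size_t len = next_unit_len (s, n, &complete);
      s += len;
      n -= len;
      count++;
    }
  return count;
}

/* Counts units with single-byte-per-encoding-error (SEE) behaviour.  */
size_t
count_units_see (const char *str, size_t n)
{
  const unsigned char *s = (const unsigned char *) str;
  size_t count = 0;
  while (n > 0)
    {
      int complete;
      size_t len = next_unit_len (s, n, &complete);
      if (!complete)
        len = 1;                        /* an error consumes one byte only */
      s += len;
      n -= len;
      count++;
    }
  return count;
}
```

With this sketch, the 2-byte input "\341\200" counts as one unit under MEE and as two units under SEE, which is the kind of difference the OR (a, b) expectations in the tests above encode.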
