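The UTF-8 encodings of characters in the U+100000..U+10FFFF range begin
with the byte 0xF4, which _rl_utf8_mblen currently rejects.

This range is Supplementary Private Use Area-B (plane 16), so it is not
critical, but I think readline should accept these characters since they
are still valid UTF-8 sequences.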

Here is a test program that compares _rl_utf8_mblen against u8_mblen
from GNU libunistring:

    $ cat main.c 
    #define _GNU_SOURCE 1
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistr.h>
    
    static int
    _rl_utf8_mblen (const char *s, size_t n)
    {
       /* Copied from mbutil.c.  */
    }
    
    int
    main (void)
    {
      const char str[] = "\xf4\x80\x80\x80";
      printf ("%d %d\n", _rl_utf8_mblen (str, 4),
              u8_mblen ((const uint8_t *) str, 4));
      return 0;
    }
    $ gcc main.c -lunistring
    $ ./a.out 
    -1 4

Patch attached.

Collin

From e04af62ffd3964428398d6be7ff99c7976f7d559 Mon Sep 17 00:00:00 2001
Message-ID: <e04af62ffd3964428398d6be7ff99c7976f7d559.1751697599.git.collin.fu...@gmail.com>
From: Collin Funk <[email protected]>
Date: Fri, 4 Jul 2025 23:34:40 -0700
Subject: [PATCH] Accept UTF-8 strings that start with 0xF4 as the first byte.

* mbutil.c (_rl_utf8_mblen): Use '<=' instead of '<' when comparing the
first byte to 0xF4.
---
 mbutil.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mbutil.c b/mbutil.c
index 5243fd7..3634c9b 100644
--- a/mbutil.c
+++ b/mbutil.c
@@ -121,7 +121,7 @@ _rl_utf8_mblen (const char *s, size_t n)
 		return 3;
 	    }
 	}
-      else if (c < 0xf4)
+      else if (c <= 0xf4)
 	{
 	  if (n == 1)
 	    return -2;
-- 
2.50.0
