Characters in the U+100000..U+10FFFF range are encoded in UTF-8 with
0xF4 as the first byte, and _rl_utf8_mblen currently rejects them.
This range is plane 16 (Supplementary Private Use Area-B), so it is
not critical, but I think readline should accept these sequences since
they are still valid UTF-8.
Here is a test program that compares the result with u8_mblen from
GNU libunistring:
$ cat main.c
#define _GNU_SOURCE 1
#include <stdio.h>
#include <stdlib.h>
#include <unistr.h>
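static int
_rl_utf8_mblen (const char *s, size_t n)
{
  /* Copied from mbutil.c. */
}

int
main (void)
{
  /* U+100000, the first code point of the range. */
  const char str[] = "\xf4\x80\x80\x80";
  /* u8_mblen takes a const uint8_t *, so cast to avoid a warning. */
  printf ("%d %d\n", _rl_utf8_mblen (str, 4),
          u8_mblen ((const uint8_t *) str, 4));
  return 0;
}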
$ gcc main.c -lunistring
$ ./a.out
-1 4
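
To show why 0xF4 has to be included, here is a minimal sketch of the
lead-byte ranges from RFC 3629 together with a four-byte encoder. Both
function names are hypothetical helpers written for this message; they
are not readline or libunistring code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not readline's code: expected length of a UTF-8
   sequence given its first byte, or -1 if the byte cannot start one.
   Ranges follow RFC 3629.  */
int
utf8_len_from_lead (unsigned char c)
{
  if (c < 0x80)
    return 1;
  if (c >= 0xc2 && c <= 0xdf)
    return 2;
  if (c >= 0xe0 && c <= 0xef)
    return 3;
  if (c >= 0xf0 && c <= 0xf4)   /* Inclusive: 0xf4 starts U+100000..U+10FFFF.  */
    return 4;
  return -1;
}

/* Hypothetical helper: encode a code point in U+10000..U+10FFFF as four
   bytes.  Returns 4 on success, -1 if CP is outside that range.  */
int
utf8_encode4 (uint32_t cp, unsigned char out[4])
{
  if (cp < 0x10000 || cp > 0x10ffff)
    return -1;
  out[0] = 0xf0 | (cp >> 18);           /* Top 3 bits of the code point.  */
  out[1] = 0x80 | ((cp >> 12) & 0x3f);
  out[2] = 0x80 | ((cp >> 6) & 0x3f);
  out[3] = 0x80 | (cp & 0x3f);
  return 4;
}
```

With these, utf8_encode4 (0x100000, buf) produces F4 80 80 80 and
utf8_encode4 (0x10ffff, buf) produces F4 8F BF BF, so every code point
in the range has 0xF4 as its first byte. The sketch only classifies the
first byte; a full validator must also check that the second byte after
0xF4 is in 0x80..0x8F.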
Patch attached.
Collin
From e04af62ffd3964428398d6be7ff99c7976f7d559 Mon Sep 17 00:00:00 2001
Message-ID: <e04af62ffd3964428398d6be7ff99c7976f7d559.1751697599.git.collin.fu...@gmail.com>
From: Collin Funk <[email protected]>
Date: Fri, 4 Jul 2025 23:34:40 -0700
Subject: [PATCH] Accept UTF-8 strings that start with 0xF4 as the first byte.
* mbutil.c (_rl_utf8_mblen): Use '<=' instead of '<' when comparing the
first byte to 0xF4.
---
mbutil.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mbutil.c b/mbutil.c
index 5243fd7..3634c9b 100644
--- a/mbutil.c
+++ b/mbutil.c
@@ -121,7 +121,7 @@ _rl_utf8_mblen (const char *s, size_t n)
return 3;
}
}
- else if (c < 0xf4)
+ else if (c <= 0xf4)
{
if (n == 1)
return -2;
--
2.50.0