Re: Small patch to improve safety of utf8_to_unicode().

Chao Li Thu, 25 Jun 2026 21:39:43 -0700

> On Jun 26, 2026, at 01:38, Jeff Davis <[email protected]> wrote:
> 
> On Thu, 2026-06-25 at 12:10 +0800, Chao Li wrote:
>> This uses the second byte, \x20, without validating. So it looks like
>> the patch prevents reading past the end of the string, but it may not
>> fully defend against invalid UTF-8 sequences.
> 
> Correct. We don't do full UTF8 validation until the last patch in the
> series, which is not being backported.
> 

Sounds like 0001 will be back patched. In that case, the commit message "defend 
against invalid UTF8” seems too broad. Does it make sense to add some brief 
description about the defend behavior to the function header comment and the 
commit message?

Then I continue to review 0002-0005:

0002 - overall looks good. A small comment is:
```
+       for (int i = offset; i > 0;)

+       for (int i = offset + ulen; i < len;)
```

As offset is of type size_t, the loop variable i is better to be size_t.

0003 - looks good.

0004 - looks good. This commit introduces a new helper utf8decode() that will 
resolve my previous concern on 0001.

0005 - Mostly looks good. This commit applies the new help and my previous 
concern is resolved. But from what you talked, I guess 0004 and 0005 will only 
be pushed to HEAD.

Just one tiny comment on 0005:
```
+       /* invalid UTF8: surrogates */
+       needed = unicode_strfold(NULL, 0, "abc\xED\xA0\x81xyz", 7, &consumed, 
false);
+       Assert(needed == 3 && consumed == 3);
```

This test passes a 10-char string but uses 7 as srclen. I know that doesn’t 
affect the test result, but it just adds unnecessary confusion to readers. So 
maybe change 7 to 10 to reflect to the real string length.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
Re: Small patch to improve safety of utf8_to_unicode().

Reply via email to