On Fri, 1 Dec 2023 at 09:31, Stefan Schiller via internals <
internals@lists.php.net> wrote:

> Hi,
>
> I would like to draw attention to an inconsistency in how mbstring
> functions handle invalid multibyte sequences. When, for example,
> mb_strpos encounters a UTF-8 leading byte, it tries to parse the
> following continuation bytes until the full byte sequence is read. If
> an invalid byte is encountered, all previously read bytes are
> considered one character, and the parsing is started over again at the
> invalid byte. Let's consider the following example:
>
> mb_strpos("\xf0\x9fABCD", "B"); // int(2)
>
> The leading byte 0xf0 initiates a 4-byte UTF-8 sequence. The following
> byte (0x9f) is a valid continuation byte. The next byte (0x41) is not
> a valid continuation byte. Thus, 0xf0 and 0x9f are considered one
> character, and 0x41 is regarded as another character. Accordingly, the
> resulting index of "B" is 2.
>
> On the other hand, mb_substr, for example, simply skips over the
> number of bytes announced by a leading byte without checking whether
> they are valid continuation bytes. Let's consider the following
> example, which uses mb_substr to cut the first two characters from
> the string used in the previous example:
>
> mb_substr("\xf0\x9fABCD", 2); // string(1) "D"
>
> Again, the leading byte 0xf0 initiates a 4-byte UTF-8 sequence. This
> time, mb_substr just skips over the next three bytes and considers all
> 4 bytes one character. Next, it continues to process at byte 0x43
> ("C"), which is regarded as another character. Thus, the resulting
> string is "D".
>
> This inconsistency in handling invalid multibyte sequences not only
> exists between different functions but also affects single functions.
> Let's consider the following example, which uses mb_strstr to
> determine the first occurrence of the string "B" in the same string:
>
> mb_strstr("\xf0\x9fABCD", "B"); // string(1) "D"
>
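> The principle is the same, just within a single function call: the
> position of "B" is determined as in mb_strpos (index 2), and the
> string is then cut at that index as in mb_substr, which leaves "D".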
>
> This inconsistency may not only lead to unexpected behavior but can
> also have a security impact when the affected functions are used to
> filter input.
>
>
> Best Regards,
> Stefan Schiller
>
> [1]: https://www.php.net/manual/en/function.mb-strpos.php
> [2]: https://www.php.net/manual/de/function.mb-substr.php
> [3]: https://www.php.net/manual/de/function.mb-strstr.php
>

This might have been better raised as a bug report, but in any case I am
CCing Alex, who is the main maintainer of the mbstring extension, so that
he is aware of this and can possibly provide some explanation.
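
For anyone who wants to experiment with this, the two decoding strategies
described in the report can be modelled in a few lines of plain PHP. This
is only a sketch based on the description in the quoted mail, not the
actual mbstring code. All function names below (seq_len, split_validating,
split_skipping, model_strpos, model_substr, model_strstr) are made up for
illustration, the needle is assumed to be a single character, and the way
model_strstr combines both strategies is an assumption inferred from the
report.

```php
<?php
// Illustrative model only; this is NOT the mbstring implementation.

// Sequence length announced by a UTF-8 leading byte (1 for any other byte).
// Simplification: every byte >= 0xF0 is treated as a 4-byte leader.
function seq_len(int $byte): int {
    if ($byte >= 0xF0) return 4;
    if ($byte >= 0xE0) return 3;
    if ($byte >= 0xC0) return 2;
    return 1;
}

// Strategy described for mb_strpos: read continuation bytes (10xxxxxx)
// until the sequence is complete or an invalid byte appears; the bytes
// read so far form one character and parsing restarts at the invalid byte.
function split_validating(string $s): array {
    $chars = [];
    $n = strlen($s);
    for ($i = 0; $i < $n; $i = $j) {
        $len = seq_len(ord($s[$i]));
        $j = $i + 1;
        while ($j < $n && $j < $i + $len && (ord($s[$j]) & 0xC0) === 0x80) {
            $j++;
        }
        $chars[] = substr($s, $i, $j - $i);
    }
    return $chars;
}

// Strategy described for mb_substr: trust the leading byte and skip the
// announced number of bytes without validating them.
function split_skipping(string $s): array {
    $chars = [];
    $n = strlen($s);
    for ($i = 0; $i < $n; $i += $len) {
        $len = seq_len(ord($s[$i]));
        $chars[] = substr($s, $i, $len);
    }
    return $chars;
}

function model_strpos(string $haystack, string $needle): int|false {
    return array_search($needle, split_validating($haystack), true);
}

function model_substr(string $s, int $start): string {
    return implode('', array_slice(split_skipping($s), $start));
}

// Assumption: position found with one strategy, cut made with the other.
function model_strstr(string $haystack, string $needle): string|false {
    $pos = model_strpos($haystack, $needle);
    return $pos === false ? false : model_substr($haystack, $pos);
}

$input = "\xf0\x9fABCD";
var_dump(model_strpos($input, "B")); // int(2)
var_dump(model_substr($input, 2));   // string(1) "D"
var_dump(model_strstr($input, "B")); // string(1) "D"
```

The model reproduces the three results from the report: the two splitters
disagree on where the first character ends, so an index computed with one
of them points at a different byte when it is used with the other.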

Best regards,

Gina P. Banyard
