On Fri, 1 Dec 2023 at 09:31, Stefan Schiller via internals <internals@lists.php.net> wrote:
> Hi,
>
> I would like to raise attention to an inconsistency in how mbstring
> functions handle invalid multibyte sequences. When, for example,
> mb_strpos encounters a UTF-8 leading byte, it tries to parse the
> following continuation bytes until the full byte sequence is read. If
> an invalid byte is encountered, all previously read bytes are
> considered one character, and the parsing is started over again at the
> invalid byte. Let's consider the following example:
>
> mb_strpos("\xf0\x9fABCD", "B"); // int(2)
>
> The leading byte 0xf0 initiates a 4-byte UTF-8 sequence. The following
> byte (0x9f) is a valid continuation byte. The next byte (0x41) is not
> a valid continuation byte. Thus, 0xf0 and 0x9f are considered one
> character, and 0x41 is regarded as another character. Accordingly, the
> resulting index of "B" is 2.
>
> On the other hand, mb_substr, for example, simply skips over
> continuation bytes when encountering a leading byte. Let's consider
> the following example, which uses mb_substr to cut the first two
> characters from the string used in the previous example:
>
> mb_substr("\xf0\x9fABCD", 2); // string(1) "D"
>
> Again, the leading byte 0xf0 initiates a 4-byte UTF-8 sequence. This
> time, mb_substr just skips over the next three bytes and considers all
> 4 bytes one character. Next, it continues to process at byte 0x43
> ("C"), which is regarded as another character. Thus, the resulting
> string is "D".
>
> This inconsistency in handling invalid multibyte sequences not only
> exists between different functions but also affects single functions.
> Let's consider the following example, which uses mb_strstr to
> determine the first occurrence of the string "B" in the same string:
>
> mb_strstr("\xf0\x9fABCD", "B"); // string(1) "D"
>
> The principle is the same, just in a single function call.
>
> This inconsistency may not only lead to unexpected behavior but can
> also have a security impact when the affected functions are used to
> filter input.
>
>
> Best Regards,
> Stefan Schiller
>
> [1]: https://www.php.net/manual/en/function.mb-strpos.php
> [2]: https://www.php.net/manual/de/function.mb-substr.php
> [3]: https://www.php.net/manual/de/function.mb-strstr.php
>

This might have been better to raise as a bug, but in any case I am CCing
Alex, who's the main maintainer of the mbstring extension, so he's aware of
this and can possibly provide some explanations.

Best regards,

Gina P. Banyard
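
For anyone following along, the two decoding strategies described in the quoted message can be modelled at the byte level. The following is only a sketch in Python, not the mbstring implementation: the function names (expected_len, split_validating, split_skipping) are hypothetical, and the model assumes nothing beyond the behaviour described in the report above.

```python
def expected_len(lead):
    """Sequence length implied by a UTF-8 leading byte (1 for any other byte)."""
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 1


def split_validating(data):
    """Model of the behaviour reported for mb_strpos: consume continuation
    bytes one by one; at the first invalid byte, the bytes read so far count
    as one character and parsing restarts at the invalid byte."""
    chars, i = [], 0
    while i < len(data):
        end = i + expected_len(data[i])
        j = i + 1
        while j < len(data) and j < end and 0x80 <= data[j] <= 0xBF:
            j += 1
        chars.append(data[i:j])
        i = j
    return chars


def split_skipping(data):
    """Model of the behaviour reported for mb_substr: trust the leading byte
    and skip the implied number of bytes without checking them."""
    chars, i = [], 0
    while i < len(data):
        n = expected_len(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars


data = b"\xf0\x9fABCD"

# Validating split: [b'\xf0\x9f', b'A', b'B', b'C', b'D'], so "B" is at index 2.
pos = split_validating(data).index(b"B")

# Skipping split: [b'\xf0\x9fAB', b'C', b'D'], so dropping two characters leaves b'D'.
tail = b"".join(split_skipping(data)[2:])

print(pos, tail)
```

Combining the two, that is, finding the index with the validating split and cutting with the skipping split, reproduces the mb_strstr result from the report: the search finds "B" at index 2, but the cut at character 2 returns "D".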