On Fri, Dec 1, 2023 at 18:48, G. P. B. <george.bany...@gmail.com> wrote:
>
> On Fri, 1 Dec 2023 at 09:31, Stefan Schiller via internals <
> internals@lists.php.net> wrote:
>
> > Hi,
> >
> > I would like to raise attention to an inconsistency in how mbstring
> > functions handle invalid multibyte sequences. When, for example,
> > mb_strpos encounters a UTF-8 leading byte, it tries to parse the
> > following continuation bytes until the full byte sequence is read. If
> > an invalid byte is encountered, all previously read bytes are
> > considered one character, and the parsing is started over again at the
> > invalid byte. Let's consider the following example:
> >
> > mb_strpos("\xf0\x9fABCD", "B"); // int(2)
> >
> > The leading byte 0xf0 initiates a 4-byte UTF-8 sequence. The following
> > byte (0x9f) is a valid continuation byte. The next byte (0x41) is not
> > a valid continuation byte. Thus, 0xf0 and 0x9f are considered one
> > character, and 0x41 is regarded as another character. Accordingly, the
> > resulting index of "B" is 2.
> >
> > On the other hand, mb_substr, for example, simply skips over
> > continuation bytes when encountering a leading byte. Let's consider
> > the following example, which uses mb_substr to cut the first two
> > characters from the string used in the previous example:
> >
> > mb_substr("\xf0\x9fABCD", 2); // string(1) "D"
> >
> > Again, the leading byte 0xf0 initiates a 4-byte UTF-8 sequence. This
> > time, mb_substr just skips over the next three bytes and considers all
> > 4 bytes one character. Next, it continues to process at byte 0x43
> > ("C"), which is regarded as another character. Thus, the resulting
> > string is "D".
> >
> > This inconsistency in handling invalid multibyte sequences not only
> > exists between different functions but also affects single functions.
> > Let's consider the following example, which uses mb_strstr to
> > determine the first occurrence of the string "B" in the same string:
> >
> > mb_strstr("\xf0\x9fABCD", "B"); // string(1) "D"
> >
> > The principle is the same, just in a single function call.
> >
> > This inconsistency may not only lead to unexpected behavior but can
> > also have a security impact when the affected functions are used to
> > filter input.
> >
> >
> > Best Regards,
> > Stefan Schiller
> >
> > [1]: https://www.php.net/manual/en/function.mb-strpos.php
> > [2]: https://www.php.net/manual/de/function.mb-substr.php
> > [3]: https://www.php.net/manual/de/function.mb-strstr.php
> >
>
> This might have been better to raise as a bug, but in any case I am CCing
> Alex who's the main maintainer of the mbstring extension so he's aware of
> this and can possibly provide some explanations.
>
> Best regards,
>
> Gina P. Banyard
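
For readers following along, the two behaviours described in the quoted report can be modelled with a short sketch. This is only an illustration in Python, not mbstring's actual code: the helper names (utf8_expected_len, split_validating, split_skipping) are made up for this example, and the lead-byte ranges are the standard UTF-8 ones. It reproduces the reported results for "\xf0\x9fABCD" under the assumption that one code path validates continuation bytes while the other trusts the leading byte.

```python
# Illustrative model only; this is NOT mbstring's implementation.
# All function names here are hypothetical.

def utf8_expected_len(lead):
    """Sequence length implied by a UTF-8 leading byte (1 for any other byte)."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 1


def split_validating(data):
    """Strategy A: end the sequence at the first byte that is not a continuation byte."""
    chars = []
    i = 0
    while i < len(data):
        need = utf8_expected_len(data[i])
        j = i + 1
        # Consume continuation bytes (0x80..0xBF) only while they are valid.
        while j < len(data) and j < i + need and 0x80 <= data[j] <= 0xBF:
            j += 1
        chars.append(data[i:j])
        i = j
    return chars


def split_skipping(data):
    """Strategy B: trust the leading byte and skip ahead without checking."""
    chars = []
    i = 0
    while i < len(data):
        need = utf8_expected_len(data[i])
        chars.append(data[i:i + need])
        i += need
    return chars


data = b"\xf0\x9fABCD"

validated = split_validating(data)  # [b'\xf0\x9f', b'A', b'B', b'C', b'D']
skipped = split_skipping(data)      # [b'\xf0\x9fAB', b'C', b'D']

# Position found with strategy A, substring cut with strategy B:
pos = validated.index(b"B")         # 2, like the reported mb_strpos result
tail = b"".join(skipped[pos:])      # b'D', like the reported mb_substr result

print(pos, tail)
```

Mixing the two strategies (an index computed one way, a cut performed the other way) is what yields "D" instead of "BCD" in this model, which matches the mb_strstr case in the report.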
> > Hi,
> >
> > I would like to raise attention to an inconsistency in how mbstring
> > functions handle invalid multibyte sequences. When, for example,
> > mb_strpos encounters a UTF-8 leading byte, it tries to parse the
> > following continuation bytes until the full byte sequence is read. If
> > an invalid byte is encountered, all previously read bytes are
> > considered one character, and the parsing is started over again at the
> > invalid byte. Let's consider the following example:
> >
> > mb_strpos("\xf0\x9fABCD", "B"); // int(2)

Yes, that's true. This is because mb_strpos converts the string to UTF-8
internally, whereas other mbstring functions temporarily convert it to
UTF-32 and then convert it back to the original character encoding.

Anyway, I'll wait for Alex's reply.

Regards
Yuya

-- 
---------------------------
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- https://github.com/youkidearitai
-----------------------------