2023年12月4日(月) 22:25 Robert Landers <landers.rob...@gmail.com>: > > On Mon, Dec 4, 2023 at 1:51 PM Stefan Schiller via internals > <internals@lists.php.net> wrote: > > > > On Sat, Dec 2, 2023 at 6:13 AM Alex <alexinbeij...@gmail.com> wrote: > > > > > > Dear Stefan, and Dear Gina, > > > > > > Thanks for the message. Yes, Stefan has rediscovered an interesting quirk > > > of the mbstring library. I have been aware of this for a long time, and > > > other mbstring developers have too. It dates back to the origin of the > > > library; actually, even before the origin of mbstring, since mbstring was > > > based on another library called libmbfl, and this behavior originates > > > from libmbfl. > > > > > > Pull your chair up around the fire and let me tell you the tale of > > > libmbfl. Once upon a time, there was a text-processing library called > > > libmbfl. libmbfl was based on a collection of text-decoding routines > > > (which converted bytes to codepoints) and text-encoding routines (which > > > converted codepoints to bytes). Each such routine was structured as a > > > stateful "filter". These filters could be assembled into "chains", > > > whereby the output values generated by one routine would automatically be > > > passed to the next. libmbfl could perform many wonderful text-processing > > > tasks by substituting a different final filter at the end of the chain. > > > > > > But all was not well. Since libmbfl's filters processed text only one > > > byte or codepoint at a time, and each routine had to save its state > > > before returning, and restore its state upon entry, libmbfl was slow. > > > Slow as a turtle, slow as a snail, slow as > > > whatever-slowly-moving-thing-you-can-think-of. Oh, what was libmbfl to > > > do? A clever plan was hatched: give libmbfl a 256-entry table called a > > > "mblen_table" for each supported text encoding with the property that the > > > byte length of a character can be determined from its first byte. Then, > > > text-processing tasks which were not dependent on the actual content of a > > > string, but only on the number of codepoints, could be performed without > > > ever invoking those wonderful, but painfully slow filters! libmbfl could > > > skip through a string while just examining the first byte of each > > > character. (Of course, this only worked for text encodings with an > > > mblen_table.) For valid strings, the new method worked identically to the > > > previous one. For invalid strings, there were significant differences in > > > behavior, but libmbfl tried to ignore these and bravely pressed on. > > > > > > The story ends with an ironic twist. Many years later, I became > > > interested in mbstring and reimplemented its internals, replacing the > > > libmbfl code with fresh new code which ran many times faster. The new > > > code was so much faster that in some cases, the mblen_table optimization > > > actually became a pessimization! In other cases, the mblen_table-based > > > code is still faster, but not by a large amount. But now mbstring was > > > haunted by the spectre of Hyrum's Law (https://www.hyrumslaw.com/). With > > > a huge body of legacy code relying on mbstring, almost any observable > > > behavior change runs a significant risk of breaking someone's code. And > > > when this happens, they will not hesitate to vent their rage on the > > > hapless maintainers. > > > > > > Notwithstanding the rage of the users, about a year ago, I did remove the > > > mblen_table-based code in one place where benchmarks clearly showed it > > > was acting as a pessimization. I don't remember which mbstring function > > > was affected and would need to check the commit log to confirm. > > > > Hi Alex, > > > > Thank you very much for sharing this background context. > > > > > > > > Personally, I think the real issue here is not the inconsistency between > > > mbstring functions which are based on the mblen_tables and those which > > > are not. I think a lot of mbstring operations should not be used on > > > invalid strings at all, and that for such operations, mbstring would do > > > well to throw an exception if it receives invalid input. (Like mb_strpos; > > > how do you define the "position of a UTF-8 substring" when the parent > > > string is not UTF-8 at all?) But that would be a huge BC break. > > > > > > > My biggest concern is that this quirk can cause security issues in > > user code. I came across this in the first place when discovering an > > exploitable security vulnerability in an application. From my point of > > view, this is not only about inconsistent behavior but also violates > > the documentation for specific functions like mb_strstr. I agree that > > a lot of mbstring operations should not be used on invalid strings, > > and an exception seems to be an appropriate answer despite the huge BC > > impact. > > I think it is only a security issue when people accidentally think > mb_* functions should be used if it is available. I've seen people do > mb_strlen() on binary data, for example, not realizing the differences > between mb_strlen and strlen. Or using mb_* functions and then passing > them off to cryptographic functions. > > Robert Landers > Software Engineer > Utrecht NL > > -- > PHP Internals - PHP Runtime Development Mailing List > To unsubscribe, visit: https://www.php.net/unsub.php >
Hi, Internals. Sorry if I'm off topic. I don't know if it will be helpful, Japanese mbstring user if use these mb_* functions, we use mb_check_encoding. If character encoding is invalid, then occur error. ``` <?php if (mb_check_encoding($string, $encoding) !== true)) { // throw Exception or exit(not zero) } mb_strlen($string); // do something ``` Anyway, I agree if invalid encoding then mbstring is raise exception. Because not only people who are familiar with character codes. However, this is BC break, I think need more discussion. Regards Yuya -- --------------------------- Yuya Hamada (tekimen) - https://tekitoh-memdhoi.info - https://github.com/youkidearitai ----------------------------- -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: https://www.php.net/unsub.php