On 07/08/2021 18:57, Hans Henrik Bergan wrote:
> can someone shed some light on this? why does mb_check_encoding seem to be
> so much slower than the alternatives?
> benchmark code+results is here https://stackoverflow.com/a/68690757/1067003


Hi Hans,

Since you ran the test on PHP 7.4, the relevant implementation is here: https://heap.space/xref/PHP-7.4/ext/mbstring/mbstring.c?r=0cafd53d#php_mb_check_encoding_impl

As you can see, it takes a rather "brute force" approach: it runs the entire string through a conversion routine, and then checks (among other things) that the output is identical to the input. That means it always does work proportional to the full length of the string, with no optimization for returning false early, even when the very first byte is invalid.
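As a rough illustration of the "convert, then compare" idea (a simplified sketch, not the actual mbstring code), the check below copies the whole input through a converter that substitutes '?' for each bad byte, and only then compares output with input. The names seq_len and roundtrip_is_valid are hypothetical, and the sketch deliberately ignores overlong forms and surrogates to stay short.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: length of the UTF-8 sequence starting at s, or 0 if
 * the bytes there are not well-formed. Simplified: overlong forms and
 * surrogates are NOT rejected here. */
static size_t seq_len(const unsigned char *s, size_t avail)
{
    size_t n;
    if (s[0] < 0x80) return 1;
    else if ((s[0] & 0xE0) == 0xC0) n = 2;
    else if ((s[0] & 0xF0) == 0xE0) n = 3;
    else if ((s[0] & 0xF8) == 0xF0) n = 4;
    else return 0;                       /* invalid lead byte */
    if (n > avail) return 0;             /* truncated sequence */
    for (size_t k = 1; k < n; k++)
        if ((s[k] & 0xC0) != 0x80) return 0;
    return n;
}

/* "Convert and compare": every byte is processed and copied before any
 * verdict is reached, so a bad first byte costs as much as a bad last byte. */
static bool roundtrip_is_valid(const unsigned char *in, size_t len)
{
    unsigned char *out = malloc(len ? len : 1);
    if (out == NULL) return false;

    size_t i = 0, o = 0;
    while (i < len) {
        size_t n = seq_len(in + i, len - i);
        if (n == 0) {
            out[o++] = '?';              /* substitute and carry on */
            i++;
        } else {
            memcpy(out + o, in + i, n);  /* copy the valid sequence as-is */
            o += n;
            i += n;
        }
    }

    bool same = (o == len) && memcmp(out, in, len) == 0;
    free(out);
    return same;
}
```

The allocation, the copy, and the final comparison all happen regardless of where the first invalid byte sits.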

The good news is that Alex Dowad has been doing a lot of work to improve ext/mbstring recently, and landed a completely new implementation of mb_check_encoding a few months ago (https://github.com/php/php-src/commit/be1a2155), although it was then changed slightly by a later cleanup (https://github.com/php/php-src/commit/3e7acf90).

That was too late for PHP 8.0, so I compiled an up-to-date git checkout and ran your benchmark (with 100_000 iterations instead of 1_000_000; I guess my PC's a lot slower than yours!):

PHP 7.4:
mbstring: 57000 / 57100 / 56200
PCRE: 1500 / 1200 / 12400

PHP 8.1 beta:
mbstring: 35600 / 1200 / 36700
PCRE: 1400 / 1200 / 12100

So, mbstring now detects a failure at the start of the string as quickly as PCRE does, because the new algorithm has an early return, but is still slower than PCRE when it has to check the whole string.
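For comparison, here is a minimal sketch of a validator with an early return. It is not the code from php-src or PCRE2, just an illustration of the technique; the name utf8_is_valid is hypothetical. It scans once, allocates nothing, and stops at the first malformed sequence.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not taken from php-src or PCRE2.
 * Returns false as soon as the first malformed sequence is seen. */
static bool utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t need;        /* continuation bytes expected after the lead */
        unsigned long min;  /* smallest code point allowed for this length */
        unsigned long cp;   /* code point being assembled */

        if (c < 0x80) { i++; continue; }                 /* ASCII fast path */
        else if ((c & 0xE0) == 0xC0) { need = 1; min = 0x80;    cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { need = 2; min = 0x800;   cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { need = 3; min = 0x10000; cp = c & 0x07; }
        else return false;  /* stray continuation or invalid lead byte */

        if (len - i - 1 < need) return false;            /* truncated */

        for (size_t k = 1; k <= need; k++) {
            unsigned char t = s[i + k];
            if ((t & 0xC0) != 0x80) return false;        /* not a continuation */
            cp = (cp << 6) | (t & 0x3F);
        }

        if (cp < min) return false;                      /* overlong form */
        if (cp > 0x10FFFF) return false;                 /* beyond Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* surrogate */

        i += need + 1;
    }
    return true;
}
```

With this shape, a string whose first byte is invalid is rejected after a single comparison, while a fully valid string still has to be scanned to the end.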

Looking at the PCRE source, I think the relevant code is this: https://vcs.pcre.org/pcre2/code/trunk/src/pcre2_valid_utf.c?view=markup

It has the advantage of only handling a handful of encodings, and only needing to do a few operations on them. The main problem ext/mbstring has is that it supports a lot of operations, on a lot of different encodings, so it's still reusing a general purpose "convert and filter" algorithm.


Regards,

--
Rowan Tommins
[IMSoP]

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
