Hello, internals!

While working on a new function, mb_str_split
(https://wiki.php.net/rfc/mb_str_split), for the mbstring extension, I
noticed an opportunity to significantly improve the performance of the
mbfl library for the UTF-16 encoding.
Currently, variable-length encodings are generally processed byte by byte:

for (int i = 0; i < string_length; ++i) {
        .....
}

UTF-8 strings, by contrast, are processed using a precomputed table of
character lengths, indexed by the lead byte:

while (i < string_length) {
        int m = mbtab[*p];   /* byte length of the current character */
        i += m;
        p += m;              /* advance to the next lead byte */
        .....
}
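
To make the table approach concrete, here is a minimal, self-contained sketch of counting characters with a 256-entry lead-byte table. It is not the mbfl code: the names utf8_len_tab, utf8_len_tab_init and utf8_count_chars are hypothetical, and treating invalid lead bytes as length 1 is my own assumption so that the loop always advances.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte length of a UTF-8 sequence, from its lead byte.
 * Continuation bytes and invalid lead bytes are treated as length 1
 * (an assumption of this sketch) so that the scan always makes progress. */
static unsigned char utf8_len_from_lead(unsigned char b)
{
    if (b < 0x80) return 1;               /* ASCII */
    if (b >= 0xC2 && b <= 0xDF) return 2; /* 2-byte sequence */
    if (b >= 0xE0 && b <= 0xEF) return 3; /* 3-byte sequence */
    if (b >= 0xF0 && b <= 0xF4) return 4; /* 4-byte sequence */
    return 1;
}

/* 256-byte table, one entry per possible lead byte. */
static unsigned char utf8_len_tab[256];

static void utf8_len_tab_init(void)
{
    for (int b = 0; b < 256; ++b)
        utf8_len_tab[b] = utf8_len_from_lead((unsigned char)b);
}

/* Count characters by jumping from lead byte to lead byte.
 * utf8_len_tab_init() must have been called first. */
static size_t utf8_count_chars(const unsigned char *p, size_t len)
{
    assert(p != NULL || len == 0);
    size_t i = 0, count = 0;
    while (i < len) {
        i += utf8_len_tab[p[i]];
        ++count;
    }
    return count;
}
```

After calling utf8_len_tab_init(), utf8_count_chars() touches only one byte per character instead of every byte of the string.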

The same concept can be applied to the UTF-16 encoding, but the table
would be 65536 bytes in size, versus 256 bytes for the UTF-8 table.
Moreover, two tables would be needed: one for UTF-16 big-endian and one
for UTF-16 little-endian.
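
As a rough illustration of the UTF-16 idea, here is a sketch under my own assumptions, not the code from the pull request. It assumes each table is indexed by the first two bytes of a code unit in memory order, and that an entry is 4 for a high surrogate and 2 otherwise. All names (utf16be_len_tab, utf16le_len_tab, utf16_len_tabs_init, utf16_count_chars) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two 65536-byte tables, one per byte order. */
static unsigned char utf16be_len_tab[65536];
static unsigned char utf16le_len_tab[65536];

/* Byte length of the character that starts with this code unit:
 * 4 for a high (lead) surrogate, 2 for everything else. */
static unsigned char utf16_len_from_unit(uint16_t unit)
{
    return (unit >= 0xD800 && unit <= 0xDBFF) ? 4 : 2;
}

static void utf16_len_tabs_init(void)
{
    for (uint32_t idx = 0; idx < 65536; ++idx) {
        /* idx is (first byte << 8) | second byte, in memory order. */
        uint16_t be = (uint16_t)idx;                       /* big-endian value */
        uint16_t le = (uint16_t)((idx >> 8) | (idx << 8)); /* bytes swapped */
        utf16be_len_tab[idx] = utf16_len_from_unit(be);
        utf16le_len_tab[idx] = utf16_len_from_unit(le);
    }
}

/* Count characters using the table that matches the string's byte order.
 * A trailing odd byte is ignored; a truncated surrogate pair counts as
 * one character. utf16_len_tabs_init() must have been called first. */
static size_t utf16_count_chars(const unsigned char *p, size_t len,
                                const unsigned char *tab)
{
    assert(p != NULL || len == 0);
    size_t i = 0, count = 0;
    while (i + 1 < len) {
        size_t idx = ((size_t)p[i] << 8) | p[i + 1];
        i += tab[idx];
        ++count;
    }
    return count;
}
```

Because the byte order is baked into the table, the loop is identical for both encodings and the caller only picks which table to pass. The cost is the 2 x 64 KiB of table data mentioned above.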

My tests show a speedup of more than 2x.
The implementation of the proposed concept is here:

https://github.com/php/php-src/pull/3715/commits/d868059626290b7ba773b957045e08c3efb1d603#diff-22d593ced03b2cb94450d9f9990865c8R38

To do, or not to do: that is the question.
What do you think?

Regards,
Ruslan

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
