Hello, internals.
As Rowan Collins suggested, I've replaced the lookup table with simple macros:
#define UTF16_LE_CODE_UNIT_IS_HIGH_SURROGATE ((code_unit & 0xFC00) == 0xD800)
#define UTF16_BE_CODE_UNIT_IS_HIGH_SURROGATE ((code_unit & 0x00FC) == 0x00D8)
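For illustration, here is a minimal C sketch of the same two checks written as functions. The function names are hypothetical and are not taken from the PR. It assumes the code unit has been loaded into a 16-bit integer, either in its natural byte order or with its two bytes swapped. Note that in C `==` binds tighter than `&`, so the mask needs its own parentheses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper names, used only for this illustration. */

/* Code unit in its natural order: high surrogates are 0xD800..0xDBFF,
 * i.e. the top six bits are 110110. */
static bool is_high_surrogate(uint16_t code_unit)
{
    return (code_unit & 0xFC00) == 0xD800;
}

/* Same test when the two bytes of the code unit were read swapped,
 * so the significant byte ends up in the low half. */
static bool is_high_surrogate_swapped(uint16_t code_unit)
{
    return (code_unit & 0x00FC) == 0x00D8;
}
```

For example, U+1F600 is encoded as the pair 0xD83D 0xDE00: the first unit passes `is_high_surrogate()`, the second does not.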
I reran the benchmarks. Here are the results:
String foobar was repeated 1000000 times. Result string size is 11.4mb
mb_str_split(): string was splitted by 50 into 120000 chunks 1 in 0.400670 s
mb_str_split_utf16(): string was splitted by 50 into 120000 chunks 1 in 0.038947 s
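The chunk count follows from the input: 1000000 repetitions of a 6-character string give 6000000 characters, and 50 characters per chunk gives 120000 chunks. As a rough sketch of how a UTF-16-specific splitter can count characters by looking only at high surrogates, here is a small C function. It is an illustration under stated assumptions (little-endian, well-formed input, hypothetical function name) and not the implementation from the PR.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of chunks obtained when a well-formed
 * UTF-16LE buffer is split into pieces of split_len characters.
 * Returns 0 when split_len is 0. */
static size_t utf16le_count_chunks(const unsigned char *buf, size_t len,
                                   size_t split_len)
{
    size_t pos = 0;
    size_t chars = 0;

    if (split_len == 0) {
        return 0;
    }
    while (pos + 1 < len) {
        /* Assemble one little-endian code unit. */
        uint16_t code_unit = (uint16_t)(buf[pos] | (buf[pos + 1] << 8));
        /* A high surrogate starts a 4-byte character; anything else is
         * treated as a 2-byte character. */
        pos += ((code_unit & 0xFC00) == 0xD800) ? 4 : 2;
        chars++;
    }
    /* Round up so a trailing partial chunk is counted. */
    return (chars + split_len - 1) / split_len;
}
```

Because every character is either one or two code units, the loop needs only one mask-and-compare per character and no lookup table.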
I have satisfied my research curiosity. The question is: does this have any
practical value? I'm interested in your opinions.
PHP benchmark code:
<?php
/**
 * Benchmark a function by calling it a given number of times.
 * bmark(int $rounds, string $function, mixed $arg [, mixed $... ] ): ?float
 */
function bmark(): ?float
{
    $args = func_get_args();
    $len = count($args);
    if ($len < 3) {
        trigger_error("At least 3 args expected. Only $len given.", E_USER_ERROR);
        return null;
    }
    $cnt = array_shift($args); // number of rounds
    $fun = array_shift($args); // function to benchmark
    $start = microtime(true);
    $i = 0;
    while ($i < $cnt) {
        ++$i;
        $res = call_user_func_array($fun, $args);
    }
    // Elapsed wall-clock time in seconds
    return microtime(true) - $start;
}
/* Convert a size in bytes to the most suitable unit of measurement. */
function convert($size) {
    if ($size == 0) {
        return 0;
    }
    $unit = array('b', 'kb', 'mb', 'gb', 'tb', 'pb');
    $i = (int)floor(log($size, 1024));
    return round($size / pow(1024, $i), 1) . $unit[$i];
}
$string = "foobar";
$utf16 = mb_convert_encoding($string, "UTF-16");
$k = 1e6;
$long = str_repeat($utf16, $k);
$size = convert(strlen($long));
$rounds = 1;
$split_length = 50;
echo "String $string was repeated $k times. Result string size is $size\n";
printf("mb_str_split(): string was splitted by %d into %d chunks %d in %f s\n"
    , $split_length
    , count(mb_str_split($long, $split_length, "UTF-16"))
    , $rounds
    , bmark($rounds, "mb_str_split", $long, $split_length, "UTF-16")
);
printf("mb_str_split_utf16(): string was splitted by %d into %d chunks %d in %f s\n"
    , $split_length
    , count(mb_str_split_utf16($long, $split_length, "UTF-16"))
    , $rounds
    , bmark($rounds, "mb_str_split_utf16", $long, $split_length, "UTF-16")
);
On Mon, 11 Feb 2019 at 18:00, Dan Ackroyd <[email protected]> wrote:
> On Sun, 10 Feb 2019 at 12:29, Legale Legage <[email protected]>
> wrote:
> >
> >
> >
> https://github.com/php/php-src/pull/3715/commits/d868059626290b7ba773b957045e08c3efb1d603#diff-22d593ced03b2cb94450d9f9990865c8R38
> >
> > To do, or not to do: that is the question.
> > What do you think?
>
> Opening separate pull requests for separate changes is good as it
> allows them to be discussed separately. That change is bundled with
> the mb_str_split() changes, so it's quite hard to see what is
> optimisation and what is part of the approved RFC.
>
> Although memory is cheap, the change appears to increase the static
> allocation of memory by 128KB for something that >95% of PHP
> programmers will never use, which is not a good idea.
>
> > show a more than 2 times speed increase.
>
> Lies, damn lies and statistics.
>
> If it takes the time to parse a megabyte string from 0.000002 to
> 0.000001, no one cares.
> If it takes the time to parse a megabyte string from 2 seconds to 1
> second, wow that's great!
>
> i.e. Saying a two times speed increase without context doesn't give
> people enough information to evaluate it.
>
> But this would be easier to discuss as a separate PR.
>
> cheers
> Dan
>