Hello everybody,
I sometimes have to analyse a string
starting from a given position, where this position
changes after each iteration (as with the position argument of index()).
As this is Perl there is MTOWTDIIP, but I'd like
to know the fastest one.
So I used Benchmark.pm to find that out. (script attached)
Excerpt from script:
"from_start" => sub { m/\S*\s+(\S+)/; },
"re_dyn" => sub { m/^[\x00-\xff]{$pos}\S*\s+(\S+)/; },
"re_once" => sub { m/^[\x00-\xff]{$pos}\S*\s+(\S+)/o; },
"substr" => sub { substr($_,$pos) =~ m/\S*\s+(\S+)/; },
"substr_set" => sub { $tmp=substr($_,$pos); $tmp =~ m/\S*\s+(\S+)/; },
from_start is for comparison only; it shows the speed that should be achievable.
re_once is for comparison too, as with /o the index can't be adjusted after the first compilation.
(And dynamically recompiling via eval() for changing indexes can't be fast enough.)
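To make clear what the variants in the excerpt have in common, here is a minimal sketch showing that the re_dyn style and the substr style capture the same word. The sample string and the offset are made up for illustration and are not part of the benchmark.

```perl
use strict;
use warnings;

my $text = "alpha beta gamma delta";   # made-up sample data
my $pos  = 8;                          # offset inside "beta"

# re_dyn style: skip $pos characters inside the regex itself
my ($via_regex) = $text =~ m/^[\x00-\xff]{$pos}\S*\s+(\S+)/;

# substr style: match against the tail of the string
my ($via_substr) = substr($text, $pos) =~ m/\S*\s+(\S+)/;

print "$via_regex $via_substr\n";      # prints "gamma gamma"
```

Both skip the rest of the current word and capture the next one; they differ only in how the first $pos characters are skipped.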
Results:
2505792 bytes to do ...
Benchmark: timing 1000000 iterations of from_start, re_dyn, re_once, substr, substr_set...
from_start: 1 wallclock secs ( 1.26 usr + -0.01 sys = 1.25 CPU) @ 800000.00/s (n=1000000)
re_dyn: 9 wallclock secs ( 6.52 usr + 0.00 sys = 6.52 CPU) @ 153374.23/s (n=1000000)
re_once: 1 wallclock secs ( 1.26 usr + 0.01 sys = 1.27 CPU) @ 787401.57/s (n=1000000)
substr: 4 wallclock secs ( 2.36 usr + 0.02 sys = 2.38 CPU) @ 420168.07/s (n=1000000)
substr_set: 5 wallclock secs ( 3.23 usr + 0.00 sys = 3.23 CPU) @ 309597.52/s (n=1000000)
               Rate re_dyn substr_set substr re_once from_start
re_dyn     153374/s     --       -50%   -63%    -81%       -81%
substr_set 309598/s   102%         --   -26%    -61%       -61%
substr     420168/s   174%        36%     --    -47%       -47%
re_once    787402/s   413%       154%    87%      --        -2%
from_start 800000/s   422%       158%    90%      2%         --
So: every variant that can start at a changing position is *much* slower than necessary (roughly 2 to 5 times slower than from_start)!
So I propose (I know that I'm a bit late, but who cares ... :-)
a new option for regexes (like each, case-insensitive,
and match-multiple-times) which allows specifying a
position at which to start matching. That should add *no* overhead!
e.g.:
$text.m:from500:i /\s*(\S+)/;
Currently substr() is the fastest available option, unless somebody
has more imagination than I do (which I take as given).
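One further candidate, sketched here under assumptions: Perl 5 lets you set the start offset of the next //g match with pos() and anchor the pattern there with \G. The helper name word_after and the sample string are hypothetical, and this variant is not part of the benchmark above, so its speed relative to substr() is untested here.

```perl
use strict;
use warnings;

# Hypothetical helper: return the word following the one at offset $pos.
# Takes a reference so that the (possibly large) string is not copied.
sub word_after {
    my ($text_ref, $pos) = @_;
    pos($$text_ref) = $pos;   # the next //g match starts at this offset
    # \G anchors the match at pos(); the ternary condition is evaluated
    # in scalar context, so //g performs a single match from there.
    return $$text_ref =~ m/\G\S*\s+(\S+)/g ? $1 : undef;
}

my $text = "alpha beta gamma delta";   # made-up sample data
print word_after(\$text, 8), "\n";     # prints "gamma"
```

The pattern itself stays constant, so nothing has to be recompiled when the offset changes; only pos() is updated between calls.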
So: is there a faster way, is this a non-issue in Perl 6,
or will something like this be implemented?
Regards,
Phil
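#!/usr/bin/perl
# Compare ways of starting a regex match at offset $pos within $_.
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $pos  = 500;       # offset at which matching should start
my $runs = 1000000;   # iterations per variant
my $tmp;              # scratch copy used by substr_set

# Test data: the concatenated contents of /etc (errors discarded).
$_ = `cat /etc/* 2> /dev/null`;
study $_;
print length($_), " bytes to do ...\n";

cmpthese($runs,
    {
        # baseline: match from the start of the string
        "from_start" => sub { m/\S*\s+(\S+)/; },
        # skip $pos characters inside the regex, interpolated on every call
        "re_dyn" => sub { m/^[\x00-\xff]{$pos}\S*\s+(\S+)/; },
        # same, but /o compiles the pattern only once, freezing $pos
        "re_once" => sub { m/^[\x00-\xff]{$pos}\S*\s+(\S+)/o; },
        # match directly against the tail returned by substr()
        "substr" => sub { substr($_,$pos) =~ m/\S*\s+(\S+)/; },
        # copy the tail into a variable first, then match
        "substr_set" => sub { $tmp=substr($_,$pos); $tmp =~ m/\S*\s+(\S+)/; },
    }
);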