regex matching from a position ?

Ph. Marek Tue, 11 Feb 2003 23:41:18 -0800

Hello everybody,

I sometimes have to analyse a string starting from a 
given position, where that position changes after each 
iteration (much like the offset argument of index()).
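
To make the task concrete, here is a minimal sketch of such a loop, built on the substr() variant from the benchmark further down. The helper name next_word_from and the sample string are made up for illustration; they are not part of the attached script.

```perl
use strict;
use warnings;

# Hypothetical helper: find the next word after offset $pos and
# return it together with the offset at which that word ends.
sub next_word_from {
    my ($text, $pos) = @_;
    return unless substr($text, $pos) =~ m/\S*\s+(\S+)/;
    # $+[1] is the end of capture group 1, relative to the substring
    return ($1, $pos + $+[1]);
}

my $text = "alpha beta gamma delta";
my $pos  = 0;
my @words;

# The position moves forward after every successful match.
while (my ($word, $next) = next_word_from($text, $pos)) {
    push @words, $word;
    $pos = $next;
}

print join(",", @words), "\n";
```

Every pass through the loop copies the tail of the string via substr(), which is the overhead the benchmark below tries to quantify.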



As this is Perl, there are MTOWTDIIP (more than one way to do 
it in Perl), but I'd like to know the fastest.

So I used Benchmark.pm to find that out. (script attached)


Excerpt from script:
  "from_start"  => sub { m/\S*\s+(\S+)/; },
  "re_dyn"  => sub { m/^[\x00-\xff]{$pos}\S*\s+(\S+)/; },
  "re_once" => sub { m/^[\x00-\xff]{$pos}\S*\s+(\S+)/o; },
  "substr" => sub { substr($_,$pos) =~ m/\S*\s+(\S+)/; },
  "substr_set" => sub { $tmp=substr($_,$pos); $tmp =~ m/\S*\s+(\S+)/; },

from_start is included for comparison only: it shows the speed the match should have.
re_once is for comparison too, because with /o the index can't be adjusted afterwards
(and dynamically recompiling via eval() for each new index can't be fast enough).
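
Why re_once cannot follow a moving index can be seen in a small sketch. It assumes the documented behaviour of /o (the interpolated pattern is compiled on first execution and reused afterwards); the sample string and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "alpha beta gamma";
my (@with_o, @without_o);

for my $pos (0, 6) {
    # /o: $pos is interpolated on the first execution only,
    # so later changes to $pos are ignored
    push @with_o, $1 if $text =~ m/^.{$pos}(\S+)/o;

    # no /o: the pattern is rebuilt whenever $pos changes
    push @without_o, $1 if $text =~ m/^.{$pos}(\S+)/;
}

print "with /o:    @with_o\n";
print "without /o: @without_o\n";
```

The /o variant should keep matching at offset 0 on both passes, while the variant without /o picks up the word at offset 6 on the second pass.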


Results:

2505792 bytes to do ...
Benchmark: timing 1000000 iterations of from_start, re_dyn, re_once, substr, substr_set...
from_start:  1 wallclock secs ( 1.26 usr + -0.01 sys =  1.25 CPU) @ 800000.00/s (n=1000000)
    re_dyn:  9 wallclock secs ( 6.52 usr +  0.00 sys =  6.52 CPU) @ 153374.23/s (n=1000000)
   re_once:  1 wallclock secs ( 1.26 usr +  0.01 sys =  1.27 CPU) @ 787401.57/s (n=1000000)
    substr:  4 wallclock secs ( 2.36 usr +  0.02 sys =  2.38 CPU) @ 420168.07/s (n=1000000)
substr_set:  5 wallclock secs ( 3.23 usr +  0.00 sys =  3.23 CPU) @ 309597.52/s (n=1000000)
               Rate     re_dyn substr_set     substr    re_once from_start
re_dyn     153374/s         --       -50%       -63%       -81%       -81%
substr_set 309598/s       102%         --       -26%       -61%       -61%
substr     420168/s       174%        36%         --       -47%       -47%
re_once    787402/s       413%       154%        87%         --        -2%
from_start 800000/s       422%       158%        90%         2%         --


So every available option is *much* slower than necessary!
I therefore propose (I know that I'm a bit late, but who cares ... :-) 
a new modifier for regexes (alongside ones like each, case-insensitive 
and match-multiple-times) that specifies the position at which 
matching starts. That should add *no* overhead!
E.g.:
        $text.m:from500:i /\s*(\S+)/;


Currently substr() is the fastest option available, unless somebody
has more imagination than I do (which I take as given).
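
One further candidate that is not part of the benchmark above is Perl 5's pos() function combined with the \G anchor and the /g modifier. The sketch below only shows the mechanics; the helper name match_from and the sample string are made up, and whether this beats substr() would have to be measured.

```perl
use strict;
use warnings;

# Hypothetical helper: match starting exactly at offset $pos.
sub match_from {
    my ($text, $pos) = @_;
    pos($text) = $pos;          # where the next /g match begins
    # \G anchors the match at pos($text); /g in scalar context
    # makes the regex engine honour and update that position
    return unless $text =~ m/\G\S*\s+(\S+)/g;
    return ($1, pos($text));    # captured word and new offset
}

my ($word, $next) = match_from("alpha beta gamma delta", 10);
print "$word $next\n";
```

For timing, the match would be applied directly to $_ inside the benchmark sub rather than to a copy passed into a helper, since copying the string would distort the result.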

So: is there a faster way, is this a non-issue for Perl 6, 
or will something like this be implemented?



Regards,

Phil

#!/usr/bin/perl

use Benchmark qw(cmpthese);


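$pos=500;                      # offset at which matching should start
$runs=1000000;                 # iterations per candidate
$_=`cat /etc/* 2> /dev/null`;  # test data: all readable files in /etc
study $_;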

print length($_), " bytes to do ...\n";

cmpthese($runs,
{
  "from_start"  => sub { m/\S*\s+(\S+)/; },
  "re_dyn"  => sub { m/^[\x00-\xff]{$pos}\S*\s+(\S+)/; },
  "re_once" => sub { m/^[\x00-\xff]{$pos}\S*\s+(\S+)/o; },
  "substr" => sub { substr($_,$pos) =~ m/\S*\s+(\S+)/; },
  "substr_set" => sub { $tmp=substr($_,$pos); $tmp =~ m/\S*\s+(\S+)/; },
}
);
