On Wed, Oct 02, 2002 at 10:39:17AM +0300, Markus Laire wrote:
> On 1 Oct 2002 at 18:47, [EMAIL PROTECTED] wrote:
>
> > > > all text up to, but not including the string "union".
> > >
> > > rule getstuffbeforeunion { (.*?) union | (.*) }
> > >
> > > "a union" => "a "
> > > "b" => "b"
> >
> > hmm... well, it works, but it's not very efficient. It basically
> > scans the whole string to the end to see if there is a "union" string, and
> > then backtracks to take the alternative. And hence, it's not very scalable.
> > It also doesn't 'complexify' very well.
>
> What about
>
> Perl 5: /(.*?)(?:union|$)/
> Perl 6: /(.*?) [union | $$]/
>
> or if you want to exclude 'union' from the match
>
> Perl 5: /(.*?)(?=union|$)/
> Perl 6: /(.*?) [<before: union> | $$]/
>
That's exceedingly slow, at least by my benchmark. So far, I've got four
possibilities:
    use Benchmark qw(timethese);

    # $line holds the input string being scanned (defined elsewhere).
    my $regex1 = qr{(?:(?!union).)*}sx;
    my $regex2 = qr{(?:[^u]+|u[^n]|un[^i]|uni[^o]|unio[^n])*}sx;
    my $regex3 = qr{(?:[^u]+|(?!union).)*}sx;
    my $regex4 = qr{(.*?)(?=union|$)}sx;

    timethese
    (
        100000,
        {
            'questionbang'  => sub { ($line =~ m"($regex1)"); },
            'questionbang2' => sub { ($line =~ m"($regex3)"); },
            'alternation'   => sub { ($line =~ m"($regex2)"); },
            'nongreedy'     => sub { ($line =~ m"($regex4)"); },
        }
    );
which comes out as:
alternation:    8 wallclock secs ( 7.71 usr + 0.00 sys =  7.71 CPU) @ 12970.17/s (n=100000)
questionbang:  17 wallclock secs (16.05 usr + 0.00 sys = 16.05 CPU) @ 6230.53/s (n=100000)
questionbang2:  8 wallclock secs ( 7.74 usr + 0.00 sys =  7.74 CPU) @ 12919.90/s (n=100000)
nongreedy:     41 wallclock secs (41.74 usr + 0.00 sys = 41.74 CPU) @ 2395.78/s (n=100000)
So yes, a form can be constructed out of ?! which is of approximately equal
speed to the alternation.
However, in straight C, the corresponding time is:
2.31u 0.02s 0:02.37 98.3%
which tells me that a lot of optimisation could be had from a generic
mechanism for (non)matching multi-byte character classes. The problem has
to be dealt with anyway when considering Unicode... And which form would people
rather type:
[<-[u]>+ | <!before union> .]*
or
<-['union']>*
I'd say the second scores over the first in intuition, if nothing else...
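To make the comparison concrete, here is a minimal C sketch of what the second form would have to compute: the length of the leading run of text in which "union" never starts. It is not the C program timed above (that one isn't shown), and the names span_not_word and span_not_word_strstr are made up for illustration. The first function spells out the position-by-position meaning, testing the cheap first-character condition before the full comparison; the second gets the same answer from the C library's strstr.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of leading characters of s that can be
 * consumed before word would begin at the current position. */
size_t span_not_word(const char *s, const char *word)
{
    size_t wlen, i = 0;

    assert(s != NULL && word != NULL);
    wlen = strlen(word);
    if (wlen == 0)
        return 0;               /* an empty word "begins" everywhere */

    while (s[i] != '\0') {
        /* Cheap test first: only a matching first character can start word. */
        if (s[i] == word[0] && strncmp(s + i, word, wlen) == 0)
            break;
        i++;
    }
    return i;
}

/* Same result, leaving the search itself to the C library. */
size_t span_not_word_strstr(const char *s, const char *word)
{
    const char *hit;

    assert(s != NULL && word != NULL);
    hit = strstr(s, word);
    return hit ? (size_t)(hit - s) : strlen(s);
}
```

For "a union" both return 2, and for "b" both return 1, matching the "a " and "b" results asked for at the top of the thread. On an input such as "uunion" they return 1, whereas the hand-written alternation ($regex2) consumes "uu" via u[^n] and then "nion" via [^u]+, running straight past the "union"; that pattern can miss a "union" that directly follows a 'u'.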
Ed