Re: Better REs

Andy_Bach Fri, 25 Jan 2002 10:41:19 -0800

I may have a different idea of OT or perhaps just a lower threshold 
(actually I think anything to do w/ perl is fun) but one thing led to 
another and I ended up contacting the RE man himself.  I do not mean to 
post this as validation of my note (that appears moot at best and wrong 
(missed pg 278) for the uncharitable) but it is useful info, RE-wise.


a

Andy Bach, Sys. Mangler
Internet: [EMAIL PROTECTED] 
VOICE: (608) 261-5738  FAX 264-5030

Disclaimer: [see: http://www.goldmark.org/jeff/stupid-disclaimers/ ]

----- Forwarded by Andy Bach/WIWB/07/USCOURTS on 01/25/02 12:48 PM -----

|> >> /[fF][oO][oO]/ better than /foo/i

The /i penalty had nothing to do with backtracking, but with making an
extra copy of the target string. If the target string was short, it wasn't
really much of an issue. It's described on p278.

Once the book came out, it (and many other bugs) were fixed. The 2nd
edition won't even mention it, except perhaps in passing as having been
part of history.

|> >I have to disagree; while I have read the RE/Owl (I hear there is
|> >another in the works?) I feel that
|> >
|> >/[fF][oO][oO]/ lacks clarity over /foo/i

If this check was being done repeatedly on a huge string, I'd probably go
with the lack of clarity for the speed, since (as p280 shows), the penalty
could have been multiple orders of magnitude. As for lack of clarity,
sure, but to save four orders of magnitude, I'd be willing to add a 
comment :-)

With today's Perl, I'd never do something like this. I use /i and no longer
even think about it.

I believe most of the work to clean these things up was done by Ilya.

|> As a matter of speed, I find that /i is better than [] in some cases

This makes sense. If you compare the minimal amount of work needed to check
'f' and 'F' against the current character with the work needed to invoke the
whole character-class mechanism, one might expect the former to be faster.
Also, Boyer-Moore may be able to work in a case-insensitive manner with
/foo/i, but I very much doubt the regex engine will recognize that it can do
the same with /[fF][oO][oO]/, so I can imagine cases where /foo/i is
substantially faster.
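
For anyone who wants to check this on their own perl, a minimal sketch using
the core Benchmark module might look like the following. The target string
and its length are made up for illustration, and the relative speeds will
depend on the Perl version, so no timings are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical target: a long run of filler with a single match at the end.
my $target = ('x' x 100_000) . 'FoO';

# Sanity check: both patterns should accept the same text.
my $ci_hit    = $target =~ /foo/i         ? 1 : 0;
my $class_hit = $target =~ /[fF][oO][oO]/ ? 1 : 0;

# Run each sub for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
    'foo/i'        => sub { my $hit = $target =~ /foo/i },
    '[fF][oO][oO]' => sub { my $hit = $target =~ /[fF][oO][oO]/ },
});
```

Moving the match to the front of the string, or lengthening the filler,
changes how much of the run time is spent scanning rather than matching.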


|> In light of that, I wonder why /cj+a/i yields a scan for EXACTF "c"
|> (that is, a "c", case-folded), instead of EXACTF "cj" (and then looks for
|> zero or more j's).
|> 
|> On the same string as before, I ran the following two regexes (the names
|> of the tests themselves).  Here are the results!
|> 
|>   Benchmark: running cj+a, cjj*a for at least 5 CPU seconds...
|>       cj+a:  6 wallclock secs ( 5.20 usr +  0.03 sys =  5.23 CPU) @
|>              44552.01/s (n=233007)
|>      cjj*a: 11 wallclock secs ( 5.19 usr +  0.00 sys =  5.19 CPU) @
|>              68402.31/s (n=355008)
|>            Rate  cj+a cjj*a
|>   cj+a  44552/s    --  -35%
|>   cjj*a 68402/s   54%    --
|> 
|> Here's a case where expanding X+ to XX* yields good results.

This is a case of one being optimized a bit better than others, and knowing
Jeff, he'll jump into bleedperl to fix it :-)
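
The quoted comparison can be rerun with a sketch along these lines. The
string used in the original benchmark is not shown in this thread, so $str
below is a hypothetical stand-in, and the numbers on a current perl may look
quite different from the ones quoted.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical input; the string from the quoted benchmark is not available.
my $str = ('abc' x 1_000) . 'CJJJA';

# X+ and XX* describe the same set of strings, so both capture the same text.
my ($plus) = $str =~ /(cj+a)/i;
my ($star) = $str =~ /(cjj*a)/i;

# Run each sub for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
    'cj+a'  => sub { my $hit = $str =~ /cj+a/i },
    'cjj*a' => sub { my $hit = $str =~ /cjj*a/i },
});

# To see which literal the optimizer scans for, dump the compiled program:
#   perl -Mre=debug -e '/cj+a/i'
```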

Feel free to repost this message, or parts of it, as you like.
      Jeffrey
