I may have a different idea of OT or perhaps just a lower threshold (actually I think anything to do w/ perl is fun) but one thing led to another and I ended up contacting the RE man himself. I do not mean to post this as validation of my note (that appears moot at best and wrong (missed pg 278) for the uncharitable) but it is useful info, RE-wise.
a

Andy Bach, Sys. Mangler
Internet: [EMAIL PROTECTED]
VOICE: (608) 261-5738  FAX 264-5030

Disclaimer: [see: http://www.goldmark.org/jeff/stupid-disclaimers/ ]

----- Forwarded by Andy Bach/WIWB/07/USCOURTS on 01/25/02 12:48 PM -----

|> >> /[fF][oO][oO]/ better than /foo/i

The /i penalty had nothing to do with backtracking, but with making an
extra copy of the target string. If the target string was short, it
wasn't really much of an issue. It's described on p278. Once the book
came out, it (and many other bugs) were fixed. The 2nd edition won't
even mention it, except perhaps in passing as having been part of
history.

|> >I have to disagree; while I have read the RE/Owl (I hear there is
|> >another in the works?) I feel that
|> >
|> >/[fF][oO][oO]/ lacks clarity over /foo/i

If this check was being done repeatedly on a huge string, I'd probably
go with the lack of clarity for the speed, since (as p280 shows), the
penalty could have been multiple orders of magnitude. As for lack of
clarity, sure, but to save four orders of magnitude, I'd be willing to
add a comment :-)

With today's Perl, I'd never do something like this. I use /i and no
longer even think about it. I believe most of the work to clean these
things up was done by Ilya.

|> As a matter of speed, I find that /i is better than [] in some cases

This makes sense. If you compare the minimal amount of work needed to
check 'f' and 'F' against the current character, and to invoke the
whole character-class mechanism, one might expect the former to be
faster. Also, Boyer-Moore may be able to work in a case-insensitive
manner with /foo/i, but I very much doubt the regex engine will
recognize that it can do the same with /[fF][oO][oO]/, so I can imagine
cases where /foo/i is substantially faster.

|> In light of that, I wonder why /cj+a/i yields a scan for EXACTF "c" (that
|> is, a "c", case-folded), instead of EXACTF "cj" (and then looks for zero
|> or more j's).
|>
|> On the same string as before, I ran the following two regexes (the names
|> of the tests themselves). Here are the results!
|>
|> Benchmark: running cj+a, cjj*a for at least 5 CPU seconds...
|>       cj+a:  6 wallclock secs ( 5.20 usr +  0.03 sys =  5.23 CPU) @ 44552.01/s (n=233007)
|>      cjj*a: 11 wallclock secs ( 5.19 usr +  0.00 sys =  5.19 CPU) @ 68402.31/s (n=355008)
|>          Rate  cj+a cjj*a
|> cj+a  44552/s    --  -35%
|> cjj*a 68402/s   54%    --
|>
|> Here's a case where expanding X+ to XX* yields good results.

This is a case of one being optimized a bit better than others, and
knowing Jeff, he'll jump into bleedperl to fix it :-)

Feel free to repost this message, or parts of it, as you like.

        Jeffrey
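
For anyone who wants to try these comparisons on their own Perl, here is a minimal sketch using the core Benchmark module. It is not the script used in the thread: the target string is made up (the string from the original test is not shown), and the helper names all_agree and compare are hypothetical. Timings depend heavily on the Perl version, so no results are assumed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical target strings: long runs of non-matching text with the
# match only at the very end, so the engine has to scan to find it.
my $foo_target = ('abc ' x 1000) . 'FoO';
my $cj_target  = ('abc ' x 1000) . 'cjjja';

# Sanity check: both spellings in each pair should agree on whether a
# given string matches, otherwise the timing comparison is meaningless.
sub all_agree {
    my ($str) = @_;
    my @foo = map { $str =~ $_ ? 1 : 0 } qr/foo/i,  qr/[fF][oO][oO]/;
    my @cj  = map { $str =~ $_ ? 1 : 0 } qr/cj+a/i, qr/cjj*a/i;
    return $foo[0] == $foo[1] && $cj[0] == $cj[1];
}

# Time each pair for at least 1 CPU second per pattern and print a
# comparison table like the one quoted in the message.
sub compare {
    die "patterns disagree on the sample strings\n"
        unless all_agree($foo_target) && all_agree($cj_target);

    cmpthese(-1, {
        'foo/i'        => sub { $foo_target =~ /foo/i },
        '[fF][oO][oO]' => sub { $foo_target =~ /[fF][oO][oO]/ },
    });

    cmpthese(-1, {
        'cj+a'  => sub { $cj_target =~ /cj+a/i },
        'cjj*a' => sub { $cj_target =~ /cjj*a/i },
    });
}

# Run the benchmark only when this file is executed directly.
compare() unless caller;
```

To see which nodes the engine actually scans for (the EXACTF question above), the core re pragma can dump the compiled program, e.g. perl -Mre=debug -e '/cj+a/i'; the exact output format varies between Perl versions.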
