>>>>> "AY" == Alan Young <[EMAIL PROTECTED]> writes:
AY> Updated script at bottom.
AY> On 2/23/06, Uri Guttman <[EMAIL PROTECTED]> wrote:
AY> $text =~ s{(
AY> (\b\w+(?:['-]+\w+)*\b)
>>
>> why the multiple ['-] inside the words? could those chars ever begin or
>> end words? so just [\w'-]+ should be fine there.
AY> It's possible to have multi-hyphenated words. I didn't think it was
AY> worth the time to figure out how to handle that and single apostrophe
AY> words at the same time. Besides, I'm not verifying the accuracy of
AY> the text.
AY> In the spirit of testing though, I changed it to (\b[\w'-]*\b) and it
AY> took 40 seconds and found 's and ' as words where the original did
AY> not.
no wonder it took so long. you matched the null string between each pair
of word boundaries. you need a +, not * there.
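
A minimal sketch of the difference, assuming a made-up sample string (the variable names are illustrative, not from the original script). In list context, m//g returns every captured match, so the empty matches that * allows can be counted directly:

```perl
use strict;
use warnings;

# Hypothetical sample text; not from the original script.
my $text = "it's a well-known fact";

# With *, the class may match zero characters, so the pattern can
# succeed on the empty string at a word boundary.
my @star = $text =~ m/(\b[\w'-]*\b)/g;

# With +, every match must contain at least one character.
my @plus = $text =~ m/([\w'-]+)/g;

my $empty_star = grep { $_ eq '' } @star;
my $empty_plus = grep { $_ eq '' } @plus;

printf "star: %d matches, %d of them empty\n", scalar @star, $empty_star;
printf "plus: %d matches, %d of them empty\n", scalar @plus, $empty_plus;
```

On a large text the empty matches add up, which is a plausible reason for the extra run time reported above.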
AY> This is the way I understand it:
AY> (??{<code>}) replaces the regex at the current pos() with the result
AY> of the <code> block.
AY> If the match ($^N) was not in the hash, then it would auto-vivify
AY> the key and increment it and return (?!) which is a negative lookahead
AY> on nothing, which always fails so we force it to backtrack and try
AY> again.
AY> If the match ($^N) is in the hash, then it increments the value and
AY> returns (?=) which is a positive lookahead on nothing, which always
AY> succeeds so we continue on.
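
The two empty lookaheads described above can be checked in isolation. This is only a sketch on a hypothetical string "abc"; it leaves out the (??{...}) wrapper and the hash from the script:

```perl
use strict;
use warnings;

# (?!) is a negative lookahead on nothing. The empty pattern always
# matches, so its negation always fails and the overall match fails.
my $with_fail = ("abc" =~ /a(?!)/) ? 1 : 0;

# (?=) is a positive lookahead on nothing. It always succeeds and
# consumes no characters, so the rest of the pattern carries on.
my $with_pass = ("abc" =~ /a(?=)bc/) ? 1 : 0;

print "a(?!)   matched: $with_fail\n";
print "a(?=)bc matched: $with_pass\n";
```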
i understand the boolean thing as i said previously. i was asking why
you used it there. i see no reason if all you are doing is word
counting.
AY> Changing the regex to
AY> 1 while $text =~ m{(
AY> (\b\w+(?:['-]+\w+)*\b)
AY> (?{!$unique{$^N}++})
AY> )
AY> }xg;
AY> dropped the time down to 3s.
>> since you just replace the word by itself, why use s///? m// will get
>> the same results and should be much faster.
AY> There was no appreciable difference between the two types of regexes
AY> (see my code below).
try this:
$unique{$1}++ while $text =~ m/([\w'-]+)/g ;
use the benchmark module to compare the speeds. make sure you don't do
destructive parsing which some of your examples seem to do.
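
A rough sketch of such a comparison, using the core Benchmark module. The sample text, the sub names and the iteration count are assumptions chosen for illustration. Each sub works on its own copy of the text, so the s/// variant cannot alter the input seen by the other:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical input: a short phrase repeated many times.
my $text = join ' ', ("it's a well-known fact that words repeat") x 500;

# Count words with m//g in a while loop, as suggested above.
sub count_match {
    my %unique;
    my $copy = $text;    # private copy, so the input is never modified
    $unique{$1}++ while $copy =~ m/([\w'-]+)/g;
    return \%unique;
}

# Count words with s///ge, replacing each word by itself.
sub count_subst {
    my %unique;
    my $copy = $text;    # s/// rewrites $copy, not $text
    $copy =~ s/([\w'-]+)/$unique{$1}++; $1/ge;
    return \%unique;
}

# Run each variant 50 times and print a comparison table.
cmpthese(50, {
    match => \&count_match,
    subst => \&count_subst,
});
```

The numbers depend on the perl version and the input, so the table should be read against your own data, not taken as a general result.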
uri
--
Uri Guttman ------ [EMAIL PROTECTED] -------- http://www.stemsystems.com