hi
OK, I've been reading over the responses you've been getting, and I just
have to ask everyone.

What's wrong with something like this:

        while (my $line = <>) {
            foreach my $name (@names) {
                # \Q...\E escapes any regex metacharacters in the name, and
                # \b keeps the whole-word matching of the original regex.
                $line =~ s/\b(\Q$name\E)\b/prefix_$1/g;
            }
            print $line;
        }

I know it seems kind of brute force, looping through the same line 6000
times, but that's essentially what you were doing within the regex.
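
One tweak that might help, assuming the names are plain literal strings
(and using the same @names array from the original post): compile the
6000 patterns once with qr// before the read loop, so the inner loop
isn't re-interpolating and recompiling a pattern for every name on
every line.  A rough sketch, not tested on your data:

        # Compile one whole-word pattern per name, once, up front.
        my @patterns = map { qr/\b\Q$_\E\b/ } @names;

        while (my $line = <>) {
            # Apply every name's precompiled pattern to the line.
            for my $i (0 .. $#names) {
                $line =~ s/$patterns[$i]/prefix_$names[$i]/g;
            }
            print $line;
        }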

My understanding is that regexes can do powerful and complex things,
but that power also makes them slow.  And since this task isn't
actually all that complex, most of the looping should be done in Perl
itself, not inside the regex.
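
Taking that a step further: if each name is a single \w+ "word" (I'm
assuming that, since they're described as alphanumeric strings with no
spaces or punctuation), you could skip the per-name loop entirely and
make one pass over each line, checking every word against a hash of the
names.  That's one hash lookup per word instead of 6000 substitutions
per line.  A rough sketch:

        # %is_name gives O(1) lookup for "is this word one of the 6000 names?"
        my %is_name = map { $_ => 1 } @names;

        while (my $line = <>) {
            # /e evaluates the replacement as Perl code, so each word is
            # either tagged with the prefix or left alone, in a single pass.
            $line =~ s/\b(\w+)\b/$is_name{$1} ? "prefix_$1" : $1/ge;
            print $line;
        }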

If I'm wrong or missing something, let me know.
thanks
marty


On Fri, 2011-02-04 at 18:43 -0500, Kripa Sundar wrote:
> Hi folks,
> 
> Problem:
> I have a 900 Meg text file, containing random text.  I also have a list
> of 6000 names (alphanumeric strings) that occur in the random text.
> I need to tag a prefix on to each occurrence of each of these 6000
> names.
> 
> My premise:
> I believe a regex would give the simplest and most efficient algorithm.
> If I am mistaken, I would be happy to learn.
> 
> Solution attempt:
> I built a large-but-simple regex, consisting of all the names in
> alternation.  I applied this regex to each input line.
> 
> My code:
> 
>   1: my @names = (...);  # my 6000 names.
>   2: my $regex = join "|", @names;
>   3: $regex = qr/\b($regex)\b/;
>   4: 
>   5: # Read the input, and write out to all the copies simultaneously.
>   6: while (<>) {
>   7:     s/$regex/prefix_$1/g;
>   8: }
> 
> Turnaround time:
> My seat-of-the-pants guess was that my code would run for 4-5 hours,
> on a 2.4GHz AMD Opteron CPU.
> 
> But I found that I was pushing through less than 1% of the input per
> hour.  So, my full run would have taken >100 hours.
> 
> I saw this poor throughput, and I thought sorting the names would help
> the Perl regex compiler produce more efficient code.  So I changed
> line 2 to:
> 
>   2: my $regex = join "|", sort @names;
> 
> That was a tiny fraction faster, but I still estimate that my run would
> have taken 100 hours or more.
> 
> Is there a simple efficient solution that I am overlooking?
> Is there any obvious inefficiency in my approach?
> 
> peace,          || Finding gifts that do not harm:
> --{kr.pA}       || http://www.dailygood.org/more.php?n=3159


