A few questions:

1- Are you closing <IN> anywhere?
2- Are you doing a foreach $keyword on every line of every file? That
*could* get slow.
3- Does it start out fast and get slower as it goes, or is it slow from the
start?
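
To make the questions above concrete, here is a minimal sketch, not a drop-in fix. The helper names (slurp_file, build_heading_re, find_headings) are made up for illustration, and the text is assumed to have had its HTML stripped already. The sketch closes the filehandle after each read, keeps each file's text in a fresh variable so nothing accumulates between files, and compiles all keywords into one regex so each file is scanned once. It also swaps the nested (\w+[A-Z])+ group for a plain [A-Z]+, since nested quantifiers like that can backtrack heavily, and it assumes a heading sits on a single line.

```perl
use strict;
use warnings;

# Hypothetical helper: read one file and close the handle right away.
# The text lives in a fresh lexical, so nothing carries over between files.
sub slurp_file {
    my ($path) = @_;
    open my $in, '<', $path or die "Cannot open $path: $!";
    local $/;                      # slurp mode, restored when the sub returns
    my $text = <$in>;
    close $in;
    return $text;
}

# Hypothetical helper: compile ONE regex that covers every keyword.
# quotemeta protects regex metacharacters (and spaces) inside a keyword.
# [^\S\n] means "whitespace other than newline", so a match stays on one line.
sub build_heading_re {
    my @keywords = @_;
    my $alternation = join '|', map { quotemeta } @keywords;
    return qr{
        ^ [^\S\n]*                   # optional leading blanks
        (                            # capture 1 is the whole heading
          (?: [A-Z]+ [^\S\n]+ )*     # zero or more leading words in caps
          (?: $alternation )         # any one of the keywords
          [^\S\n]+ AGREEMENT         # then the word AGREEMENT
        )
        [^\S\n]* $                   # optional trailing blanks
    }xm;
}

# Hypothetical helper: return every heading in the text that matches.
sub find_headings {
    my ($re, $text) = @_;
    my @found;
    while ($text =~ /$re/g) {
        push @found, $1;
    }
    return @found;
}

# Build the regex once, before looping over the files.
my $re   = build_heading_re('LICENSE', 'SECURITY');
my $text = "intro\nMASTER LICENSE AGREEMENT\nbody text\n  SECURITY AGREEMENT  \n";
print "$_\n" for find_headings($re, $text);
# prints:
# MASTER LICENSE AGREEMENT
# SECURITY AGREEMENT
```

In a real run the sample string would be replaced by slurp_file($path) for each file, followed by whatever HTML stripping you already do. The keywords shown are placeholders, not taken from your list.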



> -----Original Message-----
> From: Craig Cardimon [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, March 15, 2005 3:28 PM
> To: [email protected]; ActivePerl
> Subject: Keyword search is dragging
> 
> 
> I'm searching large ASCII files for keywords. The keywords are part of
> section headings. These headings are in all caps on lines by themselves.
> 
> The files sometimes contain HTML tags. My logic handles this well
> enough, but combs through the HTML very slowly. I'm dealing with tens of
> thousands of files, so speed counts.
> 
> I thought I'd get around this by using HTML::TokeParser to remove any 
> HTML before I searched each file. But now the script processes EVERY 
> file slowly, taking a few seconds for each.
> 
> Any suggestions on how I might optimize the following code, or what I 
> could be doing better?
> 
> -- Craig
> 
> 
> # slurp file into variable
> {
>       local $/;
>       $wholefile = <IN>;
> }
> 
> # remove HTML tags from variable, leaving only text
> my $parser = HTML::TokeParser->new (\$wholefile);
> while (my $token = $parser->get_token)
> {
>       next unless $token->[0] eq 'T';
>       $wholefile2 = $wholefile2 . $token->[1];
> }
> 
> foreach $keyword (@all_keywords)
> {
>       my $re = qr
>       {
>        ( # start of $1 variable
>         ( # start of a group
>           (\w+[A-Z])+ # one or more words in caps
>           \s+ # one or more spaces
>         )* # zero or more groups
>         $keyword # the $keyword variable
>         \s+ # one or more spaces
>         AGREEMENT # the word "AGREEMENT"
>        ) # end of $1 variable
>       }x;
> 
>       my $wholeRE = qr{^\s*$re\s*$};
> 
>       if($wholefile2 =~ /$wholeRE/gm)
>       {
>               # proceed
>       }
> 
> }
> 
> 
> _______________________________________________
> Perl-Win32-Users mailing list
> [email protected]
> To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs