RE: PERL Pattern matching

Brian Raven Tue, 07 Apr 2009 09:38:27 -0700

Conor Lillis <> wrote:
> Hi all,
> I have a requirement to match a file against a number of possible
> strings. 
> I also need to retain the successfully matching elements, and based
> on the match from the primary list, see if a corresponding secondary
> match is also in the file.  
> 
> E.g.. Picture the list below
> 
> A     B
> ---------
> 1     a
> 2     b
> 3     c
> 4     d
> 5     e
> 6     f
> 
> I need to match any of the first column in the file, and for any
> matches in the first row match the corresponding entry for the second
> column.  
> So if I match 3 and 5 in the file, I rescan and see if I can match c
> or e in the same file. 
> 
> Here is a snippet of how I am matching anything from column A, and
> then rescanning for only corresponding entries from column B. 
> Is there a more efficient method than what I am doing here?


Probably, but if it works, and is fast enough, why bother.

So, assuming that it isn't fast enough (not sure if it works, as I am
not sure if I understand what you are trying to do), here is where I get
to pick holes in your code. (:-) in cast it isn't obvious).

First, I realise that this is just a code fragment, but you do have "use
strict; use warnings;" at the start, don't you? If not consider this the
first comment. Note that Outlook has decided to wrap some of your lines.

> 
> while(<FILE>)
> {
>       foreach my $string (@strings)
>       {
>               if (grep /$string/i, $_)

grep is wasted here. "if (/$string/i)" is more Perl-ish. Also it would
probably be more efficient to perform a single match for the whole set
of strings, than multiple matches for each one. See below.

>               {
>                       $primary =++$primary;

Weird! What's wrong with just "++$primary;". That's the equivalent of
saying "$primary += 1; $primary = $primary".

>                       if (!grep /$string/i,
> @matchindex){push(@matchindex, "$string");}

That's more appropriate use of grep, but your quoting of $string is
unnecessary. Useless quoting is generally frowned upon.

>                       print gmtime()."\"$file\" matched on primary -
$string\n";
>               }
>       }
> }
> close(FILE);

No need to close the file here.

> if (@matchindex)
> {
> #     GetSecondaries() Reads array to get 2nd column entries for
> primary matches by splitting row on seperators
>       my @matches = GetSecondaries(@matchindex);
>       open(FILE, $file);

You could just seek back to the beginning of the file rather than open
the file again. Also, you should always check that the open call was
successful.

>       LOOP: while(<FILE>)
>       {
>               foreach my $string (@matches)
>               {
>                       if (grep /$string/i, $_)
>                       {
>                               $secondarycounter = ++$secondarycounter;

As above.

>                               print gmtime()."********\t: Also matched
on secondary -
>                               $string\n"; logger("$string,$file");
#
> Logs output to a log file
>                               last LOOP;

Are you sure about that? You exit from the loop after the very first
secondary match. What about the others?

>                       }
>               }
> }
> close(FILE);

If my guess about what you are trying to do is right, the following
might be a bit more efficient.

-------------------------------------------------------
use strict;
use warnings;
use Fcntl qw{:seek};

my %match_table = (1 => "a",
                   2 => "b",
                   3 => "c",
                   4 => "d",
                   5 => "e",
                   6 => "f");

my $fn = shift;
die "Expecting filename\n" unless $fn;

open my $fd, "<", $fn or die "Failed to open $fn: $!\n";

my @primaries = find_matches($fd, sort keys %match_table);
@primaries > 0 or die "No primaries found in $fn\n";
print "Found primaries: @primaries\n";

seek $fd, 0, SEEK_SET or die "Failed to seek: $!\n";
$. = undef;

my @secondaries = find_matches($fd, sort @match_tab...@primaries});
@secondaries > 0 or die "No secondaries found\n";
print "Found secondaries: @secondaries\n";

close $fd;

sub find_matches {
    my $fd = shift;
    my @strings = @_;
    my @matches;
    return () unless @strings > 0;
    my $matchRE = makeRE(@strings);
    print "Start looking at line $.\n";
    while (<$fd>) {
        if (/($matchRE)/i) {
            my $match_str = $1;
            push @matches, $match_str;
            @strings = grep {$_ ne $match_str} @strings;
            $matchRE = makeRE(@strings);
            # No point continuing if nothing left to look for
            last unless $matchRE;
        }
    }
    print "Stopped looking at line $.\n";
    return @matches;
}

# Make regexp to match a list of words.
sub makeRE {
    return undef unless @_ > 0;
    my $str = join "|", map {"(?:$_)"} @_;
    return qr{$str};
}
-------------------------------------------------------

You may be able to do better with a single pass through the file,
depending upon the size of your file and the table of strings you are
looking for.

HTH

-- 
Brian Raven 

This e-mail may contain confidential and/or privileged information. If you are 
not the intended recipient or have received this e-mail in error, please advise 
the sender immediately by reply e-mail and delete this message and any 
attachments without retaining a copy.
Any unauthorised copying, disclosure or distribution of the material in this 
e-mail is strictly forbidden.

_______________________________________________
ActivePerl mailing list
[email protected]
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs

RE: PERL Pattern matching

Reply via email to