RE: PERL Pattern matching

Conor Lillis Wed, 08 Apr 2009 07:22:54 -0700

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Brian
Raven
Sent: 07 April 2009 17:38
To: [email protected]
Subject: RE: PERL Pattern matching

Conor Lillis <> wrote:
> Hi all,
> I have a requirement to match a file against a number of possible 
> strings.
> I also need to retain the successfully matching elements, and based on

> the match from the primary list, see if a corresponding secondary 
> match is also in the file.
> 
> E.g.. Picture the list below
> 
> A     B
> ---------
> 1     a
> 2     b
> 3     c
> 4     d
> 5     e
> 6     f
> 
> I need to match any of the first column in the file, and for any 
> matches in the first row match the corresponding entry for the second 
> column.
> So if I match 3 and 5 in the file, I rescan and see if I can match c 
> or e in the same file.
> 
> Here is a snippet of how I am matching anything from column A, and 
> then rescanning for only corresponding entries from column B.
> Is there a more efficient method than what I am doing here?

Probably, but if it works, and is fast enough, why bother.

So, assuming that it isn't fast enough (not sure if it works, as I am
not sure if I understand what you are trying to do), here is where I get
to pick holes in your code. (:-) in cast it isn't obvious).

First, I realise that this is just a code fragment, but you do have "use
strict; use warnings;" at the start, don't you? If not consider this the
first comment. Note that Outlook has decided to wrap some of your lines.

> 
> while(<FILE>)
> {
>       foreach my $string (@strings)
>       {
>               if (grep /$string/i, $_)

grep is wasted here. "if (/$string/i)" is more Perl-ish. Also it would
probably be more efficient to perform a single match for the whole set
of strings, than multiple matches for each one. See below.

>               {
>                       $primary =++$primary;

Weird! What's wrong with just "++$primary;". That's the equivalent of
saying "$primary += 1; $primary = $primary".

>                       if (!grep /$string/i,
> @matchindex){push(@matchindex, "$string");}

That's more appropriate use of grep, but your quoting of $string is
unnecessary. Useless quoting is generally frowned upon.

>                       print gmtime()."\"$file\" matched on primary -
$string\n";
>               }
>       }
> }
> close(FILE);

No need to close the file here.

> if (@matchindex)
> {
> #     GetSecondaries() Reads array to get 2nd column entries for
> primary matches by splitting row on seperators
>       my @matches = GetSecondaries(@matchindex);
>       open(FILE, $file);

You could just seek back to the beginning of the file rather than open
the file again. Also, you should always check that the open call was
successful.

>       LOOP: while(<FILE>)
>       {
>               foreach my $string (@matches)
>               {
>                       if (grep /$string/i, $_)
>                       {
>                               $secondarycounter = ++$secondarycounter;

As above.

>                               print gmtime()."********\t: Also matched
on secondary -
>                               $string\n"; logger("$string,$file");
#
> Logs output to a log file
>                               last LOOP;

Are you sure about that? You exit from the loop after the very first
secondary match. What about the others?

>                       }
>               }
> }
> close(FILE);

If my guess about what you are trying to do is right, the following
might be a bit more efficient.

-------------------------------------------------------
use strict;
use warnings;
use Fcntl qw{:seek};

my %match_table = (1 => "a",
                   2 => "b",
                   3 => "c",
                   4 => "d",
                   5 => "e",
                   6 => "f");

my $fn = shift;
die "Expecting filename\n" unless $fn;

open my $fd, "<", $fn or die "Failed to open $fn: $!\n";

my @primaries = find_matches($fd, sort keys %match_table); @primaries >
0 or die "No primaries found in $fn\n"; print "Found primaries:
@primaries\n";

seek $fd, 0, SEEK_SET or die "Failed to seek: $!\n"; $. = undef;

my @secondaries = find_matches($fd, sort @match_tab...@primaries});
@secondaries > 0 or die "No secondaries found\n"; print "Found
secondaries: @secondaries\n";

close $fd;

sub find_matches {
    my $fd = shift;
    my @strings = @_;
    my @matches;
    return () unless @strings > 0;
    my $matchRE = makeRE(@strings);
    print "Start looking at line $.\n";
    while (<$fd>) {
        if (/($matchRE)/i) {
            my $match_str = $1;
            push @matches, $match_str;
            @strings = grep {$_ ne $match_str} @strings;
            $matchRE = makeRE(@strings);
            # No point continuing if nothing left to look for
            last unless $matchRE;
        }
    }
    print "Stopped looking at line $.\n";
    return @matches;
}

# Make regexp to match a list of words.
sub makeRE {
    return undef unless @_ > 0;
    my $str = join "|", map {"(?:$_)"} @_;
    return qr{$str};
}
-------------------------------------------------------

You may be able to do better with a single pass through the file,
depending upon the size of your file and the table of strings you are
looking for.

HTH

--
Brian Raven 

This e-mail may contain confidential and/or privileged information. If
you are not the intended recipient or have received this e-mail in
error, please advise the sender immediately by reply e-mail and delete
this message and any attachments without retaining a copy.
Any unauthorised copying, disclosure or distribution of the material in
this e-mail is strictly forbidden.

_______________________________________________

Brian
Thanks for that, your makeRE function had great benefit performance
wise, but your other pointers are also appreciated, aswell as the time
you took to write the script at the bottom.
I got the scan on a sample file from 80 seconds down to 5 using this and
other tuning you suggested. I have 140 strings in column A, aswell as
multiple possible match strings in column B per row, which is why I
couldn't use a hash.

My input file is more like :-
Column A    Column B
--------------------
[White]     [dog]
[Black]     [cat]
[Grey]      [fox,cow] (match on grey and either fox or cow)

Thanks also to Justin and Martin for responding, all have provided
useful tips!

Regards,
Conor Lillis
---------------------------------------
Senior Systems Administrator,
Group Network Services,
tel : +353-1-616-2540

Anglo Irish Bank
www.angloirishbank.ie
---------------------------------------

**********************************************************************
Private, Confidential and Privileged. This e-mail and any files and attachments 
transmitted with it are confidential and/or privileged. They are intended 
solely for the use of the intended recipient. The content of this e-mail and 
any file or attachment transmitted with it may have been changed or altered 
without the consent of the author. If you are not the intended recipient, 
please note that any review, dissemination, disclosure, alteration, printing, 
circulation or transmission of this e-mail and/or any file or attachment 
transmitted with it, is prohibited and may be unlawful. If you have received 
this e-mail or any file or attachment transmitted with it in error please 
notify Anglo Irish Bank Corporation Limited, Stephen Court, 18/21 St Stephen's 
Green, Dublin 2, Ireland, telephone no: +353-1-6162000. 
Directors: D O'Connor (Chairman), F Daly, A Dukes, M Keane, D Quilligan.
Registered Office: Stephen Court, 18/21 St Stephen's Green, Dublin 2 Ireland
Registered in Ireland: No 22045
Anglo Irish Bank Corporation Limited is regulated by the Financial Regulator. 
Anglo Irish Bank Corporation Limited (trading as Anglo Irish Bank Private 
Banking) is regulated by the Financial Regulator. Anglo Irish Assurance Company 
Limited is regulated by the Financial Regulator. 
**********************************************************************

_______________________________________________
ActivePerl mailing list
[email protected]
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs

RE: PERL Pattern matching

Reply via email to