RE: PERL Pattern matching

Conor Lillis Tue, 07 Apr 2009 06:16:21 -0700

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Justin 
Allegakoen
Sent: 07 April 2009 13:54
To: [email protected]
Subject: Re: PERL Pattern matching


2009/4/7 Conor Lillis <[email protected]>:
> Hi all,
> I have a requirement to match a file against a number of possible 
> strings.
> I also need to retain the successfully matching elements, and based on 
> the match from the primary list, see if a corresponding secondary 
> match is also in the file.
>
> E.g., picture the list below:
>
> A       B
> ---------
> 1       a
> 2       b
> 3       c
> 4       d
> 5       e
> 6       f
>
> I need to match any entry from the first column against the file and, for
> each match, check for the corresponding entry from the second column.
> So if I match 3 and 5 in the file, I rescan and see if I can match c 
> or e in the same file.
>
> Here is a snippet of how I am matching anything from column A, and 
> then rescanning for only corresponding entries from column B.
> Is there a more efficient method than what I am doing here?
>
> while(<FILE>)
> {
>        foreach my $string (@strings)
>        {
>                if (grep /$string/i, $_)
>                {
>                        $primary++;
>                        if (!grep /$string/i, @matchindex){push(@matchindex, "$string");}
>                        print gmtime()."\"$file\" matched on primary - $string\n";
>                }
>        }
> }
> close(FILE);
> if (@matchindex)
> {
> #       GetSecondaries() reads array to get 2nd column entries for
> #       primary matches by splitting row on separators
>        my @matches = GetSecondaries(@matchindex);
>        open(FILE, $file) or die "Cannot open $file: $!";
>        LOOP: while(<FILE>)
>        {
>                foreach my $string (@matches)
>                {
>                        if (grep /$string/i, $_)
>                        {
>                                $secondarycounter++;
>                                print gmtime()."********\t: Also matched on secondary - $string\n";
>                                logger("$string,$file");        # Logs output to a log file
>                                last LOOP;
>                        }
>                }
>        }
>        close(FILE);
> }

I doubt your data is as simple as you describe, which is why I miss the point 
of scanning the data a second time after the first match. Based on your 
example, is there a possibility that, say, 3 from column A could relate to 
anything other than 'c'?

Looking at your sample, the grep with the i modifier inside the foreach is a 
prime candidate for replacement with a hash and the exists function. Have a 
look at perldsc, which may give you some ideas of how your data structure 
should really appear.
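
A minimal sketch of what the hash-and-exists suggestion could look like. The
pair table %secondaries_for and the helpers remember_match() and
secondaries_for_seen() are hypothetical names, not part of the original
script, and the real GetSecondaries() may work differently. Keys are
lower-cased as an assumption, to mimic the /i matching in the sample.

```perl
use strict;
use warnings;

# Hypothetical pair table: primary string => list of secondary strings
# (a hash of arrays, as described in perldsc).
my %secondaries_for = (
    '3' => ['c'],
    '5' => ['e', 'ee'],
);

# Record a primary match once. Returns 1 the first time, 0 for a repeat.
# exists() is a single hash lookup, unlike grep over the whole @matchindex.
sub remember_match {
    my ($seen, $string) = @_;
    my $key = lc $string;
    return 0 if exists $seen->{$key};
    $seen->{$key} = 1;
    return 1;
}

# Collect the secondary strings for every primary recorded so far.
sub secondaries_for_seen {
    my ($seen, $table) = @_;
    return map { @{ $table->{$_} || [] } } sort keys %$seen;
}

my %seen;
remember_match(\%seen, $_) for qw(3 5 3);
my @matches = secondaries_for_seen(\%seen, \%secondaries_for);
print "@matches\n";
```

The hash replaces only the duplicate check and the primary-to-secondary
lookup; the matching of each string against the line is unchanged here.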

If that doesn't help and the others on the list can't see what you're after, 
then sanitise some of the real data (I'm sure it's confidential, looking at 
your e-mail address ; ) and re-describe the desired end result - which most 
probably involves account numbers and the number of times they have used their 
debit facility : p

Just in
_______________________________________________
ActivePerl mailing list
[email protected]
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs

Justin,
Thanks for the response. I am scanning randomly formatted text files to see if 
any of the elements in column A appear. For those that do appear, I am 
rescanning the file for the corresponding values in column B. In some cases 
there are multiple values in column B for a single value in column A; I handle 
that with the function GetSecondaries(), which is not included here. From 
earlier debugging, that part is working fine (accurately and quickly).

The reason for the rescan is that, by the time the while(<FILE>) loop has told 
me which column A values matched, I am already at the end of the file. I need 
to know all matches for column A before I can look for the related values in 
column B, so I read the file a second time.
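
One way to avoid reopening the file, sketched under the assumption that each
file fits comfortably in memory (plausible for files of around 2 MB): read it
once into a string and run both rounds of matching against that string.
slurp_file() and strings_found_in() are hypothetical helper names. Note that
\Q...\E treats every search string literally, whereas the original code
interpolates the strings as regular expressions.

```perl
use strict;
use warnings;

# Read a whole file into one string.
sub slurp_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    local $/;                 # undef record separator: read everything at once
    my $text = <$fh>;
    close($fh);
    return defined $text ? $text : '';
}

# Return the strings from @$candidates that occur in $text, ignoring case.
# \Q...\E quotes regex metacharacters so each string is matched literally.
sub strings_found_in {
    my ($text, $candidates) = @_;
    return grep { $text =~ /\Q$_\E/i } @$candidates;
}

# Intended use, with the names from the original script:
#   my $text      = slurp_file($file);
#   my @primary   = strings_found_in($text, \@strings);
#   my @secondary = strings_found_in($text, [ GetSecondaries(@primary) ]);
```

With this shape each search string is tested once against the whole buffer
rather than once per line, and the second round needs no second read.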

Part of the issue may be that I have 130 "pairs" of data to scan for. However, 
the analysis of a 2 MB file is taking 70 seconds on the column A values alone, 
which suggests that grepping against the $_ variable while reading the file is 
not economical. My server is a new Win2003 file server with 3GB RAM and 4 CPUs, 
so hardware performance should really not be a bottleneck. According to Task 
Manager, Perl seems to sit on one CPU: CPU usage doesn't exceed 25%, and RAM 
utilisation is also very low.

What I guess I am really hoping for is a more economical way to read the file 
and match any of the 130 primary elements of the array against it, while 
retaining the matching values for re-use in the 2nd loop. 
Perhaps grep is not the most appropriate mechanism for this type of 
retrieval?
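
A possible alternative to grepping each of the 130 strings against every line
is to compile them into a single alternation and let the capture group report
which string matched. This is only a sketch: build_matcher() and scan_handle()
are hypothetical names, the strings are assumed to be literal text (hence
quotemeta), and the list is assumed to be non-empty.

```perl
use strict;
use warnings;

# Build one case-insensitive regex that matches any of the given strings.
# Longer strings come first so that "abc" is preferred over "ab" at the same
# position; quotemeta makes every string match literally.
sub build_matcher {
    my @strings = @_;
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } @strings;
    return qr/($alternation)/i;
}

# Scan a filehandle once and return a hash of the distinct strings matched
# (lower-cased) mapped to the number of times each occurred.
sub scan_handle {
    my ($fh, $matcher) = @_;
    my %count;
    while (my $line = <$fh>) {
        $count{ lc $1 }++ while $line =~ /$matcher/g;
    }
    return %count;
}
```

The keys of the returned hash play the role of @matchindex. One difference
from the per-string loop: matches found this way do not overlap, so a string
that only ever occurs inside a longer listed string is not reported on its
own.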

It is an unusual request, I agree. What I can do is generate a test data file 
and upload the whole script & data file for completeness, if that is any use 
in clarifying what I am trying to achieve.

Conor



