-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Justin Allegakoen
Sent: 07 April 2009 13:54
To: [email protected]
Subject: Re: PERL Pattern matching

2009/4/7 Conor Lillis <[email protected]>:
> Hi all,
> I have a requirement to match a file against a number of possible
> strings.
> I also need to retain the successfully matching elements, and based on
> the match from the primary list, see if a corresponding secondary
> match is also in the file.
>
> E.g.. Picture the list below
>
> A    B
> ---------
> 1    a
> 2    b
> 3    c
> 4    d
> 5    e
> 6    f
>
> I need to match any of the first column in the file, and for any
> matches in the first row match the corresponding entry for the second column.
> So if I match 3 and 5 in the file, I rescan and see if I can match c
> or e in the same file.
>
> Here is a snippet of how I am matching anything from column A, and
> then rescanning for only corresponding entries from column B.
> Is there a more efficient method than what I am doing here?
>
> while(<FILE>)
> {
>     foreach my $string (@strings)
>     {
>         if (grep /$string/i, $_)
>         {
>             $primary =++$primary;
>             if (!grep /$string/i, @matchindex){push(@matchindex, "$string");}
>             print gmtime()."\"$file\" matched on primary - $string\n";
>         }
>     }
> }
> close(FILE);
> if (@matchindex)
> {
>     # GetSecondaries() Reads array to get 2nd column entries for primary matches by splitting row on seperators
>     my @matches = GetSecondaries(@matchindex);
>     open(FILE, $file);
>     LOOP: while(<FILE>)
>     {
>         foreach my $string (@matches)
>         {
>             if (grep /$string/i, $_)
>             {
>                 $secondarycounter = ++$secondarycounter;
>                 print gmtime()."********\t: Also matched on secondary - $string\n";
>                 logger("$string,$file"); # Logs output to a log file
>                 last LOOP;
>             }
>         }
>     }
>     close(FILE);

I doubt your data is as simplistic as you describe which is why I miss the point of 'rescanning' the data for a 2nd time after the first match. Based on your example is there a possibility that say 3 from column A could relate to anything other than 'c'?

Looking at your sample the i modifier in the grep in the foreach is a prime candidate for hashes and the exists function.
Have a look at perldsc, which may give you some ideas of how your data structure should really appear. If that doesn't help and the others on the list can't see what you're after, then sanitise some of the real data (I'm sure it's confidential looking at your e-mail address ; ) and re-describe the end desired result - which most probably involves account numbers and the number of times they have used their debit facility : p

Just in
_______________________________________________
ActivePerl mailing list
[email protected]
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs

Justin

Thanks for the response.
I am scanning randomly formatted text files to see if any of the elements in column A appear. For those that do appear, I am rescanning the file for the corresponding values in column B. In some cases there are multiple matches in column B for a value in column A, but I manage that using the function GetSecondaries(), which is not included here; from previous debugging, that part is working fine (accurately and quickly).

The reason for the rescan is that, using the while(<FILE>) read statement, I am already at the end of the file by the time I know how many matches for column A I have, and for all of those matches I have to rescan the file for the related values in column B. I need to know all matches for col. A before I can rescan for column B matches.

Part of the issue may be that I have 130 "pairs" of data to scan for. However, the analysis on a 2 MB file is taking 70 seconds just on column A values, which suggests that grepping against the $_ variable while reading the file is not economical. My server is a new Win2003 file server with 3GB RAM and 4 CPUs, so hardware performance should really not be a bottleneck. Looking at Task Manager, Perl seems to sit on one CPU: the CPU usage doesn't exceed 25% (RAM utilisation is also very low).

What I guess I am really hoping for is a more economical method to read the file and match any of the 130 primary elements of the array against the file, while retaining the matching values for re-use in the 2nd loop. Perhaps grep is not the most appropriate mechanism to use for this type of retrieval?

It is an unusual request, I agree. What I can do is generate a test data file and upload the whole script & data file for completeness, if that is any use in clarifying what I am trying to achieve.

Conor
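
One possible direction for the question above, offered as a minimal sketch under stated assumptions rather than a drop-in replacement. The original loop interpolates a different $string into /$string/i for every string on every line, which likely forces Perl to recompile the pattern each time; with 130 strings and a 2 MB file that adds up. The sketch instead reads the file into memory once, tests each primary string once against the whole text, and only then tests the secondaries of the primaries that matched, so no second pass over the file is needed. Assumptions: the strings are literal text rather than regular expressions, the file fits comfortably in memory, and a hash of arrays (as described in perldsc) stands in for GetSecondaries(), which is not shown in the thread. The names %secondaries_for, match_pairs and slurp, and the sample values, are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical pairs table: column A string => list of column B strings.
# (Illustrative values only; the real 130 pairs are not shown in the thread.)
my %secondaries_for = (
    'alpha' => ['one'],
    'beta'  => ['two', 'deux'],
    'gamma' => ['three'],
);

# Returns a hash ref: matched primary => array ref of those of its
# secondaries that also occur in the text.
# Matching is case-insensitive and treats every string literally (\Q...\E).
sub match_pairs {
    my ($text, $pairs) = @_;
    my %found;
    for my $primary (sort keys %$pairs) {
        my $primary_re = qr/\Q$primary\E/i;    # compiled once per string
        next unless $text =~ $primary_re;
        # Only the secondaries of a matched primary are ever tested.
        $found{$primary} =
            [ grep { $text =~ /\Q$_\E/i } @{ $pairs->{$primary} } ];
    }
    return \%found;
}

# Read a whole file into one string (reasonable for files of a few MB).
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;    # undef record separator: read everything in one go
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Usage with an in-memory sample instead of a real file:
my $sample = "line with ALPHA\nsomething about Two and beta\nand one more\n";
my $found  = match_pairs($sample, \%secondaries_for);
for my $primary (sort keys %$found) {
    print "$primary: @{ $found->{$primary} }\n";
}
```

For a real file the call would be match_pairs(slurp($file), \%secondaries_for). A primary that matched with no secondary present shows up as a key with an empty list, so exists $found->{$primary} answers the column A question and the list answers the column B question. If per-line reporting is still needed, the same qr// objects could be built once up front and reused inside a while(<FILE>) loop.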
