Re: Comparing two files of 8million lines/rows ...

Mumia W. Wed, 16 Aug 2006 23:31:27 -0700

On 08/16/2006 04:35 PM, [EMAIL PROTECTED] wrote:

Hi all,


I have two database tables, one is local and one is on a WAN. They are supposed
to be in-sync but they at the moment, they are not. There are 8million+ plus
rows on this table.

I tried to do SELECT EMPNO FROM EMP WHERE EMPNO NOT IN ( SELECT EMPNO FROM
[EMAIL PROTECTED] ), leave that running for hours and all thru the night and 
guess
what, am still waiting for the result to come out ...

So what I decided is I want to extract the records into a flat file and then
write a Perl script to skim thru each line and check whether it exists on the
other file. While there will be 8million+ lines, the file will not be big
beacuse am only extracting one-column from the table.

Does anyone have an existing Perl code that does a similar thing like this
already? It will be much appreciated if you can send it to me and then I will
just modify it accordingly.

Example logic that I have is this:

FILE1:
MICKEY
MINNIE
DONALD
GOOFY
PLUTO

FILE2:
MICKEY
MINNIE
DONALD
GOOFY
PLUTO
BUGS-BUNNY

So search FILE1 for all line entries of FILE2 then output whoever does not exist
into FILE3. So after running the script, I should have ...

FILE3:
BUGS-BUNNY

What I currently have is that I read all of FILE2's entries into an array? Read
FILE1 one line at a time using m/// and if there is no m///, print that to
FILE3.

It seems to work fine for 1000 lines of entries, but am not particularly sure
how long will that take for 8million+ rows, not even sure if I can create an
array to contain 8million+ plus rows, if I can't, then am back to doing this on
the database instead. Another option am looking at is just to read FILE1 one
line at a time and do a grep "$string_to_search" FILE2 but I do not know how to
do a grep-like syntax against a file on Perl especially if the search string is
a variable.

Why I prefer using a script is so am not putting loads into the database not to
mention that I can put more logic into the script than on the SQL statement.

Any advise or other options will be very much appreciated .... Thanks in
advance.

Obviously, you want to do this as efficiently as possiblebecause of the humongous size of the data. Hashes are thefastest structures for letting you know if some data hasalready been seen.

Your problem is core. Do you have enough core memory to readall of the data one of the table columns into memory? If so,then the solution be almost trivial; if not, then it'sprobably not trivial, but also not hard.

My advice is to attempt to suck the entire column for onetable into memory; FILE1 should become hash keys (with emptyvalues). Then you would open FILE2, loop through the lines(records) and output any record that does not appear in the hash.

However, if you can't get the entire FILE1 into memory, thenI'd suggest converting FILE1 into a berkeley database andusing DB_File to tie it to a hash; from there on, the solutionwould be like the above.



HTH



--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>

Re: Comparing two files of 8million lines/rows ...

Reply via email to