[EMAIL PROTECTED] wrote:
>
> I have two database tables, one local and one on a WAN. They are
> supposed to be in sync, but at the moment they are not. There are
> 8 million+ rows in this table.
>
> I tried SELECT EMPNO FROM EMP WHERE EMPNO NOT IN ( SELECT EMPNO FROM
> [EMAIL PROTECTED] ), left that running for hours and all through the night,
> and guess what, I am still waiting for the result to come out ...
>
> So what I decided is that I want to extract the records into a flat file and
> then write a Perl script to skim through each line and check whether it
> exists in the other file. While there will be 8 million+ lines, the file
> will not be big because I am only extracting one column from the table.
>
> Does anyone have existing Perl code that does something similar to this
> already? It will be much appreciated if you can send it to me and then I
> will just modify it accordingly.
>
> Example logic that I have is this:
>
> FILE1:
> MICKEY
> MINNIE
> DONALD
> GOOFY
> PLUTO
>
> FILE2:
> MICKEY
> MINNIE
> DONALD
> GOOFY
> PLUTO
> BUGS-BUNNY
>
> So search FILE1 for each line entry of FILE2, then output whichever entries
> do not exist into FILE3. So after running the script, I should have ...
>
> FILE3:
> BUGS-BUNNY
>
> What I currently have is this: read all of FILE2's entries into an array,
> read FILE1 one line at a time using m///, and if there is no match, print
> that line to FILE3.
>
> It seems to work fine for 1000 lines of entries, but I am not sure how long
> that will take for 8 million+ rows, and not even sure I can create an array
> to hold 8 million+ rows; if I can't, then I am back to doing this on the
> database instead. Another option I am looking at is to read FILE1 one line
> at a time and do a grep "$string_to_search" FILE2, but I do not know how to
> do grep-like matching against a file in Perl, especially if the search
> string is a variable.
>
> The reason I prefer using a script is so I am not putting load on the
> database, not to mention that I can put more logic into the script than
> into the SQL statement.
>
> Any advice or other options will be very much appreciated ... Thanks in
> advance.

No need for flat files if you are using DBI. Read the EMPNO values from
[EMAIL PROTECTED] first, as it is a) slower, being on the network, and b) a
shorter list, and put these values into a Perl hash. 8 million hash entries
are unlikely to take more than about 300MB of memory, I would guess, which
should be fine on any recent PC. Fetching one value at a time will prevent
there being two copies of the data in memory at once (an array of retrieved
values and the derived hash).

Then just read each EMPNO from EMP and print it out if it isn't in the hash.
This code fragment may help.
Cheers,

Rob


use strict;
use warnings;

use DBI;

my ($dsn, $user, $pass);  # Assign these as appropriate
my $dbh = DBI->connect($dsn, $user, $pass, { RaiseError => 1 });

# Build a lookup hash of every EMPNO on the remote table. '[EMAIL PROTECTED]'
# stands for the remote EMP table over your database link, which was
# obscured in the archive.
my %employee;

my $sth = $dbh->prepare('SELECT EMPNO FROM [EMAIL PROTECTED]');
$sth->execute;
while (my ($empno) = $sth->fetchrow_array) {
  $employee{$empno} = 1;
}

# Scan the local table and print any EMPNO missing from the hash.
# 'exists' is used so that a falsy EMPNO (such as 0) is still honoured.
$sth = $dbh->prepare('SELECT EMPNO FROM EMP');
$sth->execute;
while (my ($empno) = $sth->fetchrow_array) {
  print "$empno\n" unless exists $employee{$empno};
}
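
If the flat-file route still turns out to be necessary, the same hash idea
works on plain text files. Below is a minimal sketch, assuming FILE1 and
FILE2 hold one EMPNO per line and the goal is the entries of FILE2 that are
missing from FILE1; the names file1.txt, file2.txt and file3.txt are
placeholders.


use strict;
use warnings;

# Build a hash of every entry in FILE1.
my %seen;
open my $f1, '<', 'file1.txt' or die "Cannot open file1.txt: $!";
while (my $line = <$f1>) {
    chomp $line;
    $seen{$line} = 1;
}
close $f1;

# Scan FILE2 and write anything not present in FILE1 to FILE3.
open my $f2,  '<', 'file2.txt' or die "Cannot open file2.txt: $!";
open my $out, '>', 'file3.txt' or die "Cannot open file3.txt: $!";
while (my $line = <$f2>) {
    chomp $line;
    print $out "$line\n" unless exists $seen{$line};
}
close $f2;
close $out;


Each hash lookup is effectively the grep-like check you were asking about,
except that it is a constant-time lookup instead of a scan of the whole file,
so 8 million rows should be manageable as long as the hash fits in memory.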

