If your system's memory is large enough to hold the smaller dataset,
then as others have said, working with hashes is the way to go:

    read all of small dataset into hash
    while another record in large dataset
        if key for record exists in hash
            delete hash{key}          << result is thus an XOR of keys
        else
            add record to hash
    write hash as the output file
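
In Perl, that idea might look something like the untested sketch below
(the file names are made up, and I'm assuming each record is a single
line that serves as its own key, as in your example):

    #!/usr/bin/perl -w
    use strict;

    my %seen;

    # read all of the small dataset into the hash
    open SMALL, 'small.txt' or die "small.txt: $!";
    while (<SMALL>) {
        chomp;
        $seen{$_} = 1;
    }
    close SMALL;

    # walk the large dataset, deleting keys that appear in both
    open LARGE, 'large.txt' or die "large.txt: $!";
    while (<LARGE>) {
        chomp;
        if (exists $seen{$_}) {
            delete $seen{$_};     # in both files, so drop it
        } else {
            $seen{$_} = 1;        # only in the large file so far
        }
    }
    close LARGE;

    # whatever is left is the XOR of the two key sets
    open OUT, '>out.txt' or die "out.txt: $!";
    print OUT "$_\n" for sort keys %seen;
    close OUT;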

If the amount of satellite data is too great for this approach to work,
I would use a two-stage approach: first read only the key fields,
using "file1" and "file2" as the appropriate hash values, then build
the output file by looking up each record's key in the hash. There
may be more efficient ways to do this, but I like to keep my thoughts
simple.
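
A rough sketch of that two-stage idea, for what it's worth (again
untested, file names made up, and assuming the key is the first
whitespace-separated field of each record):

    #!/usr/bin/perl -w
    use strict;

    # stage 1: keys only, remembering which file each unique key is from
    my %from;
    for my $file ('file1.txt', 'file2.txt') {
        open IN, $file or die "$file: $!";
        while (<IN>) {
            my ($key) = split;         # first field as the key
            if (exists $from{$key}) {
                delete $from{$key};    # key is in both files, drop it
            } else {
                $from{$key} = $file;
            }
        }
        close IN;
    }

    # stage 2: re-read the full records, keep those whose key survived
    open OUT, '>out.txt' or die "out.txt: $!";
    for my $file ('file1.txt', 'file2.txt') {
        open IN, $file or die "$file: $!";
        while (<IN>) {
            my ($key) = split;
            print OUT $_ if exists $from{$key} && $from{$key} eq $file;
        }
        close IN;
    }
    close OUT;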

Will




----- Original Message -----
From: Steve Whittle <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, June 06, 2001 8:46 AM
Subject: Comparing two files


> Hi,
>
> I'm trying to write a script that removes duplicates between two
> files and writes the unique values to a new file. For example, I have
> one file, file 1, with the following:
>
> red
> green
> blue
> black
> grey
>
> and another file 2:
>
> black
> red
>
> and I want to create a new file that contains:
>
> green
> blue
> grey
>
> I have written a script that takes each entry in file 1 and then
> reads through file 2 to see if it exists there; if not, it writes it
> to a new file. If there is a duplicate, nothing is written to the new
> file. The real file 1 I'm dealing with has more than 2 million rows
> and the real file 2 has more than 100,000 rows, so I don't think my
> method is very efficient. I've looked through the web and Perl
> references and can't find an easier way. Am I missing something? Any
> ideas?
>
> Thanks,
>
> Steve Whittle
>
>
